Classification of Handwriting Vectors (Writer Identification)


Data Mining

Data mining algorithms are applied in order to find similarities among data. In the eNoteHistory project, there are many feature vectors between which the aim is also to find similarity. A feature vector consists of n features (about 80) and each feature represents an axis in an n-dimensional space. After adding the feature values for each feature axis, a feature vector is represented by a point in this n-dimensional space. The aim is to compute the distance between these points, the distance between the corresponding feature vectors. For this mean, a distance function calculates a scalar value for each pair of feature vectors. The choice of the distance function depends on the type of input data, its data types and the given problem.

Description of the Distance Function

In the eNoteHistory project, the distances between feature vectors are calculated by means of the "Hamming-Distance" because it achieves the best results for the given problem. First of all, an evaluation function is needed to map the distance between two features to a value between 0 and 1. This evaluation function depends on the feature's type. In one case, a simple value comparison (equal or not) is sufficient. In other cases, the distances between all values of a certain feature are stored in a distance matrix that was heuristically created by musicologists. In case the comparison between two features must be computed using more than one level, the distances for each level will be determined, afterwards added and the relationship between the levels obtained from another distance matrix will be taken into account. Two values whose paths are different in only one level are relatively similar and will receive a low distance. Distance matrices are more efficient because they need calculation time in the beginning, but no calculation is required during runtime.

After determining the distances for each feature pair, the average of all distances will be calculated. Each feature will additionally get a weight, which represents the importance, namely its relevance.

If the distance between two feature vectors is low, then the handwriting characteristics they stand for are similar. Using clustering methods, the existing feature vectors can be separated into different classes that represent similar handwriting characteristics and at best, exactly one writer. In the eNoteHistory project, the classes are already known and the task is to classify new given feature vectors. We used the k-nearest-neighbor classification method that categorizes existing feature vectors using the "Hamming distance" and it learns with each step. Detailed and formal descriptions about the distance function can be found in the Master Thesis (in German) of Lars Milewski.

Procedure During Querying

When a user is about to identify an unknown handwritten music score, the user must manually determine the handwriting characteristics of their music score and then submit them to the system. All determined features of the corresponding handwriting form a feature vector. In order to identify the writer for the submitted query feature vector, the distances to all feature vectors in the database are calculated using the classification method k-nearest-neighor. All existing feature vectors in the database are selected if their distance to the query feature vector is less then the specified threshold. These selected feature vectors represent the most similar handwriting characteristics for the unknown handwritten music score. Afterwards, the writer who corresponds to the most feature vectors in the selected set of feature vectors will be identified. In case the number of belonging feature vectors for two writers are equal, the writer, whose feature vector has the lowest distance to the query feature vector, will be taken. In the end, a sorted list of writers will be displayed. The following graphic shows the procedure for extracting and classifying handwriting characteristics.

Evaluation of the Distance Function

In order to achieve the best results for writer identifications, it is necessary to evaluate the applied distance function and its parameters. To serve this purpose, a scoring function for instance based clustering methods is used to measure cluster similarity. The distances between cluster instances shall be as small as possible, however, one tries to maximize the distance between instances from different clusters. The K-measure attempts to evaluate the system based on classification aspects. The resultant set of writers to be identified will be evaluated to determine whether it contains the correct writer. Thus, K is the ratio of all queries in which the writer was correctly identified.

Choice of Distance Function

We have tested and evaluated several distance functions. As previously mentioned above, in our writer identification system, no clustering methods are necessary to compute the clusters because all writer classes are already known. In order to evaluate the quality of the distance function and its parameters, a clustering algorithm is nevertheless executed on the already existing feature vectors. Doing so, allows us to see whether the result delivers exactly the already known classes. In this procedure, the "Haming distance" measured by the scoring function and the K-measure showed the best results.

Optimization of the Threshold

The scope of the result set of feature vectors depends on the threshold. Precision and Recall were used in order to meet the optimal suited threshold. Precision is a measure for quality of a result set and represents the proportion of feature vectors that are really relevant. In contrast, Recall is a measure of quantity of a result set and stands for the proportion of expected feature vectors which are really included in the result set. An optimal threshold was 0.13 with a Precision value of 85% and a Recall value of 70%.

Weight of Features

While calculating the distance between feature vectors, each feature receives a weight which represents the relevance of a feature in the distance calculation. Thus, these weights influence the distance between two feature vectors. In order to optimize a certain feature weight, one calculates and analyzes the distance without taking the corresponding feature into account. If the result has been improved, then the corresponding feature does not have much relevance to the distance calculation like its weight represents. In case the result has been worsened, the corresponding feature does indeed have more relevance and one must consider whether to increase its weight.

Furthermore, the distribution of the feature weights must be examined as well. To find a good compromise between a good scoring function and a good K-measure, a linear weight distribution in which the features have different weights increasing arithmetically, fits best for this purpose. In our prototype, we have taken a mixture of weights based either on the knowlegde of musicologists or on tests for finding weights.

Distance Matrices

The values being stored in the distance matrices define the similarities as well as the differences between two feature values. These values determined by musicologists also need to be optimized. The results could be improved a little while decreasing the original values. In our prototype, the orignal values were retained in most instances.

NULL Values

We classify NULL values for features into two different NULL value types:

Normally, the user cannot distinguish between these two different NULL values. Therefore, the decision concerning what type of NULL value is made automatically. Regardless, the results are better if as few NULL values as possible reside in the query feature vector.

Besides the distance, an ND-value is also calculated. The ND-value represents the proportion of features which are taken for the distance calculation. Since NULL values have influence on the results, the distance is more reliable, the higher the ND-value is.

Due to these optimization tests, an optimal configuration of the distance function and its parameters as well as those of the system could be found. Thus, we could improve our writer identification rate, increasing it up to 90%.