Classification of Handwriting Vectors (Writer Identification)
Data Mining
Data mining algorithms are applied in order to find similarities among data. In the eNoteHistory project, there are many feature vectors between which the
aim is also to find similarity. A feature vector consists of n features (about 80) and each feature represents an axis in an n-dimensional space. After
adding the feature values for each feature axis, a feature vector is represented by a point in this n-dimensional space. The aim is to compute the distance
between these points, the distance between the corresponding feature vectors. For this mean, a distance function calculates a scalar value for
each pair of feature vectors. The choice of the distance function depends on the type of input data, its data types and the given problem.
Description of the Distance Function
In the eNoteHistory project, the distances between feature vectors are calculated by means of the "Hamming-Distance" because it achieves the best results for
the given problem. First of all, an evaluation function is needed to map the distance between two features to a value between 0 and 1. This evaluation
function depends on the feature's type. In one case, a simple value comparison (equal or not) is sufficient. In other cases, the distances between
all values of a certain feature are stored in a distance matrix that was heuristically created by musicologists. In case the comparison between two features
must be computed using more than one level, the distances for each level will be determined, afterwards added and the relationship between the levels
obtained from another distance matrix will be taken into account. Two values whose paths are different in only one level are relatively similar and will
receive a low distance. Distance matrices are more efficient because they need calculation time in the beginning, but no calculation is required during
runtime.
After determining the distances for each feature pair, the average of all distances will be calculated. Each feature will additionally get a weight, which
represents the importance, namely its relevance.
If the distance between two feature vectors is low, then the handwriting characteristics they stand for are similar. Using clustering methods, the
existing feature vectors can be separated into different classes that represent similar handwriting characteristics and at best, exactly one writer. In the
eNoteHistory project, the classes are already known and the task is to classify new given feature vectors. We used the k-nearest-neighbor classification method
that categorizes existing feature vectors using the "Hamming distance" and it learns with each step.
Detailed and formal descriptions about the distance function can be found in the Master
Thesis (in German) of Lars Milewski.
Procedure During Querying
When a user is about to identify an unknown handwritten music score, the user must manually determine the handwriting characteristics of their music score and
then submit them to the system. All determined features of the corresponding handwriting form a feature vector. In order to identify the writer for the
submitted query feature vector, the distances to all feature vectors in the database are calculated using the classification method k-nearest-neighor. All
existing feature vectors in the database are selected if their distance to the query feature vector is less then the specified threshold. These selected
feature vectors represent the most similar handwriting characteristics for the unknown handwritten music score. Afterwards, the writer who corresponds to the
most feature vectors in the selected set of feature vectors will be identified. In case the number of belonging feature vectors for two writers are equal,
the writer, whose feature vector has the lowest distance to the query feature vector, will be taken. In the end, a sorted list of writers will be displayed.
The following graphic shows the procedure for extracting and classifying handwriting characteristics.
Evaluation of the Distance Function
In order to achieve the best results for writer identifications, it is necessary to evaluate the applied distance function and its parameters. To serve this
purpose, a scoring function for instance based clustering methods is used to measure cluster similarity. The distances between cluster instances shall be as
small as possible, however, one tries to maximize the distance between instances from different clusters. The K-measure attempts to evaluate the system
based on classification aspects. The resultant set of writers to be identified will be evaluated to determine whether it contains the correct writer. Thus, K is
the ratio of all queries in which the writer was correctly identified.
Choice of Distance Function
We have tested and evaluated several distance functions. As previously mentioned above, in our writer identification system, no clustering methods are necessary
to compute the clusters because all writer classes are already known. In order to evaluate the quality of the distance function and its parameters,
a clustering algorithm is nevertheless executed on the already existing feature vectors. Doing so, allows us to see whether the result delivers exactly the already
known classes. In this procedure, the "Haming distance" measured by the scoring function and the K-measure showed the best results.
Optimization of the Threshold
The scope of the result set of feature vectors depends on the threshold. Precision and Recall were used in order to meet the optimal suited threshold.
Precision is a measure for quality of a result set and represents the proportion of feature vectors that are really relevant. In contrast, Recall is a measure of
quantity of a result set and stands for the proportion of expected feature vectors which are really included in the result set. An optimal threshold was 0.13
with a Precision value of 85% and a Recall value of 70%.
Weight of Features
While calculating the distance between feature vectors, each feature receives a weight which represents the relevance of a feature in the distance
calculation. Thus, these weights influence the distance between two feature vectors. In order to optimize a certain feature weight, one calculates and
analyzes the distance without taking the corresponding feature into account. If the result has been improved, then the corresponding feature does not
have much relevance to the distance calculation like its weight represents. In case the result has been worsened, the corresponding feature does indeed have more
relevance and one must consider whether to increase its weight.
Furthermore, the distribution of the feature weights must be examined as well. To find a good compromise between a good scoring function and a good
K-measure, a linear weight distribution in which the features have different weights increasing arithmetically, fits best for this purpose. In our prototype,
we have taken a mixture of weights based either on the knowlegde of musicologists or on tests for finding weights.
Distance Matrices
The values being stored in the distance matrices define the similarities as well as the differences between two feature values. These values determined by
musicologists also need to be optimized. The results could be improved a little while decreasing the original values. In our prototype, the orignal values
were retained in most instances.
NULL Values
We classify NULL values for features into two different NULL value types:
- Non-Information-Null means that a certain feature was not used. One does not know how the writer would have written this feature because
it does not appear on the music score. This type of NULL value is neither included in the distance calculation nor does it have influence on the
classification result.
- No-Applicable-Null signals that the writer did not write this symbol. That means that a value will never exist for this certain
feature. Therefore, this NULL value will receive the maximal distance to all other values of the corresponding feature. This NULL value also
represents a value in the feature's domain and is located directly under the feature node in the feature base.
Normally, the user cannot distinguish between these two different NULL values. Therefore, the decision concerning what type of NULL value is made
automatically. Regardless, the results are better if as few NULL values as possible reside in the query feature vector.
Besides the distance, an ND-value is also calculated. The ND-value represents the proportion of features which are taken for the distance calculation. Since
NULL values have influence on the results, the distance is more reliable, the higher the ND-value is.
Due to these optimization tests, an optimal configuration of the distance function and its parameters as well as those of the system could be found. Thus, we
could improve our writer identification rate, increasing it up to 90%.