Similarity is mathematically difficult to express.
The currently known approaches fall into two main directions:
- Distance (metric methods)
- The number of matching elements relative to the size of the total set (stochastic methods)
Distance here mainly means the number of computation steps or basic operations, such as inserting, deleting, and replacing elements, needed to convert one string into the other.
The best known of these methods is the Levenshtein distance, which works exactly as just described. The calculation is not trivial, because we are looking for an optimal solution within a search tree.
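The search tree need not be explored exhaustively: dynamic programming (the Wagner-Fischer scheme) finds the optimal edit sequence in O(n·m) time. A minimal Python sketch, with the function name chosen by us for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the row short
    prev = list(range(len(b) + 1))  # distances from the empty prefix
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Each cell only depends on its left, upper, and upper-left neighbors, so a single rolling row suffices instead of the full matrix.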
Metric is a word that can be interpreted very widely.
The Pearson coefficient is more statistical, the Manhattan metric more geometric, and the two are quite different.
Jaccard similarity coefficient
Very different, and very simple, is the method of the Swiss botanist Paul Jaccard:
The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets.
You could even put it in simpler words, without mentioning set theory: count what both have in common and divide by everything that occurs at all.
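The definition translates almost word for word into code. A short sketch; comparing two words by their character sets is just one possible choice of "sample sets":

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# Compare two words by their sets of distinct characters:
# {n, h, t} are shared, 7 distinct letters occur overall.
print(jaccard(set("night"), set("nacht")))  # → 3/7 ≈ 0.43
```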
Permutation is another important word in this context.
A good metric for similarity should be able to consider all permutations seamlessly.
A human would say that both strings are nearly the same, just twisted around.
The Levenshtein distance would have to delete "Kroll, " (7 characters) and then append a space and "Kroll" at the end, which gives a similarity of exactly 0.5 (13 operations with n = 13). This still isn't a very smart solution to our problem.
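On the word level, the Jaccard coefficient handles this kind of permutation effortlessly. A hedged sketch, assuming the two strings in the example are "Kroll, Max" and "Max Kroll", and with our own trivial normalization:

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def word_set(s: str) -> set:
    # Illustrative normalization: lowercase and drop commas before splitting
    return set(s.lower().replace(",", "").split())

# The "twisted around" strings, as we read the example:
# both reduce to the word set {"max", "kroll"}.
print(jaccard(word_set("Kroll, Max"), word_set("Max Kroll")))  # → 1.0
```

Because sets are unordered by construction, any permutation of the words yields the same coefficient, which is exactly the seamlessness asked for above.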
Importance for mathematics
As mentioned above, this article should not refer solely to the search for duplicates.
The concept of similarity is of fundamental importance when it comes to discriminating between things or classifying them. Both are only possible if we can express the similarity between two entities in a reliable way.
Please also see the article "What is intelligence then?" on this site.