Glossary

Cluster Creation

A fuzzy comparison of every record in a database with every other record is impractical for all but small databases, because the number of individual comparisons grows quadratically with the number of records and quickly reaches an astronomical figure.
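
To make the scale concrete: comparing each of n records with every other record takes n(n-1)/2 comparisons. The short sketch below only illustrates this arithmetic; the function name is made up for the example and is not part of FuzzyDupes.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Print the comparison count for a few database sizes.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} records -> {pairwise_comparisons(n):,} comparisons")
```

A database of 100,000 records already needs roughly five billion comparisons, which is why a pre-selection step is required.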

Any duplicate-search software must therefore perform a pre-selection, in which records are grouped into blocks that are suitable for fuzzy comparison.

FuzzyDupes uses a cluster-creation method that calculates small, reliable clusters very quickly. The method is based on n-grams, that is, short substrings that the program can process very efficiently.
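
The exact method FuzzyDupes uses is not described here. As a rough illustration of the general idea, the sketch below groups records that share at least one character n-gram, so that only records within the same group need to be compared. The function names, the default n-gram length of 2, and the max_size cut-off are assumptions made for this example.

```python
from collections import defaultdict


def ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of text (lowercased)."""
    text = text.lower()
    if len(text) <= n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_clusters(values: list[str], n: int = 2, max_size: int = 50) -> dict[str, list[int]]:
    """Map each n-gram to the record indices containing it.

    Groups with a single record or more than max_size records are dropped.
    """
    index = defaultdict(list)
    for record_id, value in enumerate(values):
        for gram in ngrams(value, n):
            index[gram].append(record_id)
    return {gram: ids for gram, ids in index.items() if 2 <= len(ids) <= max_size}


names = ["Meier", "Meyer", "Schmidt", "Schmitt"]
clusters = build_clusters(names)

# Only records that share a cluster become candidate pairs for fuzzy comparison.
candidate_pairs = {
    (a, b)
    for ids in clusters.values()
    for i, a in enumerate(ids)
    for b in ids[i + 1:]
}
print(sorted(candidate_pairs))  # [(0, 1), (2, 3)]
```

Here only two of the six possible pairs survive the pre-selection, because "Meier"/"Meyer" and "Schmidt"/"Schmitt" share n-grams with each other but not across the two groups.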

This method is mathematically exact, and the cluster size can be freely adjusted. The current, average, and maximum cluster sizes are displayed during the duplicate search.

You should experiment a little with different cluster columns or thresholds to see whether you can improve the results further.

Normalization

Before the strings in the individual columns undergo fuzzy comparison, the data is normalized (temporarily, in memory). In this step, special characters and umlauts are replaced, and many common abbreviations and spelling variants are converted to a standard form (e.g. street -> st.).
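
A minimal sketch of such a normalization step is shown below. The replacement tables are hypothetical examples chosen for illustration; the actual rules FuzzyDupes applies are not listed in this glossary.

```python
import re

# Hypothetical replacement tables, for illustration only.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
ABBREVIATIONS = {"street": "st", "strasse": "str", "road": "rd"}


def normalize(value: str) -> str:
    """Return a simplified copy of value for comparison; the input is not changed."""
    text = value.lower()
    for char, replacement in UMLAUTS.items():
        text = text.replace(char, replacement)
    # Replace anything that is not a letter, digit, or space with a space.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # Map known long forms to their abbreviations, word by word.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize("Müller-Straße 12"))  # mueller str 12
print(normalize("Main Street"))       # main st
```

Because the function returns a new string, the stored data stays untouched, which matches the temporary, in-memory nature of the step.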

Fuzzy Comparison

In this step, the correspondence (in percent) between individual columns is calculated using suitable algorithms. These values are then combined, taking the different column weightings into account, into an average correspondence.
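
The weighted average can be sketched as follows. The column names, scores, and weights are invented for the example, and the formula is the ordinary weighted mean; whether FuzzyDupes combines the values in exactly this way is an assumption.

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-column correspondence percentages into one weighted mean."""
    total_weight = sum(weights[column] for column in scores)
    if total_weight == 0:
        raise ValueError("at least one column needs a non-zero weight")
    return sum(scores[column] * weights[column] for column in scores) / total_weight


# Hypothetical per-column results for one pair of records.
scores = {"name": 90.0, "street": 80.0, "city": 100.0}
weights = {"name": 2.0, "street": 1.0, "city": 1.0}

print(weighted_average(scores, weights))  # 90.0
```

Giving the name column twice the weight means a mismatch there lowers the overall result more than a mismatch in the street or city.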

Several pattern-matching algorithms exist, some better suited to this purpose than others. Every maker of duplicate-search software implements their own process. A well-suited process should cope well with permutations of characters (for example, two swapped letters) in order to achieve high selectivity.
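
One well-known measure that tolerates swapped characters is the optimal string alignment distance, a restricted form of Damerau-Levenshtein distance that counts a swap of two adjacent characters as a single edit. The sketch below is a generic illustration of that technique and is not the algorithm FuzzyDupes uses; the function names and the conversion to a percentage are assumptions.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Two adjacent characters swapped count as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def correspondence_percent(a: str, b: str) -> float:
    """Convert the distance into a percentage relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - osa_distance(a, b)) / longest


print(osa_distance("meier", "meire"))            # 1
print(correspondence_percent("meier", "meire"))  # 80.0
```

A plain edit distance without the swap rule would count the same typing error as two edits, which would lower the correspondence to 60 percent.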

If the resulting correspondence (in percent) lies above the configured threshold value, the records are marked as duplicates (they receive the same fuzzydupes_ID).
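
The following sketch shows one way to turn pairwise results into shared group IDs. It assumes that duplicates are grouped transitively (if A matches B and B matches C, all three share an ID) and uses the smallest record index as the ID; both choices, like the function name and sample scores, are assumptions for illustration and say nothing about how FuzzyDupes assigns fuzzydupes_ID internally.

```python
def assign_duplicate_ids(n_records: int, pair_scores: dict, threshold: float) -> list[int]:
    """Give records linked by a score above threshold the same group ID."""
    parent = list(range(n_records))

    def find(i: int) -> int:
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), score in pair_scores.items():
        if score > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller index as the group's ID.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_records)]


# Hypothetical average correspondences for three candidate pairs.
pair_scores = {(0, 1): 92.5, (1, 2): 88.0, (2, 3): 40.0}
print(assign_duplicate_ids(4, pair_scores, threshold=85.0))  # [0, 0, 0, 3]
```

Records 0, 1, and 2 end up in one group, while record 3 keeps its own ID because its only comparison stays below the threshold.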

Selectivity

The quality of a duplicate search is measured not only by how many duplicates are discovered, but also by how reliable the result is, that is, by how few false duplicates are found.

The influence that changes in the threshold value have on the search result is also important. It would be a disadvantage if small changes in the threshold value led to drastic differences in the search result, or if selectivity suffered at lower threshold values.

You will soon find that in FuzzyDupes the threshold value has no significant influence on the search result (perhaps tiny differences in the value after the decimal point). Even with lower threshold values, FuzzyDupes achieves a very good degree of selectivity.