Dedupe, Data Cleansing, Data Quality and Record Linkage Solution
What is a fuzzy duplicate search?
Databases can generate a list of identical data records within a matter of moments. They do this by creating indices. These tree-like data structures can find a particular record even in very large databases with just a few access operations. Searching for absolute duplicates, in other words completely identical data records, is therefore a very simple task.
In contrast, finding similar records, e.g. addresses containing minor spelling mistakes, reversals, missing letters etc. is a major problem for a computer.
While a person will see at first glance that two data records are similar, the notion of "similar" is extremely difficult to express as computer rules (algorithms).
The other side of the coin is that a person will find it impossible to pick out duplicate records from a pool of just a few hundred data records. Duplicates typically account for at least 1% to 3% of each database that we come across — even the best maintained.
These duplicate records are a major source of increased costs when performing tasks such as sending out catalogs and also cause serious problems in terms of accountancy, support, controlling etc. Performing a fuzzy duplicate search is of particular importance if you are amalgamating data, e.g. following purchase of new addresses.
How does a fuzzy duplicate search work?
Algorithms that compare character strings and detect recurring patterns within these strings are known as pattern matching algorithms.
A couple of these algorithms are well known throughout the IT industry and are commonly used for such purposes.
One such example is the Levenshtein distance metric (also known as the edit distance metric).
The edit distance is the minimum number of elementary editing steps (insert, replace, delete) needed to convert character string A into character string B. Pattern matching algorithms such as these are fairly compute-intensive, so programs for finding fuzzy duplicates can require very long computing times, often days or even months. (FuzzyDupes 5 can usually search 30,000 data records in less than a minute.)
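As an illustration of the edit distance, here is a minimal Python sketch of the textbook dynamic-programming formulation. The function names levenshtein and similarity are chosen for this example, and the conversion of a distance into a 0..1 similarity score is one common convention, assumed here rather than taken from FuzzyDupes.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and
    replacements needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string inside
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace (or keep on a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed convention: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Kroll", "Krol"))  # one deleted letter
print(similarity("Kroll", "Krol"))
```

Each call fills a table of len(A) x len(B) cells, which is why comparing every record with every other record quickly becomes expensive.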
As the number of direct comparisons between two data records increases at least with the square of the number of data records, an attempt is made to group records before performing a search. This preselection enables searches to be performed even in very large databases. We will call this preselection process clustering.
For instance, a basic clustering process for address data would involve looking only at the first two characters of the zip code (assuming that these were entered correctly) and grouping the records according to these. This is a typical starting point for a duplicate search. Within this group, each record is then compared with all of the others using a tool such as the Levenshtein distance metric. However, when dealing with larger databases, this results in extremely long computing times.
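The effect of such a preselection can be sketched in a few lines of Python. The record layout (dictionaries with hypothetical "name" and "zip" fields) and the function names are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of direct comparisons when every record meets every other."""
    return n * (n - 1) // 2


def block_by_zip_prefix(records, prefix_len=2):
    """Group records by the first characters of their zip code."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["zip"][:prefix_len]].append(record)
    return blocks


def candidate_pairs(records, prefix_len=2):
    """Yield only the pairs that share a zip prefix."""
    for block in block_by_zip_prefix(records, prefix_len).values():
        yield from combinations(block, 2)


records = [
    {"name": "Detlef Kroll", "zip": "50667"},
    {"name": "Kroll, Detlef", "zip": "50668"},
    {"name": "Anna Meier", "zip": "80331"},
    {"name": "A. Meier", "zip": "80333"},
    {"name": "Jan Schmidt", "zip": "10115"},
]
print(pair_count(len(records)))            # all pairs without preselection
print(len(list(candidate_pairs(records)))) # pairs left after grouping
print(pair_count(30_000))                  # 449985000 pairs for 30,000 records
```

The sketch also shows the weakness mentioned above: a typing error in the first two digits of the zip code puts a record into the wrong group, and its duplicate is never compared with it.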
Phonetic algorithms such as Soundex or Metaphone are considerably more sophisticated clustering keys and deliver better results. However, they place great emphasis on the initial letter, so a mistake in the first character usually separates two records. FuzzyDupes 3 used phonetic algorithms for the purposes of clustering.
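To make the idea of a phonetic key concrete, the following Python sketch implements the classic American Soundex variant and groups names by their code. It is an illustration only and is not meant to reproduce the clustering used in FuzzyDupes 3; the function names are chosen for this example.

```python
# Letter-to-digit table of American Soundex; vowels, H, W and Y have no code.
CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code (first letter plus three digits)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                      # the initial letter is kept as-is
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue                       # H and W do not separate equal codes
        digit = CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit                   # a vowel resets the previous code
    return (code + "000")[:4]


def cluster_by_soundex(names):
    """Group names that share the same Soundex code."""
    clusters = {}
    for name in names:
        clusters.setdefault(soundex(name), []).append(name)
    return clusters


print(cluster_by_soundex(["Meier", "Mayer", "Kroll", "Croll"]))
```

"Meier" and "Mayer" end up in the same group, while "Kroll" and "Croll" do not, because the first letter is copied into the code unchanged.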
Before clustering and pattern matching take place, a third step is required: normalization of the data. In this process, umlauts and special characters are replaced, along with common abbreviations (such as Straße/Strasse to Str.), to improve the search results from the outset. These modifications are simple, and each one usually has only a minor impact on the results. Some forms of normalization can even reduce the quality of the results: any given rule can help with certain records while hurting others, so the value of a rule often depends heavily on the data in question.
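A normalization step of this kind might look as follows in Python. The rule tables are deliberately tiny and purely illustrative; the names UMLAUTS, ABBREVIATIONS and normalize are assumptions for this sketch and do not describe the rule set shipped with FuzzyDupes.

```python
import re

# Illustrative rules only; useful rule sets depend on the data in question.
UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ß": "ss",
})
ABBREVIATIONS = [
    (re.compile(r"STRASSE\b"), "STR."),
]


def normalize(text: str) -> str:
    """Replace umlauts, upper-case, abbreviate, drop special characters."""
    text = text.translate(UMLAUTS).upper()
    for pattern, replacement in ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    text = re.sub(r"[^A-Z0-9. ]", " ", text)  # remaining special characters
    return re.sub(r" +", " ", text).strip()   # collapse repeated spaces


print(normalize("Hauptstraße  12"))
print(normalize("Hauptstr. 12"))
```

Both spellings of the street name are mapped to the same string, so later steps no longer have to treat them as different.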
What is special about FuzzyDupes?
As explained above, a fuzzy duplicate search essentially comprises three steps:
Normalization of data
Clustering/preselection
Use of a suitable pattern matching algorithm
1. Normalization
FuzzyDupes 5 features an editor to create customized normalization rules. A default normalization is also provided. This includes converting umlauts, special characters and double spaces, changing lower case text to upper case etc. The program also includes a number of additional pre-defined rules and common abbreviations that are virtually always beneficial when handling address data. You can also define your own rules.
2. Clustering
FuzzyDupes uses all the character strings present in the database to create a TriGram Hash index.
Trigrams are all the three-character sets that occur within the data.
E.g. in "Kroll", the trigrams would be _KR, KRO, ROL, OLL, LL_.
This process is already widely known. What is special about our solution is that it comes with an extremely powerful cluster engine that is able to compare some 10 million trigrams per second on a standard PC!
A trigram index of this caliber offers superb clustering with mathematical precision and optimum selectivity, while factoring in all data permutations such as reversals, mirrored strings, insertions etc.
Our clustering process allows searches to be performed even in very large databases within a relatively short space of time, all the while delivering reliable results. For best results, it is important that the user selects suitable columns for the cluster search.
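The general idea of a trigram index can be sketched in Python with an ordinary dictionary as the hash index. This is a simplified illustration of the widely known technique, not the FuzzyDupes cluster engine; the function names and the min_shared threshold are assumptions for this example.

```python
from collections import defaultdict


def trigrams(text: str, pad: str = "_") -> list:
    """All three-character windows of the padded, upper-cased text."""
    padded = pad + text.upper() + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_index(values):
    """Map every trigram to the set of record numbers that contain it."""
    index = defaultdict(set)
    for record_id, value in enumerate(values):
        for gram in set(trigrams(value)):
            index[gram].add(record_id)
    return index


def candidates(index, query: str, min_shared: int = 2) -> set:
    """Records sharing at least min_shared distinct trigrams with the query."""
    shared = defaultdict(int)
    for gram in set(trigrams(query)):
        for record_id in index.get(gram, ()):
            shared[record_id] += 1
    return {record_id for record_id, n in shared.items() if n >= min_shared}


names = ["Kroll", "Krol", "Meier", "Kroll, Detlef"]
index = build_index(names)
print(trigrams("Kroll"))                  # ['_KR', 'KRO', 'ROL', 'OLL', 'LL_']
print(candidates(index, "Detlef Kroll"))  # record numbers worth comparing
```

Because the lookup counts shared trigrams regardless of where they occur, a record such as "Kroll, Detlef" is still found for the query "Detlef Kroll", while unrelated records are never touched.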
3. Pattern matching
The best-known and most widely used algorithm is probably the Levenshtein distance metric described above. However, we choose not to use this.
The Levenshtein distance metric and other well-known algorithms offer good results on the whole, but fall down when parts of a character string are swapped, for example when first and last name appear in reverse order. In cases such as this, these algorithms typically report no more than about 50% similarity. In contrast, a person would see a much higher degree of similarity (e.g. with "Detlef Kroll" and "Kroll, Detlef").
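The effect can be reproduced with Python's standard library. In the sketch below, difflib.SequenceMatcher serves as a stand-in for an order-sensitive string comparison (it is not the Levenshtein metric), and sorting the words before comparing is one simple, commonly used workaround. Neither is the algorithm used by FuzzyDupes; the function names are chosen for this example.

```python
import re
from difflib import SequenceMatcher


def order_sensitive_similarity(a: str, b: str) -> float:
    """Similarity in 0..1 that depends on the order of the characters."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_key(text: str) -> str:
    """Upper-case the text, drop punctuation and sort the words."""
    return " ".join(sorted(re.findall(r"\w+", text.upper())))


def token_sorted_similarity(a: str, b: str) -> float:
    """Compare the sorted word lists instead of the raw strings."""
    return order_sensitive_similarity(token_sort_key(a), token_sort_key(b))


a, b = "Detlef Kroll", "Kroll, Detlef"
print(order_sensitive_similarity(a, b))  # below 0.5 for this pair
print(token_sorted_similarity(a, b))     # 1.0, word order no longer matters
```

Word sorting only handles whole words that change places; it does not help when the swapped parts are not separated by spaces or punctuation.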
FuzzyDupes 5 uses an internally developed algorithm that takes all possible permutations within a character string into account when rating similarity.
Why is FuzzyDupes so affordable compared with other duplicate search programs?
Duplicate search programs used to be specialized solutions and the preserve of a very limited clientele. In addition, these applications could only be run on mainframe computers due to the high computing power required. As a result, the applications were very expensive.
We believe that the ability to perform fuzzy duplicate searches is crucial for every company maintaining a customer database. We want to make our application accessible to small and medium sized enterprises and recognize that the price must stand in direct relation to the benefit gained. These considerations form the basis for the price of our product. However, the benefit for your company can far exceed the cost of a FuzzyDupes license.
Why does the new demo version offer full functionality?
We have learned from experience that it is difficult to persuade new users of the necessity of a duplicate search if the demo version is too limited. Until now, we have only issued full-featured versions on request.
However, we believe that a demo version that is too limited is of no use to anybody. With this in mind, our latest demo version allows you to try out the full functionality of the program and assess a complete list of search results.
Please note: This free demo version is supplied for trial purposes only as a means of helping you to decide whether the application lives up to its claims and whether or not you have a need for a duplicate search program.
To use the search results, you will need to purchase a license. We call this "fair software". Please be fair and obtain a license for this software if you wish to use it in a production environment.
What new features are offered by FuzzyDupes 2007?
Complete migration to DotNet 2.0/C#, which makes the program stable and fit for the future
Enhanced algorithms, in particular our new pattern matching algorithm (see above)
Uses considerably less memory
User-defined normalization rules
Full Unicode support. This generally allows fuzzy duplicate searches to be performed in languages that require Unicode
Improved user interface
Direct deletion from MS Outlook and Windows address book
Many other refinements
The program needs to create large data structures in the computer's main memory.
For this reason, the number of records that can be searched is limited by the amount of memory available.
Under a 32-bit operating system, up to around 2.5 GB of memory can be addressed.
With the previous version, this was sufficient to search some 300,000 to 500,000 records.
Version 5 raises the bar even higher.
Supported Data Sources / DBMS:
MS-Access, MS-Access 2007*
MS SQL-Server
MS-Excel, MS-Excel 2007*
Text/CSV Files
Other Datasources with ODBC-Driver or OLEdb Provider, e.g. Oracle, MySQL, dBase, Foxpro, Paradox, FileMaker, Cache, PostgreSQL, etc.
Improved search and deletion in MS-Outlook contact folders.
This makes FuzzyDupes the solution for cleansing Outlook contacts.
These downloads can be used for a limited 30-day demo.
The full version can then be activated by purchasing a licence.
Order
You can test FuzzyDupes 2007 for free and with no obligation for 30 days,
after which you are required to purchase a usage license if you want to continue using the program.
A single work-station** license costs EUR 249.- net amount*
*) All prices are net amounts.
Whether and how much VAT you have to pay depends on how and from where you place your order.
More information can be found in the Shop.
**) The license permits use of the software at one work station for an unlimited time. There are no further costs. Price includes updates to all 5.x versions and free support by email or phone.
Order FuzzyDupes 2007
Secure payment via credit card, bank transfer or check,
handled by ShareIt! (element 5 AG, Koeln)
The update is free for registered users of version 4.x, if you ordered the software after Dec. 16, 2005.
If your license is older, you can order an update license for EUR 119.-
Sample Search Result
(FuzzyDupes finds similar records in address databases.)
If you have questions about this product or ordering,
don't hesitate to contact us. We'll be happy to advise you.
Phone: +41-41-5351767 (Switzerland)
These customers have chosen FuzzyDupes:
Acromag, Inc.,
Arnold & Porter LLP,
Axonmedia GmbH,
BAUHERR GmbH,
Boesner GmbH,
BFB Branchen-Fernsprechbuch GmbH,
BOTAMENT Systembaustoffe GmbH & Co.KG,
Bundesanstalt für Arbeitsschutz und Arbeitsmedizin,
CAQ AG Factory Systems,
Citigroup USA,
Coalition America,
COMIT AG,
CompuMED GmbH & Co.KG,
CreditPlus Bank AG,
Danfoss GmbH,
Degussa AG,
Deutsche Bahn AG,
Deutsche Lufthansa AG,
DHL Solutions GmbH,
Dresdner Bank Luxemburg S.A.,
DuPont Performance Coatings GmbH & Co. KG,
EDB Group,
E.ON Ruhrgas AG,
Erlau AG,
European Businessguide GmbH,
Familotel AG,
fischerwerke,
Fraunhofer IML,
Fresenius Netcare GmbH,
Handwerkskammer Hamburg,
Hewlett Packard EMEA GmbH,
Hilti, Inc.,
Hirschfeld Touristik Event GmbH & Co.KG,
InterRisk Versicherungs AG,
Kraft USA,
LIDL Stiftung & Co KG,
Liechtensteinische Post AG,
Maritim Hotelgesellschaft mbH,
music-city Steinbrecher GmbH & Co.KG,
Oberfinanzdirektion Frankfurt,
Oberfinanzdirektion Hannover,
Oberfinanzdirektion Karlsruhe Landeszentrum f. Datenverarbeitung,
Oberfinanzdirektion Magdeburg,
OÖ. Tourismus Technologie GmbH,
ORWO Media GmbH,
OSRAM GmbH,
P&I Personal & Informatik AG,
SCA Packaging Deutschland,
SGI-USA,
Siemens VDO Automotive AG,
Stadt Göttingen,
Stadt Münster,
Stadt Solingen,
Toys "R" Us GmbH,
Volksbank Bad Saulgau eG,
Vorarlberger Volksbank,
Westermann AG,
Wincare Versicherungen,
Wirtschaftskammer Oberösterreich,
WTS Schaltgeräte GmbH,
Xella Baustoffe GmbH,
Zürcher Hochschule für Angewandte Wissenschaften,
and many more.
You'll find more information on FuzzyDupes at our online help page.
This software product was tested in the Softpedia labs.
Softpedia guarantees that FuzzyDupes 2007 is 100% CLEAN, which means it does not contain any form of malware, including but not limited to: spyware, viruses, trojans and backdoors.
FuzzyDupes Duplicate Search in your Applications
Do you want to integrate FuzzyDupes into your applications?
We provide you with a DotNet 2.0 Assembly or a COM-Object and free developer support.