Fast Duplicate Search in DataTable Data
Library: KS.FuzzyDupesNet.dll (DotNet 2.0/4.0 Assembly)
Version: 6.x/7.x
Dependencies: System/none
Interface
General Information
The FuzzyDupes:NET Assembly enables you to find similar records in database record data.
The data is processed in memory, the control implements optimized data structures in c# for this purpose. The algorithms used are universal for any string data. Adress data can be processed as any other data as well.
There is one main function that receives a System.Data.DataTable and returns another DataTable with the results. Some parameters can adjust the search.
Download
You can always find the current version at /download/FuzzyDupesNETSetup.exe
Different versions will have the same download filename.
License
Disclaimer: FuzzyDupes:NET Assembly is provided "as is" with no warranty of any kind.
FuzzyDupes:NET Assembly is not Freeware. The Demo Version shows an about screen prior to every search. With a valid license, the about screen is never shown.
The Demo Version is for testing purposes only and must not be used in a productive environment..
More information about licensing the library can be found at the FuzzyDupes:NET product page. Or contact info@kroll-software.ch
Installation
- Copy the file KS.FuzzyDupesNet.dll and \de\KS.FuzzyDupesNet.resources.dll to your harddrive
- Add a dependency to this assembly into your DotNet 2.0 project (VB.NET, C#, or whatever..)
- Check Local Copy and set Specific Version to false
Methods
AboutBox()
Return Value: void
Parameters: none or [ParentForm]
Shows the AboutBox.
Reset()
Return Value: void
Parameters: none
Resets the settings for the next search. Call this function prior to every search.
Cancel()
Return Value: void
Parameters: none
Cancels a long running process.
You can call this in a Progress-Event.
SetColumnOption()
Set options for each column
Return Value: void
Parameters: int ColumnIndex, bool Cluster, bool DupeSearch, double Weight, bool CompareNull, string ColumnName[, int ImportColumnIndex]
Set the Options for your Variant-Array.
- ColumnIndex is zero-based (first column has the index 0)
- Cluster: Include this column in the cluster search (default = false)
- DupeSearch: Include this column in the duplicate search (default = false)
- Weight: Weight for this column relative to other weights. Values 0 - 1 are weighted, Values > 1 require identical value (no fuzziness), Values < 0 require a different value (no fuzziness). (default = 0.5)
- CompareNull: NULL values in this column are compared (default = false*)
- ColumnName: Name of this column
- ImportColumnIndex **: Mapping to the column index in the second table for FuzzyMatch() and FuzzyMerge()
Call this prior to DupeSearch(), FuzzyMatch() and FuzzyMerge()
*) Important note: Set CompareNull to 'true' for columns that are filled with data in (nearly) all rows (e.g. Last Name, Street, ZIP, City) and set CompareNull to 'false' for other columns which may contain NULL values in many rows (e.g. First Name, Phone Number, ...)
**) FuzzyMatch() and FuzzyMerge() require a mapping and so a valid ImportColumnIndex set for each column
DupeSearch()
Search for duplicate values in a System.Data.DataTable
Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, double Threshold, double ClusterThreshold, DupeReturnResults ReturnResults, bool ShowProgress[, System.Windows.Forms.Form ParentForm]
- DataTable T: The data table to perform the search in
- Threshold: Threshold option for Duplicate Search (default = 90.0)
- ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
- DupeReturnResults: DRR_DUPES|DRR_ALL - Return Dupes only, Return all records (default=DRR_DUPES)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog. (default = true)
The returned DataTable has an additional column containing GUIDs for each Record. These GUIDs are identical for similar records.
FuzzyMerge()
Merge two DataTables with fuzzyness
Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, System.Data.DataTable TMerge, double Threshold, double ClusterThreshold, MergeReturnResults ReturnResults, bool ShowProgress[, System.Windows.Forms.Form ParentForm]
This method requires a ImportColumnIndex set for each column with SetColumnOptions()
- DataTable T: first data table
- DataTable TMerge: second data table to merge
- Threshold: Threshold option for Duplicate Search (default = 90.0)
- ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
- ReturnResults: MRR_IMPORT|MRR_ALL|MRR_DUPES|MRR_CLEANMERGED - return new records only | return all records | return Dupes only | return the merged result without any dupes (default=MRR_IMPORT)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog. (default = true)
The returned DataTable has up to 2 additional columns (depending on ReturnResults)
- GUIDs for each Record
- Source of record (0=T, 1=TMerge)
These GUIDs are identical for similar records.
FuzzyMatch()
Matches two System.Data.DataTable with fuzzyness,
positive/negative match between two lists
Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, System.Data.DataTable TMatch, double Threshold, double ClusterThreshold, MatchReturnResults ReturnResults, bool ShowProgress, System.Windows.Forms.Form ParentForm
This method requires a ImportColumnIndex set for each column with SetColumnOptions()
- DataTable T: first data table
- DataTable TMatch: second data table to match with
- Threshold: Threshold option for Duplicate Search (default = 90.0)
- ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
- ReturnResults: MAR_ALL|MAR_POSITIVE|MAR_NEGATIVE - return all records | return positive matches | return negative matches (default=MAR_ALL)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog.
(default = true)
The returned DataTable has 1 additional column containing 0=No-Match, or Index (1 based) from TMatch where a match was found
ExactMatch()
Matches two DataTables exact (without fuzzyness) for a given column
Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, System.Data.DataTable TMatch, int MatchColumnIndex, ExactMatchReturnResults ReturnResult
- DataTable T: first data table
- DataTable TMatch: second data table to match with
- MatchColumnIndex: column index where data must match
- ReturnResults: MAR_POSITIVE|MAR_NEGATIVE - match positive / negative
DeleteDupes()
Delete duplicate values from a DataTable (result from another method) depending on a FuzzyDupesID (GUID)
Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, int FIDColumnIndex, int SortColumnIndex, int MaxColumn, DeleteOrders DeleteOrder, DeleteReturnResults ReturnResults
- DataTable T: the data
- FIDColumnIndex: Zero-based ID of the FuzzyDupesID (GUID) Column (default = -1)
- SortColumnIndex: index of the column to sort the results
- MaxColumn: only return MaxColumnIndex number of columns. Used to cut the FuzzyDupesID (GUID) from the result (default = -1)
- DeleteOrder: DO_KEEP_FIRST|DO_KEEP_LAST - remove all dupes except the first / last (default=DO_KEEP_FIRST)
- ReturnResults: DER_CLEAN|DER_DELETED - cleaned result / deleted records
Use this function, to get a clean (dupe free) search result:
- Search for dupes with DupeSearch() and ReturnResults=DRR_ALL
- Call this function to remove the dupes from the result
Canceled
This property returns true, when a long running process was canceled.
Return Value: boolean
Parameters: none
Working with Normalization Rules
The Method SetNormalizeOption() sets a normalization rule to a column.
AddNormalizeRule() adds a search/replace pair to the current normalize list
You can use the class NormalizeList to create new, load and save lists for normalization.
--- End ---