Library Documentation¶

Dedupe Objects¶

Class for active learning deduplication. Use deduplication when you have data that can contain multiple records that can all refer to the same entity.

class Dedupe(variable_definition, [data_sample[, [num_cores]])

Initialize a Dedupe object with a field definition

Parameters: variable_definition (dict) – A variable definition is list of dictionaries describing the variables will be used for training a model. num_cores (int) – the number of cpus to use for parallel processing, defaults to the number of cpus available on the machine data_sample – __DEPRECATED__
# initialize from a defined set of fields
variables = [
{'field' : 'Site name', 'type': 'String'},
{'field' : 'Zip', 'type': 'String', 'has missing':True},
{'field' : 'Phone', 'type': 'String', 'has missing':True}
]

deduper = dedupe.Dedupe(variables)

sample(data[, [sample_size=15000[, blocked_proportion=0.5[, original_length]]])

In order to learn how to deduplicate your records, dedupe needs a sample of your records to train on. This method takes a mixture of random sample of pairs of records and a selection of pairs of records that are much more likely to be duplicates.

Parameters: data (dict) – A dictionary-like object indexed by record ID where the values are dictionaries representing records. sample_size (int) – Number of record tuples to return. Defaults to 15,000. blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs. Defaults to 0.5. original_length – If data is a subsample of all your data, original_length should be the size of your complete data. By default, original_length defaults to the length of data.
deduper.sample(data_d, 150000, .5)

uncertainPairs()

Returns a list of pairs of records from the sample of record pairs tuples that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

> pair = deduper.uncertainPairs()
> print pair
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]

markPairs(labeled_examples)

Add users labeled pairs of records to training data and update the matching model

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters: labeled_examples (dict) – a dictionary with two keys, match and distinct the values are lists that can contain pairs of records.
labeled_examples = {'match'    : [],
'distinct' : [({'name' : 'Georgie Porgie'},
{'name' : 'Georgette Porgette'})]
}
deduper.markPairs(labeled_examples)

train([recall=0.95[, index_predicates=True]])

Learn final pairwise classifier and blocking rules. Requires that adequate training data has been already been provided.

Parameters: recall (float) – The proportion of true dupe pairs in our training data that that we the learned blocks must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare. recall should be a float between 0.0 and 1.0, the default is 0.95 index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data. Index predicates can be slower and take susbstantial memory. Defaults to True.
deduper.train()

writeTraining(file_obj)

Write json data that contains labeled examples to a file object.

Parameters: file_obj (file) – File object.
with open('./my_training.json', 'w') as f:
deduper.writeTraining(f)

readTraining(training_file)

Read training from previously saved training data file object

Parameters: training_file (file) – File object containing training data
with open('./my_training.json') as f:

cleanupTraining()

Delete data we used for training.

data_sample, training_pairs, training_data, and activeLearner can be very large objects. When you are done training you may want to free up the memory they use.

deduper.cleanupTraining()

threshold(data[, recall_weight=1.5])

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of data.

Parameters: data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names recall_weight (float) – sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
> threshold = deduper.threshold(data, recall_weight=2)
> print threshold
0.21

match(data[, threshold = 0.5[, generator=False]]])

Identifies records that all refer to the same entity, returns tuples containing a sequence of record ids and corresponding sequence of confidence score as a float between 0 and 1. The record_ids within each set should refer to the same entity and the confidence score is a measure of our confidence a particular entity belongs in the cluster.

This method should only used for small to moderately sized datasets for larger data, use matchBlocks

Parameters: data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names threshold (float) – a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall, raising it will increase precision generator (bool) – when True, match will generate a sequence of clusters, instead of a list. Defaults to False
> duplicates = deduper.match(data, threshold=0.5)
> print duplicates
[((1, 2, 3),
(0.790,
0.860,
0.790)),
((4, 5),
(0.720,
0.720)),
((10, 11),
(0.899,
0.899))]

classifier

By default, the classifier is a L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must be have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()

thresholdBlocks(blocks, recall_weight=1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.

For larger datasets, you will need to use the thresholdBlocks and matchBlocks. This methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks. .. code:: python

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments

Parameters: blocks (list) – See matchBlocks recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks[, threshold=.5])

Partitions blocked data and generates a sequence of clusters, where each cluster is a tuple of record ids

Keyword arguments

Parameters: blocks (list) – Sequence of records blocks. Each record block is a tuple containing records to compare. Each block should contain two or more records. Along with each record, there should also be information on the blocks that cover that record. For example, if we have three records: (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks: # Block 1 (Whole name) (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) # Block 2 (Whole name) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Block 3 (Whole address (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  So, the blocks you feed to matchBlocks should look like this, after filtering out the singleton block. blocks =(( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])) ), ( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}, set([])) ) ) deduper.matchBlocks(blocks)  Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again. threshold (float) – Number between 0 and 1 (default is .5). We will only consider as duplicates record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold. Lowering the number will increase recall, raising it will increase precision.
writeSettings(file_obj[, index=False])

Write a settings file that contains the data model and predicates to a file object.

Parameters: file_obj (file) – File object. bool (index) – Should the indexes of index predicates be saved. You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
deduper.writeSettings(f, indexes=True)

loaded_indices

Indicates whether indices for index predicates was loaded from a settings file.

blocker(data[, target=False])

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters: data (list) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items(). target (bool) – Indicates whether the data should be treated as the target data. This effects the behavior of search predicates. If target is set to True, an search predicate will return the value itself. If target is set to False the search predicate will return all possible values within the specified search distance. Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a “name” field; and we have a record like {“name”: “thomas”}. If the target is set to True then the predicate will return “thomas”. If target is set to False, then the blocker could return “thomas”, “tomas”, and “thoms”. By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.
> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]

blocker.index_fields

A dictionary of the Index Predicates that will used for blocking. The keys are the fields the predicates will operate on.

blocker.index(field_data, field)

Indexes the data from a field for use in a index predicate.

Parameters: field data (set) – The unique field values that appear in your data. field (string) – The name of the field
for field in deduper.blocker.index_fields :
field_data = set(record[field] for record in data)
deduper.index(field_data, field)


StaticDedupe Objects¶

Class for deduplication using saved settings. If you have already trained dedupe, you can load the saved settings with StaticDedupe.

class StaticDedupe(settings_file[, num_cores])

Initialize a Dedupe object with saved settings

Parameters: settings_file (file) – A file object containing settings info produced from the Dedupe.writeSettings() of a previous, active Dedupe object. num_cores (int) – the number of cpus to use for parallel processing, defaults to the number of cpus available on the machine
threshold(data[, recall_weight=1.5])

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of data.

Parameters: data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names recall_weight (float) – sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
> threshold = deduper.threshold(data, recall_weight=2)
> print threshold
0.21

match(data[, threshold = 0.5[, generator=False]]])

Identifies records that all refer to the same entity, returns tuples containing a sequence of record ids and corresponding sequence of confidence score as a float between 0 and 1. The record_ids within each set should refer to the same entity and the confidence score is a measure of our confidence a particular entity belongs in the cluster.

This method should only used for small to moderately sized datasets for larger data, use matchBlocks

Parameters: data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names threshold (float) – a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall, raising it will increase precision generator (bool) – when True, match will generate a sequence of clusters, instead of a list. Defaults to False
> duplicates = deduper.match(data, threshold=0.5)
> print duplicates
[((1, 2, 3),
(0.790,
0.860,
0.790)),
((4, 5),
(0.720,
0.720)),
((10, 11),
(0.899,
0.899))]

classifier

By default, the classifier is a L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must be have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()

thresholdBlocks(blocks, recall_weight=1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.

For larger datasets, you will need to use the thresholdBlocks and matchBlocks. This methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks. .. code:: python

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments

Parameters: blocks (list) – See matchBlocks recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks[, threshold=.5])

Partitions blocked data and generates a sequence of clusters, where each cluster is a tuple of record ids

Keyword arguments

Parameters: blocks (list) – Sequence of records blocks. Each record block is a tuple containing records to compare. Each block should contain two or more records. Along with each record, there should also be information on the blocks that cover that record. For example, if we have three records: (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks: # Block 1 (Whole name) (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) # Block 2 (Whole name) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Block 3 (Whole address (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  So, the blocks you feed to matchBlocks should look like this, after filtering out the singleton block. blocks =(( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])) ), ( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}, set([])) ) ) deduper.matchBlocks(blocks)  Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again. threshold (float) – Number between 0 and 1 (default is .5). We will only consider as duplicates record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold. Lowering the number will increase recall, raising it will increase precision.
writeSettings(file_obj[, index=False])

Write a settings file that contains the data model and predicates to a file object.

Parameters: file_obj (file) – File object. bool (index) – Should the indexes of index predicates be saved. You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
deduper.writeSettings(f, indexes=True)

loaded_indices

Indicates whether indices for index predicates was loaded from a settings file.

blocker(data[, target=False])

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters: data (list) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items(). target (bool) – Indicates whether the data should be treated as the target data. This effects the behavior of search predicates. If target is set to True, an search predicate will return the value itself. If target is set to False the search predicate will return all possible values within the specified search distance. Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a “name” field; and we have a record like {“name”: “thomas”}. If the target is set to True then the predicate will return “thomas”. If target is set to False, then the blocker could return “thomas”, “tomas”, and “thoms”. By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.
> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]

blocker.index_fields

A dictionary of the Index Predicates that will used for blocking. The keys are the fields the predicates will operate on.

blocker.index(field_data, field)

Indexes the data from a field for use in a index predicate.

Parameters: field data (set) – The unique field values that appear in your data. field (string) – The name of the field
for field in deduper.blocker.index_fields :
field_data = set(record[field] for record in data)
deduper.index(field_data, field)


Gazetteer Objects¶

Class for active learning gazetteer matching.

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’, i.e. one that does not have any duplicates. This class is useful for such tasks as matching messy addresses against a clean list.

The interface is the same as for RecordLink objects except for a couple of methods.

class Gazetteer
index(data)

Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.

Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
unindex(data) :

Remove records from the index of records to match against.

Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
match(messy_data[, threshold=0.5[, n_matches=1[, generator=False]]])

Identifies pairs of records that could refer to the same entity, returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.

Parameters: messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names. threshold (float) – a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall, raising it will increase precision n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1 generator (bool) – when True, match will generate a sequence of possible matches, instead of a list. Defaults to False This makes match a lazy method.
threshold(messy_data, recall_weight = 1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of data.

Parameters: messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names. recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks, threshold=.5, n_matches=1)

Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids

Parameters: blocks (list) – Sequence of records blocks. Each record block is a tuple containing two sequences of records, the records from the messy data set and the records from the canonical dataset. Within each block there should be at least one record from each datasets. Along with each record, there should also be information on the blocks that cover that record. For example, if we have two records from a messy dataset one record from a canonical dataset: # Messy (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Canonical (3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’})  and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks: # Block 1 (Whole name) (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) # Block 2 (Whole name) (2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Block 3 (Whole address (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’})  So, the blocks you feed to matchBlocks should look like this, blocks =(( [(1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([]))], [(3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])] ), ( [(1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1]), ((2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}, set([])], [((3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])] ) ) linker.matchBlocks(blocks)  threshold (float) – Number between 0 and 1 (default is .5). We will only consider as duplicates record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold. Lowering the number will increase recall, raising it will increase precision. n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1
clustered_dupes = deduper.matchBlocks(blocked_data, threshold)

uncertainPairs()

Returns a list of pairs of records from the sample of record pairs tuples that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

> pair = deduper.uncertainPairs()
> print pair
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]

markPairs(labeled_examples)

Add users labeled pairs of records to training data and update the matching model

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters: labeled_examples (dict) – a dictionary with two keys, match and distinct the values are lists that can contain pairs of records.
labeled_examples = {'match'    : [],
'distinct' : [({'name' : 'Georgie Porgie'},
{'name' : 'Georgette Porgette'})]
}
deduper.markPairs(labeled_examples)

train([recall=0.95[, index_predicates=True]])

Learn final pairwise classifier and blocking rules. Requires that adequate training data has been already been provided.

Parameters: recall (float) – The proportion of true dupe pairs in our training data that that we the learned blocks must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare. recall should be a float between 0.0 and 1.0, the default is 0.95 index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data. Index predicates can be slower and take susbstantial memory. Defaults to True.
deduper.train()

writeTraining(file_obj)

Write json data that contains labeled examples to a file object.

Parameters: file_obj (file) – File object.
with open('./my_training.json', 'w') as f:
deduper.writeTraining(f)

readTraining(training_file)

Read training from previously saved training data file object

Parameters: training_file (file) – File object containing training data
with open('./my_training.json') as f:

cleanupTraining()

Delete data we used for training.

data_sample, training_pairs, training_data, and activeLearner can be very large objects. When you are done training you may want to free up the memory they use.

deduper.cleanupTraining()

classifier

By default, the classifier is a L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must be have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()

thresholdBlocks(blocks, recall_weight=1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.

For larger datasets, you will need to use the thresholdBlocks and matchBlocks. This methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks. .. code:: python

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments

Parameters: blocks (list) – See matchBlocks recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks[, threshold=.5])

Partitions blocked data and generates a sequence of clusters, where each cluster is a tuple of record ids

Keyword arguments

Parameters: blocks (list) – Sequence of records blocks. Each record block is a tuple containing records to compare. Each block should contain two or more records. Along with each record, there should also be information on the blocks that cover that record. For example, if we have three records: (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks: # Block 1 (Whole name) (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) # Block 2 (Whole name) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Block 3 (Whole address (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  So, the blocks you feed to matchBlocks should look like this, after filtering out the singleton block. blocks =(( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])) ), ( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}, set([])) ) ) deduper.matchBlocks(blocks)  Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again. threshold (float) – Number between 0 and 1 (default is .5). We will only consider as duplicates record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold. Lowering the number will increase recall, raising it will increase precision.
writeSettings(file_obj[, index=False])

Write a settings file that contains the data model and predicates to a file object.

Parameters: file_obj (file) – File object. bool (index) – Should the indexes of index predicates be saved. You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
deduper.writeSettings(f, indexes=True)

loaded_indices

Indicates whether indices for index predicates was loaded from a settings file.

blocker(data[, target=False])

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters: data (list) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items(). target (bool) – Indicates whether the data should be treated as the target data. This effects the behavior of search predicates. If target is set to True, an search predicate will return the value itself. If target is set to False the search predicate will return all possible values within the specified search distance. Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a “name” field; and we have a record like {“name”: “thomas”}. If the target is set to True then the predicate will return “thomas”. If target is set to False, then the blocker could return “thomas”, “tomas”, and “thoms”. By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.
> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]

blocker.index_fields

A dictionary of the Index Predicates that will used for blocking. The keys are the fields the predicates will operate on.

blocker.index(field_data, field)

Indexes the data from a field for use in a index predicate.

Parameters: field data (set) – The unique field values that appear in your data. field (string) – The name of the field
for field in deduper.blocker.index_fields :
field_data = set(record[field] for record in data)
deduper.index(field_data, field)


StaticGazetteer Objects¶

Class for gazetter matching using saved settings. If you have already trained a gazetteer instance, you can load the saved settings with StaticGazetteer.

This class has the same interface as StaticRecordLink except for a couple of methods.

class StaticGazetteer
index(data)

Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.

Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
unindex(data) :

Remove records from the index of records to match against.

Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
match(messy_data[, threshold=0.5[, n_matches=1[, generator=False]]])

Identifies pairs of records that could refer to the same entity, returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.

Parameters: messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names. threshold (float) – a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall, raising it will increase precision n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1 generator (bool) – when True, match will generate a sequence of possible matches, instead of a list. Defaults to False This makes match a lazy method.
threshold(messy_data, recall_weight = 1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of data.

Parameters: messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names. recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks, threshold=.5, n_matches=1)

Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids

Parameters: blocks (list) – Sequence of records blocks. Each record block is a tuple containing two sequences of records, the records from the messy data set and the records from the canonical dataset. Within each block there should be at least one record from each datasets. Along with each record, there should also be information on the blocks that cover that record. For example, if we have two records from a messy dataset one record from a canonical dataset: # Messy (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Canonical (3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’})  and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks: # Block 1 (Whole name) (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) # Block 2 (Whole name) (2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Block 3 (Whole address (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’})  So, the blocks you feed to matchBlocks should look like this, blocks =(( [(1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([]))], [(3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])] ), ( [(1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1]), ((2, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}, set([])], [((3, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])] ) ) linker.matchBlocks(blocks)  threshold (float) – Number between 0 and 1 (default is .5). We will only consider as duplicates record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold. Lowering the number will increase recall, raising it will increase precision. n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1
clustered_dupes = deduper.matchBlocks(blocked_data, threshold)

classifier

By default, the classifier is a L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must be have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()

thresholdBlocks(blocks, recall_weight=1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.

For larger datasets, you will need to use the thresholdBlocks and matchBlocks. This methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks. .. code:: python

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments

Parameters: blocks (list) – See matchBlocks recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks[, threshold=.5])

Partitions blocked data and generates a sequence of clusters, where each cluster is a tuple of record ids

Keyword arguments

Parameters: blocks (list) – Sequence of records blocks. Each record block is a tuple containing records to compare. Each block should contain two or more records. Along with each record, there should also be information on the blocks that cover that record. For example, if we have three records: (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks: # Block 1 (Whole name) (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) # Block 2 (Whole name) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}) # Block 3 (Whole address (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}) (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’})  So, the blocks you feed to matchBlocks should look like this, after filtering out the singleton block. blocks =(( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([])) ), ( (1, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (2, {‘name’ : ‘Pat’, ‘address’ : ‘123 Main’}, set([1])), (3, {‘name’ : ‘Sam’, ‘address’ : ‘123 Main’}, set([])) ) ) deduper.matchBlocks(blocks)  Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again. threshold (float) – Number between 0 and 1 (default is .5). We will only consider as duplicates record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold. Lowering the number will increase recall, raising it will increase precision.
writeSettings(file_obj[, index=False])

Write a settings file that contains the data model and predicates to a file object.

Parameters: file_obj (file) – File object. bool (index) – Should the indexes of index predicates be saved. You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
deduper.writeSettings(f, indexes=True)

loaded_indices

Indicates whether indices for index predicates was loaded from a settings file.

blocker(data[, target=False])

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters: data (list) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items(). target (bool) – Indicates whether the data should be treated as the target data. This effects the behavior of search predicates. If target is set to True, an search predicate will return the value itself. If target is set to False the search predicate will return all possible values within the specified search distance. Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a “name” field; and we have a record like {“name”: “thomas”}. If the target is set to True then the predicate will return “thomas”. If target is set to False, then the blocker could return “thomas”, “tomas”, and “thoms”. By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.
> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]

blocker.index_fields

A dictionary of the Index Predicates that will used for blocking. The keys are the fields the predicates will operate on.

blocker.index(field_data, field)

Indexes the data from a field for use in a index predicate.

Parameters: field data (set) – The unique field values that appear in your data. field (string) – The name of the field
for field in deduper.blocker.index_fields :
field_data = set(record[field] for record in data)
deduper.index(field_data, field)


Convenience Functions¶

consoleLabel(matcher)

Train a matcher instance (Dedupe or RecordLink) from the command line. Example

> deduper = dedupe.Dedupe(variables)
> deduper.sample(data)
> dedupe.consoleLabel(deduper)


Construct training data for consumption by the RecordLink.markPairs() from already linked datasets.

Parameters: data_1 (dict) – a dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names. data_2 (dict) – a dictionary of records from second dataset, same form as data_1 common_key (str) – the name of the record field that uniquely identifies a match training_size (int) – the rough limit of the number of training examples, defaults to 50000

Warning

Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.

trainingDataDedupe(data, common_key[, training_size])

Construct training data for consumption by the Dedupe.markPairs() from an already deduplicated dataset.

Parameters: data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names common_key (str) – the name of the record field that uniquely identifies a match training_size (int) – the rough limit of the number of training examples, defaults to 50000

Warning

Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.

canonicalize(record_cluster)

Constructs a canonical representation of a duplicate cluster by finding canonical values for each field

Parameters: record_cluster (list) – A list of records within a duplicate cluster, where the records are dictionaries with field names as keys and field values as values