Library Documentation

`Dedupe` Objects

class dedupe.Dedupe(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]

Class for active learning deduplication. Use deduplication when you have data that can contain multiple records that can all refer to the same entity.

Parameters

variable_definition (Collection[VariableDefinition]) – A list of dictionaries describing the variables that will be used for training a model. See Variable Definitions
num_cores (int | None) – The number of cpus to use for parallel processing. If set to None, uses all cpus available on the machine. If set to 0, then multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
matcher = dedupe.Dedupe(variables)

prepare_training(data, training_file=None, sample_size=1500, blocked_proportion=0.9)[source]

Initialize the active learner with your data and, optionally, existing training data.

Parameters

data (Data) – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
training_file (TextIO | None) – file object containing training data
sample_size (int) – Size of the sample to draw
blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.
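As an illustration of the data argument, the sketch below builds the expected dictionary of records from CSV text using only the standard library. The read_records helper, the Id column, and the sample rows are hypothetical, and converting empty strings to None is an assumption about how missing values are represented for fields declared with 'has missing'.

```python
import csv
import io


def read_records(csv_file, id_column):
    """Return {record_id: {field_name: value}} from a CSV file object."""
    records = {}
    for row in csv.DictReader(csv_file):
        record_id = row.pop(id_column)
        # Assumption: empty cells become None so missing values are explicit.
        records[record_id] = {
            field: ((value or "").strip() or None) for field, value in row.items()
        }
    return records


# Hypothetical sample data, using the field names from the example above.
sample = io.StringIO(
    "Id,Site name,Address,Zip,Phone\n"
    "1,Acme Day Care,123 Main St,60601,\n"
    "2,ACME Daycare,123 Main Street,60601,555-0100\n"
)
data_d = read_records(sample, id_column="Id")
```

The resulting data_d maps each record id to a dictionary keyed by field name, which is the shape the data parameter describes.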

Examples

>>> matcher.prepare_training(data_d, sample_size=150000, blocked_proportion=.5)

>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_d, training_file=f)

uncertain_pairs()

Returns a list of record pairs, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

Examples

>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]

mark_pairs(labeled_pairs)

Add user-labeled pairs of records to the training data and update the matching model.

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters: labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct, whose values are lists of record pairs

Examples

>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)

Note

mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.

If you have existing training data, you should likely format the data into the right form and supply the training data to the prepare_training() method with the training_file argument.

If that is not possible or desirable, you can use mark_pairs() to train a linker with existing data. However, you must ensure that every record that appears in the labeled_pairs argument appears in either the data or training file supplied to the prepare_training() method.
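The interplay between uncertain_pairs() and mark_pairs() can be sketched as a small labeling loop. The label_uncertain_pairs helper and its is_match callback are hypothetical names, not part of dedupe; the loop relies only on the two methods documented above and on the match and distinct keys of labeled_pairs.

```python
def label_uncertain_pairs(matcher, is_match, rounds=10):
    """Label uncertain pairs for a fixed number of rounds.

    `matcher` is any object exposing uncertain_pairs() and mark_pairs();
    `is_match` is a hypothetical callback that decides whether two records
    refer to the same entity (for example, by asking a user).
    Returns the number of pairs that were labeled.
    """
    labeled = 0
    for _ in range(rounds):
        labels = {"match": [], "distinct": []}
        for record_1, record_2 in matcher.uncertain_pairs():
            key = "match" if is_match(record_1, record_2) else "distinct"
            labels[key].append((record_1, record_2))
            labeled += 1
        # Feed the new labels back so the next round asks about new pairs.
        matcher.mark_pairs(labels)
    return labeled
```

In an interactive tool, is_match would typically display both records and wait for a yes or no answer.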

train(recall=1.0, index_predicates=True)

Learn final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.

Parameters

recall (float) –
The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.

recall should be a float between 0.0 and 1.0.
index_predicates (bool) – Whether dedupe should consider predicates that rely upon indexing the data. Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true dupes are not blocked together.

write_training(file_obj)

Write a JSON file that contains labeled examples

Parameters: file_obj (TextIO) – file object to write training data to

Examples

>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)

write_settings(file_obj)

Write a settings file containing the data model and predicates to a file object

Parameters: file_obj (BinaryIO) – file object to write settings data into

Examples

>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)

cleanup_training(): Clean up data we used for training. Free up memory.

partition(data, threshold=0.5)

Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity, and each confidence score measures our confidence that a particular record belongs in the cluster.

For details on the confidence score, see dedupe.Dedupe.cluster().

This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().

Parameters

data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
threshold –
Number between 0 and 1. We will only put records together into clusters if the cophenetic similarity of the cluster is greater than the threshold.

Lowering the number will increase recall, raising it will increase precision

Examples

>>> duplicates = matcher.partition(data, threshold=0.5)
>>> duplicates
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]

`StaticDedupe` Objects

class dedupe.StaticDedupe(settings_file, num_cores=None, in_memory=False, **kwargs)[source]

Class for deduplication using saved settings. If you have already trained a Dedupe object and saved the settings, you can load the saved settings with StaticDedupe.

Parameters

settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.
num_cores (int | None) – The number of cpus to use for parallel processing, defaults to the number of cpus available on the machine. If set to 0, then multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticDedupe(f)

partition(data, threshold=0.5)

Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity, and each confidence score measures our confidence that a particular record belongs in the cluster.

For details on the confidence score, see dedupe.Dedupe.cluster().

This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().

Parameters

data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
threshold –
Number between 0 and 1. We will only put records together into clusters if the cophenetic similarity of the cluster is greater than the threshold.

Lowering the number will increase recall, raising it will increase precision

Examples

>>> duplicates = matcher.partition(data, threshold=0.5)
>>> duplicates
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]

`RecordLink` Objects

class dedupe.RecordLink(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]

Class for active learning record linkage.

Use RecordLink when you have two datasets that you want to join.

Parameters

variable_definition (Collection[VariableDefinition]) – A list of dictionaries describing the variables that will be used for training a model. See Variable Definitions
num_cores (int | None) – The number of cpus to use for parallel processing. If set to None, uses all cpus available on the machine. If set to 0, then multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
matcher = dedupe.RecordLink(variables)

prepare_training(data_1, data_2, training_file=None, sample_size=1500, blocked_proportion=0.9)

Initialize the active learner with your data and, optionally, existing training data.

Parameters

data_1 (Data) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names
data_2 (Data) – Dictionary of records from second dataset, same form as data_1
training_file (TextIO | None) – file object containing training data
sample_size (int) – The size of the sample to draw.
blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.

Examples

>>> matcher.prepare_training(data_1, data_2, sample_size=150000)

or

>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_1, data_2, training_file=f)

uncertain_pairs()

Returns a list of record pairs, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

Examples

>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]

mark_pairs(labeled_pairs)

Add user-labeled pairs of records to the training data and update the matching model.

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters: labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct, whose values are lists of record pairs

Examples

>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)

Note

mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.

If you have existing training data, you should likely format the data into the right form and supply the training data to the prepare_training() method with the training_file argument.

If that is not possible or desirable, you can use mark_pairs() to train a linker with existing data. However, you must ensure that every record that appears in the labeled_pairs argument appears in either the data or training file supplied to the prepare_training() method.

train(recall=1.0, index_predicates=True)

Learn final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.

Parameters

recall (float) –
The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.

recall should be a float between 0.0 and 1.0.
index_predicates (bool) – Whether dedupe should consider predicates that rely upon indexing the data. Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true dupes are not blocked together.

write_training(file_obj)

Write a JSON file that contains labeled examples

Parameters: file_obj (TextIO) – file object to write training data to

Examples

>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)

write_settings(file_obj)

Write a settings file containing the data model and predicates to a file object

Parameters: file_obj (BinaryIO) – file object to write settings data into

Examples

>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)

cleanup_training(): Clean up data we used for training. Free up memory.

join(data_1, data_2, threshold=0.5, constraint='one-to-one')

Identifies pairs of records that refer to the same entity.

Returns pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.

This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().

Parameters

data_1 (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names
data_2 (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records from second dataset, same form as data_1
threshold (float) –
Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision
constraint (Literal['one-to-one', 'many-to-one', 'many-to-many']) –
What type of constraint to put on a join.

’one-to-one’
Every record in data_1 can match at most one record from data_2 and every record from data_2 can match at most one record from data_1. This is good for when both data_1 and data_2 are from different sources and you are interested in matching across the sources. If, individually, data_1 or data_2 have many duplicates you will not get good matches.

’many-to-one’
Every record in data_1 can match at most one record from data_2, but more than one record from data_1 can match to the same record in data_2. This is good for when data_2 is a lookup table and data_1 is messy, such as geocoding or matching against golden records.

’many-to-many’
Every record in data_1 can match multiple records in data_2 and vice versa. This is like a SQL inner join.
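The three constraints can be illustrated with a simplified greedy filter over already-scored pairs. This is only a sketch of the semantics described above; apply_constraint is a hypothetical helper and is not how join() is implemented.

```python
def apply_constraint(scored_pairs, constraint):
    """Filter ((id_1, id_2), score) tuples according to a join constraint.

    Pairs are visited from highest to lowest score, so the best-scoring
    pair wins whenever a constraint forces a choice.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    if constraint == "many-to-many":
        return ranked  # no restriction: every scored pair is kept

    used_1, used_2 = set(), set()
    kept = []
    for (id_1, id_2), score in ranked:
        if id_1 in used_1:
            continue  # a data_1 record may match at most one data_2 record
        if constraint == "one-to-one" and id_2 in used_2:
            continue  # one-to-one also restricts the data_2 side
        used_1.add(id_1)
        used_2.add(id_2)
        kept.append(((id_1, id_2), score))
    return kept


# Hypothetical scored pairs between data_1 ids (a*) and data_2 ids (b*).
scored = [(("a1", "b1"), 0.9), (("a2", "b1"), 0.8), (("a1", "b2"), 0.6)]
one_to_one = apply_constraint(scored, "one-to-one")
many_to_one = apply_constraint(scored, "many-to-one")
many_to_many = apply_constraint(scored, "many-to-many")
```

With this input, one-to-one keeps only the a1/b1 pair, many-to-one additionally keeps a2/b1, and many-to-many keeps all three pairs.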

Examples

>>> links = matcher.join(data_1, data_2, threshold=0.5)
>>> list(links)
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((10, 11), 0.899)
]

`StaticRecordLink` Objects

class dedupe.StaticRecordLink(settings_file, num_cores=None, in_memory=False, **kwargs)[source]

Class for record linkage using saved settings. If you have already trained a RecordLink instance, you can load the saved settings with StaticRecordLink.

Parameters

settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.
num_cores (int | None) – The number of cpus to use for parallel processing, defaults to the number of cpus available on the machine. If set to 0, then multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticRecordLink(f)

join(data_1, data_2, threshold=0.5, constraint='one-to-one')

Identifies pairs of records that refer to the same entity.

Returns pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.

This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().

Parameters

data_1 (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names
data_2 (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records from second dataset, same form as data_1
threshold (float) –
Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision
constraint (Literal['one-to-one', 'many-to-one', 'many-to-many']) –
What type of constraint to put on a join.

’one-to-one’
Every record in data_1 can match at most one record from data_2 and every record from data_2 can match at most one record from data_1. This is good for when both data_1 and data_2 are from different sources and you are interested in matching across the sources. If, individually, data_1 or data_2 have many duplicates you will not get good matches.

’many-to-one’
Every record in data_1 can match at most one record from data_2, but more than one record from data_1 can match to the same record in data_2. This is good for when data_2 is a lookup table and data_1 is messy, such as geocoding or matching against golden records.

’many-to-many’
Every record in data_1 can match multiple records in data_2 and vice versa. This is like a SQL inner join.

Examples

>>> links = matcher.join(data_1, data_2, threshold=0.5)
>>> list(links)
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((10, 11), 0.899)
]

`Gazetteer` Objects

class dedupe.Gazetteer(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]

Class for active learning gazetteer matching.

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’. This class is useful for such tasks as matching messy addresses against a clean list

Parameters

variable_definition (Collection[VariableDefinition]) – A list of dictionaries describing the variables that will be used for training a model. See Variable Definitions
num_cores (int | None) – The number of cpus to use for parallel processing. If set to None, uses all cpus available on the machine. If set to 0, then multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
matcher = dedupe.Gazetteer(variables)

prepare_training(data_1, data_2, training_file=None, sample_size=1500, blocked_proportion=0.9)

Initialize the active learner with your data and, optionally, existing training data.

Parameters

data_1 (Data) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names
data_2 (Data) – Dictionary of records from second dataset, same form as data_1
training_file (TextIO | None) – file object containing training data
sample_size (int) – The size of the sample to draw.
blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.

Examples

>>> matcher.prepare_training(data_1, data_2, sample_size=150000)

or

>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_1, data_2, training_file=f)

uncertain_pairs()

Returns a list of record pairs, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

Examples

>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]

mark_pairs(labeled_pairs)

Add user-labeled pairs of records to the training data and update the matching model.

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters: labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct, whose values are lists of record pairs

Examples

>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)

Note

mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.

If you have existing training data, you should likely format the data into the right form and supply the training data to the prepare_training() method with the training_file argument.

If that is not possible or desirable, you can use mark_pairs() to train a linker with existing data. However, you must ensure that every record that appears in the labeled_pairs argument appears in either the data or training file supplied to the prepare_training() method.

train(recall=1.0, index_predicates=True)

Learn final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.

Parameters

recall (float) –
The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.

recall should be a float between 0.0 and 1.0.
index_predicates (bool) – Whether dedupe should consider predicates that rely upon indexing the data. Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true dupes are not blocked together.

write_training(file_obj)

Write a JSON file that contains labeled examples

Parameters: file_obj (TextIO) – file object to write training data to

Examples

>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)

write_settings(file_obj)

Write a settings file containing the data model and predicates to a file object

Parameters: file_obj (BinaryIO) – file object to write settings data into

Examples

>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)

cleanup_training(): Clean up data we used for training. Free up memory.

index(data)

Add records to the index of records to match against. If a record in data has the same key as a previously indexed record, the old record will be replaced.

Parameters: data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

unindex(data)

Remove records from the index of records to match against.

Parameters: data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

search(data, threshold=0.0, n_matches=1, generator=False)

Identifies pairs of records that could refer to the same entity. Returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.

Parameters

data – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
threshold –
a number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision
n_matches – the maximum number of possible matches from the indexed records to return for each record in data. If set to None, all possible matches above the threshold will be returned.
generator – when True, search will generate a sequence of possible matches, instead of a list.

Examples

>>> matches = matcher.search(data, threshold=0.5, n_matches=2)
>>> print(matches)
[
    (((1, 6), 0.72), ((1, 8), 0.6)),
    (((2, 7), 0.72),),
    (((3, 6), 0.72), ((3, 8), 0.65)),
    (((4, 6), 0.96), ((4, 5), 0.63)),
]

`StaticGazetteer` Objects

class dedupe.StaticGazetteer(settings_file, num_cores=None, in_memory=False, **kwargs)[source]

Class for gazetteer matching using saved settings.

If you have already trained a Gazetteer instance, you can load the saved settings with StaticGazetteer.

Parameters

settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.
num_cores (int | None) – The number of cpus to use for parallel processing, defaults to the number of cpus available on the machine. If set to 0, then multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticGazetteer(f)

index(data)

Add records to the index of records to match against. If a record in data has the same key as a previously indexed record, the old record will be replaced.

Parameters: data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

unindex(data)

Remove records from the index of records to match against.

Parameters: data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

search(data, threshold=0.0, n_matches=1, generator=False)

Identifies pairs of records that could refer to the same entity. Returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.

Parameters

data – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
threshold –
a number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision
n_matches – the maximum number of possible matches from the indexed records to return for each record in data. If set to None, all possible matches above the threshold will be returned.
generator – when True, search will generate a sequence of possible matches, instead of a list.

Examples

>>> matches = matcher.search(data, threshold=0.5, n_matches=2)
>>> print(matches)
[
    (((1, 6), 0.72), ((1, 8), 0.6)),
    (((2, 7), 0.72),),
    (((3, 6), 0.72), ((3, 8), 0.65)),
    (((4, 6), 0.96), ((4, 5), 0.63)),
]

blocks(data)

Yield groups of pairs of records that share fingerprints.

Each group contains one record from data paired with the indexed records that it shares a fingerprint with.

Each pair within and among blocks will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly many_to_n(), assume that every pair of records is compared no more than once.

Parameters: data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

Examples

>>> blocks = matcher.blocks(data)
>>> print(list(blocks))
[
    [
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (8, {"name": "Pat", "address": "123 Main"}),
        ),
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (9, {"name": "Sam", "address": "123 Main"}),
        ),
    ],
    [
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (5, {"name": "Pam", "address": "2600 Stat"}),
        ),
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (7, {"name": "Sammy", "address": "123 Main"}),
        ),
    ],
]

score(blocks)

Scores groups of pairs of records. Yields structured numpy arrays representing pairs of records in the group and the associated probability that the pair is a match.

Parameters: blocks (Union[Iterator[List[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]]], Iterator[List[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]]) – Iterator of blocks of records

many_to_n(score_blocks, threshold=0.0, n_matches=1)

For each group of scored pairs, yield the highest scoring N pairs

Parameters

score_blocks (Iterable[Union[memmap, ndarray]]) – Iterator of numpy structured arrays, each with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.
threshold (float) –
Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision
n_matches (int) – How many top scoring pairs to select per group
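The selection that many_to_n() describes can be sketched in plain Python. This is a simplified illustration under stated assumptions: top_n_per_group is a hypothetical helper, and each group is a plain list of ((id_1, id_2), score) tuples rather than the numpy structured arrays the real method expects.

```python
def top_n_per_group(score_blocks, threshold=0.0, n_matches=1):
    """For each group of scored pairs, yield the n_matches best pairs.

    Only pairs scoring above `threshold` are considered; groups with no
    pair above the threshold yield nothing.
    """
    for block in score_blocks:
        above = [(pair, score) for pair, score in block if score > threshold]
        above.sort(key=lambda item: item[1], reverse=True)
        if above:
            yield tuple(above[:n_matches])


# Hypothetical scored groups: one group per messy record.
groups = [
    [((1, 6), 0.72), ((1, 8), 0.6), ((1, 9), 0.3)],
    [((2, 7), 0.72)],
    [((3, 5), 0.1)],
]
matches = list(top_n_per_group(groups, threshold=0.5, n_matches=2))
```

Here the third group is dropped entirely because its only pair falls below the threshold.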

Lower Level Classes and Methods

With the methods documented above, you can work with data into the millions of records. However, if you are working with larger data, you may not be able to load all your data into memory. In that case, you’ll need to interact with some of the lower level classes and methods.
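As a rough sketch of how the lower level methods documented in this section fit together, the helper below chains pairs(), score(), and cluster() by hand. The partition_in_steps name is hypothetical, and matcher is assumed to be an already trained Dedupe or a StaticDedupe instance. For data that does not fit in memory, the call to pairs() is the step you would replace with your own pair generator.

```python
def partition_in_steps(matcher, data, threshold=0.5):
    """Cluster records by calling the lower level methods one at a time.

    `matcher` is assumed to expose pairs(), score(), and cluster() as
    documented in this section.
    """
    # 1. Candidate pairs: records that share at least one fingerprint.
    pairs = matcher.pairs(data)
    # 2. Estimated probability that each candidate pair is a match.
    scores = matcher.score(pairs)
    # 3. Group records into clusters based on the pairwise scores.
    return list(matcher.cluster(scores, threshold=threshold))
```

Records that never appear in any cluster are not included in the returned list, so callers that need singletons should add them separately.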

Dedupe and StaticDedupe

class dedupe.Dedupe[source]

fingerprinter

Instance of dedupe.blocking.Fingerprinter class if train() has been run, else None.

pairs(data)

Yield pairs of records that share common fingerprints.

Each pair will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly cluster(), assume that every pair of records is compared no more than once.

Parameters

data (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

Examples
>>> pairs = matcher.pairs(data)
>>> list(pairs)
[
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (2, {"name": "Pat", "address": "123 Main"}),
    ),
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (3, {"name": "Sam", "address": "123 Main"}),
    ),
]
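
The at-most-once requirement matters most when overriding pairs(). The following is a minimal sketch of one way to honor it, assuming each block is a list of record ids that share a fingerprint. `unique_pairs` is a hypothetical helper, not part of dedupe, and its in-memory `seen` set is only suitable for small data.

```python
from itertools import combinations


def unique_pairs(blocks, data):
    """Yield each record pair at most once, even if two records share several blocks."""
    seen = set()
    for record_ids in blocks:
        # Sorting gives every pair a canonical (smaller id, larger id) order.
        for id_a, id_b in combinations(sorted(record_ids), 2):
            if (id_a, id_b) in seen:
                continue
            seen.add((id_a, id_b))
            yield (id_a, data[id_a]), (id_b, data[id_b])


data = {
    1: {"name": "Pat", "address": "123 Main"},
    2: {"name": "Pat", "address": "123 Main"},
    3: {"name": "Sam", "address": "123 Main"},
}
# Records 1 and 2 share two fingerprints, so they co-occur in two blocks.
blocks = [[1, 2], [1, 2, 3]]
pairs = list(unique_pairs(blocks, data))
pair_ids = [(a[0], b[0]) for a, b in pairs]
print(pair_ids)
```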
score(pairs)

Scores pairs of records. Returns pairs of record ids together with the associated probability that each pair of records is a match.

Parameters

pairs (Union[Iterator[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]], Iterator[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]) – Iterator of pairs of records

cluster(scores, threshold=0.5)

From the similarity scores of pairs of records, decide which groups of records are all referring to the same entity.

Yields tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity, and each confidence score is a measure of our confidence that a particular record belongs in the cluster.

Each confidence score is a measure of how similar the record is to the other records in the cluster. Let \(\phi(i,j)\) be the pair-wise similarity between records \(i\) and \(j\). Let \(N\) be the number of records in the cluster.

\[\text{confidence score}_i = 1 - \sqrt {\frac{\sum_{j}^N (1 - \phi(i,j))^2}{N -1}}\]

This measure is similar to the average squared distance between the focal record and the other records in the cluster. These scores can be combined to give a total score for the cluster.

\[\text{cluster score} = 1 - \sqrt { \frac{\sum_i^N(1 - \mathrm{score}_i)^2 \cdot (N - 1) } { 2 N^2}}\]

Parameters

scores (Union[memmap, ndarray]) –
a numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

For each pair, the smaller id should be first.

threshold (float) –
Number between 0 and 1. We will only put records together into clusters if the cophenetic similarity of the cluster is greater than the threshold.

Lowering the number will increase recall, raising it will increase precision

Examples
>>> pairs = matcher.pairs(data)
>>> scores = matcher.score(pairs)
>>> clusters = matcher.cluster(scores)
>>> list(clusters)
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]
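
The confidence and cluster score formulas given for cluster() can be worked through by hand. The following is a minimal sketch under two assumptions: the sum over j runs over the other N - 1 records in the cluster (consistent with the N - 1 denominator), and pairwise similarities are held in a hypothetical `similarity` dictionary keyed by frozensets of record ids. It illustrates the arithmetic only and is not the library's code.

```python
import math


def confidence_scores(similarity, members):
    """Per-record confidence: 1 - sqrt(mean squared distance to the other members)."""
    n = len(members)
    scores = {}
    for i in members:
        squared = sum(
            (1.0 - similarity[frozenset((i, j))]) ** 2 for j in members if j != i
        )
        scores[i] = 1.0 - math.sqrt(squared / (n - 1))
    return scores


def cluster_score(scores):
    """Combine the per-record confidence scores into a single cluster score."""
    n = len(scores)
    total = sum((1.0 - s) ** 2 for s in scores.values())
    return 1.0 - math.sqrt(total * (n - 1) / (2 * n**2))


similarity = {
    frozenset((1, 2)): 0.9,
    frozenset((1, 3)): 0.8,
    frozenset((2, 3)): 0.7,
}
scores = confidence_scores(similarity, [1, 2, 3])
total = cluster_score(scores)
print({k: round(v, 3) for k, v in scores.items()}, round(total, 3))
```

Record 1, which is close to both other records, gets the highest confidence (about 0.842), while record 3 gets the lowest (about 0.745).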

class dedupe.StaticDedupe[source]

fingerprinter

Instance of dedupe.blocking.Fingerprinter class

pairs(data)

Same as dedupe.Dedupe.pairs()

score(pairs)

Same as dedupe.Dedupe.score()

cluster(scores, threshold=0.5)

Same as dedupe.Dedupe.cluster()

RecordLink and StaticRecordLink

class dedupe.RecordLink[source]

fingerprinter

Instance of dedupe.blocking.Fingerprinter class if train() has been run, else None.

pairs(data_1, data_2)

Yield pairs of records that share common fingerprints.

Each pair will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly one_to_one() and many_to_one(), assume that every pair of records is compared no more than once.

Parameters

data_1 (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names

data_2 (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records from second dataset, same form as data_1

Examples
>>> pairs = matcher.pairs(data_1, data_2)
>>> list(pairs)
[
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (2, {"name": "Pat", "address": "123 Main"}),
    ),
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (3, {"name": "Sam", "address": "123 Main"}),
    ),
]
score(pairs)

Scores pairs of records. Returns pairs of record ids together with the associated probability that each pair of records is a match.

Parameters

pairs (Union[Iterator[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]], Iterator[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]) – Iterator of pairs of records

one_to_one(scores, threshold=0.0)

From the similarity scores of pairs of records, decide which pairs refer to the same entity.

Every record in data_1 can match at most one record from data_2 and every record from data_2 can match at most one record from data_1. See https://en.wikipedia.org/wiki/Injective_function.

This method is good for when both data_1 and data_2 are from different sources and you are interested in matching across the sources. If, individually, data_1 or data_2 have many duplicates you will not get good matches.

Yields pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.

Parameters

scores (Union[memmap, ndarray]) –
a numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

threshold (float) –
Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision

Examples
>>> pairs = matcher.pairs(data_1, data_2)
>>> scores = matcher.score(pairs)
>>> links = matcher.one_to_one(scores, threshold=0.5)
>>> list(links)
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((10, 11), 0.899)
]
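
The one-to-one constraint can be illustrated with a greedy pass over the scored pairs. This is a minimal sketch only: `greedy_one_to_one` is a hypothetical helper, greedy selection is just one simple way to satisfy the constraint, and it is not guaranteed to maximize the total score or to match what the library does internally.

```python
def greedy_one_to_one(scored_pairs, threshold=0.0):
    """Pick pairs by descending score so that no record is used twice on either side."""
    used_left, used_right = set(), set()
    links = []
    for (left, right), score in sorted(scored_pairs, key=lambda item: -item[1]):
        if score <= threshold:
            break  # all remaining pairs score at or below the threshold
        if left in used_left or right in used_right:
            continue  # one of the two records is already linked
        used_left.add(left)
        used_right.add(right)
        links.append(((left, right), score))
    return links


# (id from data_1, id from data_2) with a match probability.
scored_pairs = [((1, 2), 0.79), ((1, 3), 0.60), ((4, 2), 0.70), ((4, 5), 0.72)]
links = greedy_one_to_one(scored_pairs, threshold=0.5)
print(links)
```

Here record 4 loses its candidate match with record 2, because record 2 is already linked to record 1 with a higher score.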
many_to_one(scores, threshold=0.0)

From the similarity scores of pairs of records, decide which pairs refer to the same entity.

Every record in data_1 can match at most one record from data_2, but more than one record from data_1 can match to the same record in data_2. See https://en.wikipedia.org/wiki/Surjective_function

This method is good for when data_2 is a lookup table and data_1 is messy, such as geocoding or matching against golden records.

Yields pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.

Parameters

scores (Union[memmap, ndarray]) –
a numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

threshold (float) –
Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision

Examples
>>> pairs = matcher.pairs(data_1, data_2)
>>> scores = matcher.score(pairs)
>>> links = matcher.many_to_one(scores, threshold=0.5)
>>> print(list(links))
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((7, 2), 0.623),
    ((10, 11), 0.899)
]
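
The many-to-one constraint amounts to keeping the best candidate for each record in data_1. The following is a minimal sketch with a hypothetical helper, `best_match_per_left`, and is not the library's implementation.

```python
def best_match_per_left(scored_pairs, threshold=0.0):
    """Keep, for every data_1 record, only its highest scoring data_2 candidate."""
    best = {}
    for (left, right), score in scored_pairs:
        if score <= threshold:
            continue
        if left not in best or score > best[left][1]:
            best[left] = (right, score)
    return [((left, right), score) for left, (right, score) in best.items()]


# (id from data_1, id from data_2) with a match probability.
scored_pairs = [((1, 2), 0.79), ((1, 3), 0.60), ((7, 2), 0.623), ((7, 5), 0.40)]
links = best_match_per_left(scored_pairs, threshold=0.5)
print(links)
```

Unlike the one-to-one case, records 1 and 7 from data_1 are both allowed to link to record 2 in data_2.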

class dedupe.StaticRecordLink[source]

fingerprinter: Instance of dedupe.blocking.Fingerprinter class

pairs(data_1, data_2): Same as dedupe.RecordLink.pairs()

score(pairs): Same as dedupe.RecordLink.score()

one_to_one(scores, threshold=0.0): Same as dedupe.RecordLink.one_to_one()

many_to_one(scores, threshold=0.0): Same as dedupe.RecordLink.many_to_one()

Gazetteer and StaticGazetteer

class dedupe.Gazetteer[source]

fingerprinter

Instance of dedupe.blocking.Fingerprinter class if train() has been run, else None.

blocks(data)

Yield groups of pairs of records that share fingerprints.

Each group contains one record from data paired with each of the indexed records that it shares a fingerprint with.

Each pair within and among blocks will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly many_to_n(), assume that every pair of records is compared no more than once.

Parameters

data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

Examples
>>> blocks = matcher.blocks(data)
>>> print(list(blocks))
[
    [
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (8, {"name": "Pat", "address": "123 Main"}),
        ),
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (9, {"name": "Sam", "address": "123 Main"}),
        ),
    ],
    [
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (5, {"name": "Pam", "address": "2600 Stat"}),
        ),
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (7, {"name": "Sammy", "address": "123 Main"}),
        ),
    ],
]
score(blocks)

Scores groups of pairs of records. Yields structured numpy arrays representing pairs of records in the group and the associated probability that the pair is a match.

Parameters

blocks (Union[Iterator[List[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]]], Iterator[List[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]]) – Iterator of blocks of records

many_to_n(score_blocks, threshold=0.0, n_matches=1)

For each group of scored pairs, yield the highest scoring N pairs

Parameters

score_blocks (Iterable[Union[memmap, ndarray]]) –
Iterator of numpy structured arrays, each with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

threshold (float) –
Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

Lowering the number will increase recall, raising it will increase precision

n_matches (int) – How many top scoring pairs to select per group

class dedupe.StaticGazetteer

fingerprinter

Instance of dedupe.blocking.Fingerprinter class

blocks(data)

Same as dedupe.Gazetteer.blocks()

score(blocks)

Same as dedupe.Gazetteer.score()

many_to_n(score_blocks, threshold=0.0, n_matches=1)

Same as dedupe.Gazetteer.many_to_n()

`Fingerprinter` Objects

class dedupe.blocking.Fingerprinter(predicates)[source]

Takes in a record and returns all blocks that record belongs to

__call__(records, target=False)[source]

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters

records – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().
target –
Indicates whether the data should be treated as the target data. This affects the behavior of search predicates. If target is set to True, a search predicate will return the value itself. If target is set to False, the search predicate will return all possible values within the specified search distance.

Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a "name" field; and we have a record like {"name": "thomas"}. If the target is set to True then the predicate will return "thomas". If target is set to False, then the blocker could return "thomas", "tomas", and "thoms". By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.

> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.fingerprinter(data)
> print(list(blocked_ids))
[('foo:1', 1), ..., ('bar:1', 100)]
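
The effect of the target argument can be mimicked with a toy predicate. The following is a minimal sketch, not dedupe's LevenshteinSearchPredicate: `toy_search_keys` is a hypothetical stand-in that, for brevity, only considers single-character deletions, which covers just part of Levenshtein distance 1.

```python
def deletion_neighbours(value):
    """All strings reachable from value by deleting exactly one character."""
    return {value[:i] + value[i + 1:] for i in range(len(value))}


def toy_search_keys(value, target=False):
    """Toy stand-in for a distance-1 search predicate (deletions only)."""
    if target:
        return {value}  # target data: block only on the value itself
    return {value} | deletion_neighbours(value)  # other data: expand the search


expanded = toy_search_keys("thomas", target=False)
exact = toy_search_keys("tomas", target=True)

# The two records still land in a common block via the shared key "tomas".
shared = expanded & exact
print(sorted(expanded), shared)
```

Only one side is expanded, so the number of generated keys stays small while near matches are still blocked together.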

index_fields: dict[str, IndexList]: A dictionary of all the fingerprinter methods that use an index of data field values. The keys are the field names, which can be useful to know for indexing the data.

index(docs, field)[source]

Add docs to the indices used by fingerprinters.

Some fingerprinter methods depend upon having an index of values that a field may have in the data. This method adds those values to the index. If you don’t have any fingerprinter methods that use an index, this method will do nothing.

Parameters

docs (Union[Iterable[str], Iterable[Iterable[str]]]) – an iterator of values from your data to index. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.
field (str) – fieldname or key associated with the values you are indexing

unindex(docs, field)[source]

Remove docs from indices used by fingerprinters

Parameters

docs (Union[Iterable[str], Iterable[Iterable[str]]]) – an iterator of values from your data to remove. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.
field (str) – fieldname or key associated with the values you are unindexing

reset_indices()[source]: Fingerprinter indices can take up a lot of memory. If you are done with blocking, this method will reset the indices to free up memory. If you need to block again, the data will need to be re-indexed.

Convenience Functions

dedupe.console_label(deduper)[source]

Train a matcher instance (Dedupe, RecordLink, or Gazetteer) from the command line. Example

> deduper = dedupe.Dedupe(variables)
> deduper.prepare_training(data)
> dedupe.console_label(deduper)

dedupe.training_data_dedupe(data, common_key, training_size=50000)[source]

Construct training data for consumption by the mark_pairs() method from an already deduplicated dataset.

Parameters

data – Dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field names
common_key – The name of the record field that uniquely identifies a match
training_size – the rough limit of the number of training examples, defaults to 50000

Note

Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
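
The common-key rule in the note can be illustrated with a small sketch. `pairs_from_common_key` and the `entity_id` field are hypothetical, and the sketch enumerates every pair exhaustively rather than honoring a training_size limit, so it is not a substitute for training_data_dedupe().

```python
from itertools import combinations


def pairs_from_common_key(data, common_key):
    """Label every record pair as match or distinct by comparing the common key."""
    training = {"match": [], "distinct": []}
    for (_, rec_a), (_, rec_b) in combinations(data.items(), 2):
        label = "match" if rec_a[common_key] == rec_b[common_key] else "distinct"
        training[label].append((rec_a, rec_b))
    return training


data = {
    1: {"name": "Georgie Porgie", "entity_id": "a"},
    2: {"name": "George Porgie", "entity_id": "a"},
    3: {"name": "Georgette Porgette", "entity_id": "b"},
}
training = pairs_from_common_key(data, "entity_id")
print(len(training["match"]), len(training["distinct"]))
```

The resulting dictionary has the same match and distinct keys described for write_training().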

dedupe.training_data_link(data_1, data_2, common_key, training_size=50000)[source]

Construct training data for consumption by the mark_pairs() method from already linked datasets.

Parameters

data_1 – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names
data_2 – Dictionary of records from second dataset, same form as data_1
common_key – The name of the record field that uniquely identifies a match
training_size – the rough limit of the number of training examples, defaults to 50000

Note

Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.

dedupe.canonicalize(record_cluster)[source]

Constructs a canonical representation of a duplicate cluster by finding canonical values for each field

Parameters: record_cluster – A list of records within a duplicate cluster, where the records are dictionaries with field names as keys and field values as values

dedupe.read_training(training_file)[source]

Read training data from a previously built training data file object

Parameters: training_file (TextIO) – file object containing the training data
Returns: A dictionary with two keys, match and distinct. See the inverse, write_training().

dedupe.write_training(labeled_pairs, file_obj)[source]

Write a JSON file that contains labeled examples

Parameters

labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists that can contain pairs of records
file_obj (TextIO) – file object to write training data to

examples = {
    "match": [
        ({'name' : 'Georgie Porgie'}, {'name' : 'George Porgie'}),
    ],
    "distinct": [
        ({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'}),
    ],
}
with open('training.json', 'w') as f:
    dedupe.write_training(examples, f)
