Library Documentation
Dedupe Objects
- class dedupe.Dedupe(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]
Class for active learning deduplication. Use deduplication when you have data that can contain multiple records that all refer to the same entity.
- Parameters:
variable_definition (Collection[Variable]) – A list of Variable objects describing the variables that will be used for training a model. See Variable Definitions.
num_cores (int | None) – The number of CPUs to use for parallel processing. If set to None, uses all CPUs available on the machine. If set to 0, multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ':memory:' option rather than writing to disk. May be faster if sufficient memory is available.
Warning
If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    dedupe.variables.String("Site name"),
    dedupe.variables.String("Address"),
    dedupe.variables.String("Zip", has_missing=True),
    dedupe.variables.String("Phone", has_missing=True),
]
deduper = dedupe.Dedupe(variables)
- prepare_training(data, training_file=None, sample_size=1500, blocked_proportion=0.9)[source]
Initialize the active learner with your data and, optionally, existing training data.
Sets up the learner.
- Parameters:
data (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names.
training_file (TextIO | None) – File object containing training data.
sample_size (int) – Size of the sample to draw.
blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.
Examples
>>> matcher.prepare_training(data_d, sample_size=150000, blocked_proportion=.5)
>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_d, training_file=f)
- uncertain_pairs()
Returns a list of pairs of records, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.
This method is mainly useful for building a user interface for training a matching model.
Examples
>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
- mark_pairs(labeled_pairs)
Add user-labeled pairs of records to the training data and update the matching model.
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
- Parameters:
labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists of pairs of records.
Examples
>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)
Note
mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.
If you have existing training data, you should likely format the data into the right form and supply the training data to the prepare_training() method with the training_file argument.
If that is not possible or desirable, you can use mark_pairs() to train a linker with existing data. However, you must ensure that every record that appears in the labeled_pairs argument appears in either the data or training file supplied to the prepare_training() method.
- train(recall=1.0, index_predicates=True)
Learn the final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.
- Parameters:
recall (float) – The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare. recall should be a float between 0.0 and 1.0.
index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data? Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true dupes are not blocked together.
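Examples
A minimal sketch (assuming matcher has already been given labeled pairs, e.g. via mark_pairs() or dedupe.console_label(); the argument values here are illustrative):
>>> matcher.train(recall=0.9, index_predicates=False)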
- write_training(file_obj)
Write a JSON file that contains labeled examples
- Parameters:
file_obj (TextIO) – File object to write training data to.
Examples
>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)
- write_settings(file_obj)
Write a settings file containing the data model and predicates to a file object
- Parameters:
file_obj (BinaryIO) – File object to write settings data into.
Examples
>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)
- cleanup_training()
Clean up data we used for training. Free up memory.
- partition(data, threshold=0.5)
Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity and the confidence score is a measure of our confidence that a particular record belongs in the cluster.
For details on the confidence score, see dedupe.Dedupe.cluster().
This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().
- Parameters:
data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names.
threshold – Number between 0 and 1. We will only put records into a cluster if the cophenetic similarity of the cluster is greater than the threshold. Lowering the number will increase recall; raising it will increase precision.
Examples
>>> duplicates = matcher.partition(data, threshold=0.5)
>>> duplicates
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]
StaticDedupe Objects
- class dedupe.StaticDedupe(settings_file, num_cores=None, in_memory=False, **kwargs)[source]
Class for deduplication using saved settings. If you have already trained a Dedupe object and saved the settings, you can load the saved settings with StaticDedupe.
- Parameters:
settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.
num_cores (int | None) – The number of CPUs to use for parallel processing, defaults to the number of CPUs available on the machine. If set to 0, multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ':memory:' option rather than writing to disk. May be faster if sufficient memory is available.
Warning
If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = StaticDedupe(f)
- partition(data, threshold=0.5)
Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity and the confidence score is a measure of our confidence that a particular record belongs in the cluster.
For details on the confidence score, see dedupe.Dedupe.cluster().
This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().
- Parameters:
data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names.
threshold – Number between 0 and 1. We will only put records into a cluster if the cophenetic similarity of the cluster is greater than the threshold. Lowering the number will increase recall; raising it will increase precision.
Examples
>>> duplicates = matcher.partition(data, threshold=0.5)
>>> duplicates
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]
RecordLink Objects
- class dedupe.RecordLink(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]
Class for active learning record linkage.
Use RecordLink when you have two datasets that you want to join.
- Parameters:
variable_definition (Collection[Variable]) – A list of Variable objects describing the variables that will be used for training a model. See Variable Definitions.
num_cores (int | None) – The number of CPUs to use for parallel processing. If set to None, uses all CPUs available on the machine. If set to 0, multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ':memory:' option rather than writing to disk. May be faster if sufficient memory is available.
Warning
If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    dedupe.variables.String("Site name"),
    dedupe.variables.String("Address"),
    dedupe.variables.String("Zip", has_missing=True),
    dedupe.variables.String("Phone", has_missing=True),
]
deduper = dedupe.RecordLink(variables)
- prepare_training(data_1, data_2, training_file=None, sample_size=1500, blocked_proportion=0.9)
Initialize the active learner with your data and, optionally, existing training data.
- Parameters:
data_1 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
data_2 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from second dataset, same form as data_1.
training_file (TextIO | None) – File object containing training data.
sample_size (int) – The size of the sample to draw.
blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.
Examples
>>> matcher.prepare_training(data_1, data_2, sample_size=150000)
or
>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_1, data_2, training_file=f)
- uncertain_pairs()
Returns a list of pairs of records, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.
This method is mainly useful for building a user interface for training a matching model.
Examples
>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
- mark_pairs(labeled_pairs)
Add user-labeled pairs of records to the training data and update the matching model.
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
- Parameters:
labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists of pairs of records.
Examples
>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)
Note
mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.
If you have existing training data, you should likely format the data into the right form and supply the training data to the prepare_training() method with the training_file argument.
If that is not possible or desirable, you can use mark_pairs() to train a linker with existing data. However, you must ensure that every record that appears in the labeled_pairs argument appears in either the data or training file supplied to the prepare_training() method.
- train(recall=1.0, index_predicates=True)
Learn the final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.
- Parameters:
recall (float) – The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare. recall should be a float between 0.0 and 1.0.
index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data? Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true dupes are not blocked together.
- write_training(file_obj)
Write a JSON file that contains labeled examples
- Parameters:
file_obj (TextIO) – File object to write training data to.
Examples
>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)
- write_settings(file_obj)
Write a settings file containing the data model and predicates to a file object
- Parameters:
file_obj (BinaryIO) – File object to write settings data into.
Examples
>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)
- cleanup_training()
Clean up data we used for training. Free up memory.
- join(data_1, data_2, threshold=0.5, constraint='one-to-one')
Identifies pairs of records that refer to the same entity.
Returns pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.
This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().
- Parameters:
data_1 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
data_2 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from second dataset, same form as data_1.
threshold (float) – Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
constraint (Literal['one-to-one', 'many-to-one', 'many-to-many']) – What type of constraint to put on the join.
- 'one-to-one': Every record in data_1 can match at most one record from data_2 and every record from data_2 can match at most one record from data_1. This is good for when both data_1 and data_2 are from different sources and you are interested in matching across the sources. If, individually, data_1 or data_2 have many duplicates, you will not get good matches.
- 'many-to-one': Every record in data_1 can match at most one record from data_2, but more than one record from data_1 can match the same record in data_2. This is good for when data_2 is a lookup table and data_1 is messy, such as geocoding or matching against golden records.
- 'many-to-many': Every record in data_1 can match multiple records in data_2 and vice versa. This is like a SQL inner join.
Examples
>>> links = matcher.join(data_1, data_2, threshold=0.5)
>>> list(links)
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((10, 11), 0.899),
]
StaticRecordLink Objects
- class dedupe.StaticRecordLink(settings_file, num_cores=None, in_memory=False, **kwargs)[source]
Class for record linkage using saved settings. If you have already trained a RecordLink instance, you can load the saved settings with StaticRecordLink.
- Parameters:
settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.
num_cores (int | None) – The number of CPUs to use for parallel processing, defaults to the number of CPUs available on the machine. If set to 0, multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ':memory:' option rather than writing to disk. May be faster if sufficient memory is available.
Warning
If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = StaticRecordLink(f)
- join(data_1, data_2, threshold=0.5, constraint='one-to-one')
Identifies pairs of records that refer to the same entity.
Returns pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.
This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().
- Parameters:
data_1 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
data_2 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from second dataset, same form as data_1.
threshold (float) – Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
constraint (Literal['one-to-one', 'many-to-one', 'many-to-many']) – What type of constraint to put on the join.
- 'one-to-one': Every record in data_1 can match at most one record from data_2 and every record from data_2 can match at most one record from data_1. This is good for when both data_1 and data_2 are from different sources and you are interested in matching across the sources. If, individually, data_1 or data_2 have many duplicates, you will not get good matches.
- 'many-to-one': Every record in data_1 can match at most one record from data_2, but more than one record from data_1 can match the same record in data_2. This is good for when data_2 is a lookup table and data_1 is messy, such as geocoding or matching against golden records.
- 'many-to-many': Every record in data_1 can match multiple records in data_2 and vice versa. This is like a SQL inner join.
Examples
>>> links = matcher.join(data_1, data_2, threshold=0.5)
>>> list(links)
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((10, 11), 0.899),
]
Gazetteer Objects
- class dedupe.Gazetteer(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]
Class for active learning gazetteer matching.
Gazetteer matching is for matching a messy data set against a 'canonical dataset'. This class is useful for such tasks as matching messy addresses against a clean list.
- Parameters:
variable_definition (Collection[Variable]) – A list of Variable objects describing the variables that will be used for training a model. See Variable Definitions.
num_cores (int | None) – The number of CPUs to use for parallel processing. If set to None, uses all CPUs available on the machine. If set to 0, multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ':memory:' option rather than writing to disk. May be faster if sufficient memory is available.
Warning
If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    dedupe.variables.String("Site name"),
    dedupe.variables.String("Address"),
    dedupe.variables.String("Zip", has_missing=True),
    dedupe.variables.String("Phone", has_missing=True),
]
matcher = dedupe.Gazetteer(variables)
- prepare_training(data_1, data_2, training_file=None, sample_size=1500, blocked_proportion=0.9)
Initialize the active learner with your data and, optionally, existing training data.
- Parameters:
data_1 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
data_2 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from second dataset, same form as data_1.
training_file (TextIO | None) – File object containing training data.
sample_size (int) – The size of the sample to draw.
blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.
Examples
>>> matcher.prepare_training(data_1, data_2, sample_size=150000)
or
>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_1, data_2, training_file=f)
- uncertain_pairs()
Returns a list of pairs of records, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.
This method is mainly useful for building a user interface for training a matching model.
Examples
>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
- mark_pairs(labeled_pairs)
Add user-labeled pairs of records to the training data and update the matching model.
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
- Parameters:
labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists of pairs of records.
Examples
>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)
Note
mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.
If you have existing training data, you should likely format the data into the right form and supply the training data to the prepare_training() method with the training_file argument.
If that is not possible or desirable, you can use mark_pairs() to train a linker with existing data. However, you must ensure that every record that appears in the labeled_pairs argument appears in either the data or training file supplied to the prepare_training() method.
- train(recall=1.0, index_predicates=True)
Learn the final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.
- Parameters:
recall (float) – The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare. recall should be a float between 0.0 and 1.0.
index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data? Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true dupes are not blocked together.
- write_training(file_obj)
Write a JSON file that contains labeled examples
- Parameters:
file_obj (TextIO) – File object to write training data to.
Examples
>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)
- write_settings(file_obj)
Write a settings file containing the data model and predicates to a file object
- Parameters:
file_obj (BinaryIO) – File object to write settings data into.
Examples
>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)
- cleanup_training()
Clean up data we used for training. Free up memory.
- index(data)
Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.
- Parameters:
data – A dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field names.
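Examples
A minimal sketch (canonical_data is a stand-in name for your canonical dataset, in the same record-dictionary form as data):
>>> gazetteer.index(canonical_data)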
- unindex(data)
Remove records from the index of records to match against.
- Parameters:
data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
- search(data, threshold=0.0, n_matches=1, generator=False)
Identifies pairs of records that could refer to the same entity, returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.
- Parameters:
data – A dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
threshold – A number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
n_matches – The maximum number of possible matches from canonical_data to return for each record in data. If set to None, all possible matches above the threshold will be returned.
generator – When True, search will generate a sequence of possible matches, instead of a list.
Examples
>>> matches = gazetteer.search(data, threshold=0.5, n_matches=2)
>>> print(matches)
[
    (((1, 6), 0.72), ((1, 8), 0.6)),
    (((2, 7), 0.72),),
    (((3, 6), 0.72), ((3, 8), 0.65)),
    (((4, 6), 0.96), ((4, 5), 0.63)),
]
StaticGazetteer Objects
- class dedupe.StaticGazetteer(settings_file, num_cores=None, in_memory=False, **kwargs)[source]
Class for gazetteer matching using saved settings.
If you have already trained a Gazetteer instance, you can load the saved settings with StaticGazetteer.
- Parameters:
settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.
num_cores (int | None) – The number of CPUs to use for parallel processing, defaults to the number of CPUs available on the machine. If set to 0, multiprocessing will be disabled.
in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ':memory:' option rather than writing to disk. May be faster if sufficient memory is available.
Warning
If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with an if __name__ == '__main__' guard in your main module; see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = StaticGazetteer(f)
- index(data)
Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.
- Parameters:
data – A dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field names.
- unindex(data)
Remove records from the index of records to match against.
- Parameters:
data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
- search(data, threshold=0.0, n_matches=1, generator=False)
Identifies pairs of records that could refer to the same entity, returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.
- Parameters:
data – A dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
threshold – A number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
n_matches – The maximum number of possible matches from canonical_data to return for each record in data. If set to None, all possible matches above the threshold will be returned.
generator – When True, search will generate a sequence of possible matches, instead of a list.
Examples
>>> matches = gazetteer.search(data, threshold=0.5, n_matches=2)
>>> print(matches)
[
    (((1, 6), 0.72), ((1, 8), 0.6)),
    (((2, 7), 0.72),),
    (((3, 6), 0.72), ((3, 8), 0.65)),
    (((4, 6), 0.96), ((4, 5), 0.63)),
]
- blocks(data)
Yield groups of pairs of records that share fingerprints.
Each group contains one record from data paired with the indexed records that it shares a fingerprint with.
Each pair within and among blocks will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly many_to_n(), assume that every pair of records is compared no more than once.
- Parameters:
data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names.
Examples
>>> blocks = matcher.blocks(data)
>>> print(list(blocks))
[
    [
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (8, {"name": "Pat", "address": "123 Main"}),
        ),
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (9, {"name": "Sam", "address": "123 Main"}),
        ),
    ],
    [
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (5, {"name": "Pam", "address": "2600 Stat"}),
        ),
        (
            (2, {"name": "Sam", "address": "123 State"}),
            (7, {"name": "Sammy", "address": "123 Main"}),
        ),
    ],
]
- score(blocks)
Scores groups of pairs of records. Yields structured numpy arrays representing pairs of records in the group and the associated probability that the pair is a match.
- Parameters:
blocks – Iterator of blocks of records
- many_to_n(score_blocks, threshold=0.0, n_matches=1)
For each group of scored pairs, yield the N highest scoring pairs.
- Parameters:
score_blocks (Iterable[memmap | ndarray]) – Iterator of numpy structured arrays, each with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The 'pairs' column contains the pairs of ids of the records compared and the 'score' column contains the similarity score for that pair of records.
threshold (float) – Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
n_matches (int) – How many top scoring pairs to select per group.
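Examples
A sketch of how blocks(), score(), and many_to_n() chain together (assuming the matcher's index has already been populated via index(); messy_data is a stand-in name):
>>> blocks = matcher.blocks(messy_data)
>>> score_blocks = matcher.score(blocks)
>>> matches = matcher.many_to_n(score_blocks, threshold=0.5, n_matches=2)
>>> for match_group in matches:
...     print(match_group)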
Lower Level Classes and Methods
With the methods documented above, you can work with data into the millions of records. However, if you are working with larger data, you may not be able to load all your data into memory. You'll need to interact with some of the lower level classes and methods.
See also
The PostgreSQL and MySQL examples use these lower level classes and methods.
Dedupe and StaticDedupe
- class dedupe.Dedupe[source]
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
- pairs(data)
Yield pairs of records that share common fingerprints.
Each pair will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly cluster(), assume that every pair of records is compared no more than once.
- Parameters:
data (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names.
Examples
>>> pairs = matcher.pairs(data)
>>> list(pairs)
[
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (2, {"name": "Pat", "address": "123 Main"}),
    ),
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (3, {"name": "Sam", "address": "123 Main"}),
    ),
]
- score(pairs)
Scores pairs of records. Returns tuples of record id pairs and the associated probability that the pair of records is a match.
- Parameters:
pairs (Iterator[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]] | Iterator[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]) – Iterator of pairs of records.
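Examples
A minimal sketch (assuming matcher is a trained Dedupe instance; the returned structured array has the 'pairs' and 'score' columns described under cluster()):
>>> pairs = matcher.pairs(data)
>>> scores = matcher.score(pairs)
>>> scores['pairs'][0]  # ids of the first compared pair
>>> scores['score'][0]  # estimated probability that the pair is a match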
- cluster(scores, threshold=0.5)
From the similarity scores of pairs of records, decide which groups of records are all referring to the same entity.
Yields tuples containing a sequence of record ids and corresponding sequence of confidence score as a float between 0 and 1. The record_ids within each set should refer to the same entity and the confidence score is a measure of our confidence a particular entity belongs in the cluster.
Each confidence scores is a measure of how similar the record is to the other records in the cluster. Let \(\phi(i,j)\) be the pair-wise similarity between records \(i\) and \(j\). Let \(N\) be the number of records in the cluster.
\[\text{confidence score}_i = 1 - \sqrt{\frac{\sum_{j}^{N} (1 - \phi(i,j))^2}{N - 1}}\]
This measure is similar to the average squared distance between the focal record and the other records in the cluster. These scores can be combined to give a total score for the cluster.
\[\text{cluster score} = 1 - \sqrt{\frac{\sum_i^N (1 - \mathrm{score}_i)^2 \cdot (N - 1)}{2 N^2}}\]
- Parameters:
scores (memmap | ndarray) – A numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The 'pairs' column contains the pairs of ids of the records compared and the 'score' column contains the similarity score for that pair of records. For each pair, the smaller id should be first.
threshold (float) – Number between 0 and 1. We will only put records into a cluster if the cophenetic similarity of the cluster is greater than the threshold. Lowering the number will increase recall; raising it will increase precision.
Examples
>>> pairs = matcher.pairs(data)
>>> scores = matcher.score(pairs)
>>> clusters = matcher.cluster(scores)
>>> list(clusters)
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]
- class dedupe.StaticDedupe[source]
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
- pairs(data)
Same as dedupe.Dedupe.pairs()
- score(pairs)
Same as dedupe.Dedupe.score()
- cluster(scores, threshold=0.5)
Same as dedupe.Dedupe.cluster()
RecordLink and StaticRecordLink
- class dedupe.RecordLink[source]
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
- pairs(data_1, data_2)
Yield pairs of records that share common fingerprints.
Each pair will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly one_to_one() and many_to_one(), assume that every pair of records is compared no more than once.
- Parameters:
data_1 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
data_2 (Mapping[int, Mapping[str, Any]] | Mapping[str, Mapping[str, Any]]) – Dictionary of records from second dataset, same form as data_1.
Examples
>>> pairs = matcher.pairs(data_1, data_2)
>>> list(pairs)
[
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (2, {"name": "Pat", "address": "123 Main"}),
    ),
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (3, {"name": "Sam", "address": "123 Main"}),
    ),
]
- score(pairs)
Scores pairs of records. Returns tuples of record id pairs and the associated probability that the pair of records is a match.
- Parameters:
pairs (Iterator[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]] | Iterator[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]) – Iterator of pairs of records.
- one_to_one(scores, threshold=0.0)
From the similarity scores of pairs of records, decide which pairs refer to the same entity.
Every record in data_1 can match at most one record from data_2 and every record from data_2 can match at most one record from data_1. See https://en.wikipedia.org/wiki/Injective_function.
This method is good for when both data_1 and data_2 are from different sources and you are interested in matching across the sources. If, individually, data_1 or data_2 have many duplicates you will not get good matches.
Yields pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.
- Parameters:
scores (memmap | ndarray) – A numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The 'pairs' column contains the pairs of ids of the records compared and the 'score' column contains the similarity score for that pair of records.
threshold (float) – Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
Examples
>>> pairs = matcher.pairs(data_1, data_2)
>>> scores = matcher.score(pairs)
>>> links = matcher.one_to_one(scores, threshold=0.5)
>>> list(links)
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((10, 11), 0.899),
]
- many_to_one(scores, threshold=0.0)
From the similarity scores of pairs of records, decide which pairs refer to the same entity.
Every record in data_1 can match at most one record from data_2, but more than one record from data_1 can match to the same record in data_2. See https://en.wikipedia.org/wiki/Surjective_function
This method is good for when data_2 is a lookup table and data_1 is messy, such as geocoding or matching against golden records.
Yields pairs of record ids with a confidence score as a float between 0 and 1. The record_ids within the pair should refer to the same entity and the confidence score is the estimated probability that the records refer to the same entity.
- Parameters:
scores (memmap | ndarray) – A numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The 'pairs' column contains the pairs of ids of the records compared and the 'score' column contains the similarity score for that pair of records.
threshold (float) – Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
Examples
>>> pairs = matcher.pairs(data_1, data_2)
>>> scores = matcher.score(pairs)
>>> links = matcher.many_to_one(scores, threshold=0.5)
>>> print(list(links))
[
    ((1, 2), 0.790),
    ((4, 5), 0.720),
    ((7, 2), 0.623),
    ((10, 11), 0.899),
]
- class dedupe.StaticRecordLink[source]
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
- pairs(data_1, data_2)
Same as dedupe.RecordLink.pairs()
- score(pairs)
Same as dedupe.RecordLink.score()
- one_to_one(scores, threshold=0.0)
Same as dedupe.RecordLink.one_to_one()
- many_to_one(scores, threshold=0.0)
Same as dedupe.RecordLink.many_to_one()
Gazetteer and StaticGazetteer
- class dedupe.Gazetteer[source]
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
- blocks(data)
Yield groups of pairs of records that share fingerprints.
Each group contains one record from data paired with the indexed records that it shares a fingerprint with.
Each pair within and among blocks will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly many_to_n(), assume that every pair of records is compared no more than once.
- Parameters:
data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names.
Examples
>>> blocks = matcher.blocks(data)
>>> print(list(blocks))
[
    [
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (8, {"name": "Pat", "address": "123 Main"}),
        ),
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (9, {"name": "Sam", "address": "123 Main"}),
        ),
    ],
    [
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (5, {"name": "Pam", "address": "2600 Stat"}),
        ),
        (
            (2, {"name": "Sam", "address": "123 State"}),
            (7, {"name": "Sammy", "address": "123 Main"}),
        ),
    ],
]
- score(blocks)
Scores groups of pairs of records. Yields structured numpy arrays representing pairs of records in the group and the associated probability that the pair is a match.
- Parameters:
blocks – Iterator of blocks of records
- many_to_n(score_blocks, threshold=0.0, n_matches=1)
For each group of scored pairs, yield the N highest scoring pairs.
- Parameters:
score_blocks (Iterable[memmap | ndarray]) – Iterator of numpy structured arrays, each with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either a str or int, and score is a number between 0 and 1. The 'pairs' column contains the pairs of ids of the records compared and the 'score' column contains the similarity score for that pair of records.
threshold (float) – Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold. Lowering the number will increase recall; raising it will increase precision.
n_matches (int) – How many top scoring pairs to select per group.
- class dedupe.StaticGazetteer
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
- blocks(data)
Same as dedupe.Gazetteer.blocks()
- score(blocks)
Same as dedupe.Gazetteer.score()
- many_to_n(score_blocks, threshold=0.0, n_matches=1)
Same as dedupe.Gazetteer.many_to_n()
Fingerprinter Objects
- class dedupe.blocking.Fingerprinter(predicates)[source]
Takes in a record and returns all blocks that record belongs to
- __call__(records, target=False)[source]
Generate the predicates for records. Yields tuples of (predicate, record_id).
- Parameters:
records – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().
target – Indicates whether the data should be treated as the target data. This affects the behavior of search predicates. If target is set to True, a search predicate will return the value itself. If target is set to False, the search predicate will return all possible values within the specified search distance.
Let's say we have a LevenshteinSearchPredicate with an associated distance of 1 on a "name" field, and we have a record like {"name": "thomas"}. If target is set to True then the predicate will return "thomas". If target is set to False, then the blocker could return "thomas", "tomas", and "thoms". By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.

>>> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
>>> blocked_ids = deduper.fingerprinter(data)
>>> print(list(blocked_ids))
[('foo:1', 1), ..., ('bar:1', 100)]
- index_fields: dict[str, DefaultDict[str, List[IndexPredicate]]]
A dictionary of all the fingerprinter methods that use an index of data field values. The keys are the field names, which can be useful to know for indexing the data.
- index(docs, field)[source]
Add docs to the indices used by fingerprinters.
Some fingerprinter methods depend upon having an index of values that a field may have in the data. This method adds those values to the index. If you don’t have any fingerprinter methods that use an index, this method will do nothing.
- Parameters:
docs (Iterable[str] | Iterable[Iterable[str]]) – An iterator of values from your data to index. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.
field (str) – Field name or key associated with the values you are indexing.
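Examples
A sketch of driving index() from index_fields (matcher and data are stand-in names for a trained matcher and your record dictionary):
>>> fingerprinter = matcher.fingerprinter
>>> for field in fingerprinter.index_fields:
...     field_values = {record[field] for record in data.values()}
...     fingerprinter.index(field_values, field)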
- unindex(docs, field)[source]
Remove docs from indices used by fingerprinters
- Parameters:
docs (Iterable[str] | Iterable[Iterable[str]]) – An iterator of values from your data to remove. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.
field (str) – Field name or key associated with the values you are unindexing.
Convenience Functions
- dedupe.console_label(deduper)[source]
Train a matcher instance (Dedupe, RecordLink, or Gazetteer) from the command line.
Example
>>> deduper = dedupe.Dedupe(variables)
>>> deduper.prepare_training(data)
>>> dedupe.console_label(deduper)
- dedupe.training_data_dedupe(data, common_key, training_size=50000)[source]
Construct training data for consumption by the mark_pairs() method from an already deduplicated dataset.
- Parameters:
data – Dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field names
common_key – The name of the record field that uniquely identifies a match
training_size – the rough limit of the number of training examples, defaults to 50000
Note
Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
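Examples
A minimal sketch (assuming each record carries a field, here called 'canonical_id', shared by all records that refer to the same entity):
>>> training_data = dedupe.training_data_dedupe(data, common_key='canonical_id')
>>> deduper.prepare_training(data)
>>> deduper.mark_pairs(training_data)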
- dedupe.training_data_link(data_1, data_2, common_key, training_size=50000)[source]
Construct training data for consumption by the mark_pairs() method from already linked datasets.
- Parameters:
data_1 – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names
data_2 – Dictionary of records from second dataset, same form as data_1
common_key – The name of the record field that uniquely identifies a match
training_size – the rough limit of the number of training examples, defaults to 50000
Note
Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
- dedupe.canonicalize(record_cluster)[source]
Constructs a canonical representation of a duplicate cluster by finding canonical values for each field
- Parameters:
record_cluster (list[Mapping[str, Any]]) – A list of records within a duplicate cluster, where the records are dictionaries with field names as keys and field values as values.
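Examples
A minimal sketch (the records below are illustrative; in practice you would pass the records belonging to one cluster returned by partition()):
>>> cluster = [
...     {'name': 'Georgie Porgie', 'address': '123 Main St'},
...     {'name': 'George Porgie', 'address': '123 Main Street'},
... ]
>>> canonical_record = dedupe.canonicalize(cluster)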
- dedupe.read_training(training_file)[source]
Read training data from a previously built training data file object.
- Parameters:
training_file (TextIO) – File object containing the training data.
- Returns:
A dictionary with two keys, match and distinct. See the inverse, write_training().
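Examples
For example, to reload saved labels and continue training (assuming matcher has already been prepared with the same data):
>>> with open('training.json') as f:
...     training_data = dedupe.read_training(f)
>>> matcher.mark_pairs(training_data)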
- dedupe.write_training(labeled_pairs, file_obj)[source]
Write a JSON file that contains labeled examples
- Parameters:
labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists of pairs of records.
file_obj (TextIO) – File object to write training data to.

examples = {
    "match": [
        ({'name' : 'Georgie Porgie'}, {'name' : 'George Porgie'}),
    ],
    "distinct": [
        ({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'}),
    ],
}
with open('training.json', 'w') as f:
    dedupe.write_training(examples, f)