Variable Definitions

Variable Types

A variable definition describes the records that you want to match. It is a list of dictionaries, one per variable, where each dictionary names a field and gives its specification. For example:-

[
    {'field': 'Site name', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'Zip', 'type': 'ShortString', 'has missing': True},
    {'field': 'Phone', 'type': 'String', 'has missing': True}
]

String Types

A String type field must declare the name of the record field to compare and a String type declaration. The String type expects field values to be strings.

String types are compared using string edit distance, specifically affine gap string distance. This is a good metric for measuring fields that might have typos in them, such as “John” vs “Jon”.

For example:-

{'field': 'Address', 'type': 'String'}
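To make the affine gap idea concrete, here is a minimal sketch of an affine gap distance computed with three dynamic-programming tables. The function name and the cost values (`mismatch`, `gap_open`, `gap_extend`) are assumptions chosen for illustration; they are not dedupe's actual implementation or weights. The point is that a run of consecutive insertions or deletions pays an opening cost once and a cheaper extension cost after that.

```python
def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Illustrative affine gap distance; costs are hypothetical."""
    inf = float("inf")
    n, m = len(a), len(b)
    # sub[i][j]:  best cost when a[i-1] is aligned with b[j-1]
    # gap_b[i][j]: best cost when a[i-1] is aligned with a gap
    # gap_a[i][j]: best cost when b[j-1] is aligned with a gap
    sub = [[inf] * (m + 1) for _ in range(n + 1)]
    gap_b = [[inf] * (m + 1) for _ in range(n + 1)]
    gap_a = [[inf] * (m + 1) for _ in range(n + 1)]
    sub[0][0] = 0.0
    for i in range(1, n + 1):
        gap_b[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap_a[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else mismatch
            sub[i][j] = cost + min(
                sub[i - 1][j - 1], gap_b[i - 1][j - 1], gap_a[i - 1][j - 1]
            )
            # Extending an existing gap is cheaper than opening a new one.
            gap_b[i][j] = min(
                sub[i - 1][j] + gap_open,
                gap_b[i - 1][j] + gap_extend,
                gap_a[i - 1][j] + gap_open,
            )
            gap_a[i][j] = min(
                sub[i][j - 1] + gap_open,
                gap_a[i][j - 1] + gap_extend,
                gap_b[i][j - 1] + gap_open,
            )
    return min(sub[n][m], gap_b[n][m], gap_a[n][m])


print(affine_gap_distance("John", "Jon"))  # 1.0
```

With these example costs, dropping the single letter "h" costs one gap opening, while dropping two adjacent letters would cost 1.5 rather than 2.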

ShortString Types

A ShortString type field is just like a String type field, except that dedupe will not try to learn any index blocking rules for it, which can speed up the training phase considerably.

Zip codes and city names are good candidates for this type. If in doubt, always use String.

For example:-

{'field': 'Zipcode', 'type': 'ShortString'}

Text Types

If you want to compare fields containing blocks of text e.g. product descriptions or article abstracts, you should use this type. Text type fields are compared using the cosine similarity metric.

This is a measurement of the number of words that two documents have in common. The measure becomes more useful when the overlap of rare words counts for more than the overlap of common words.

Compare this to String and ShortString types. For strings containing occupations, “yoga teacher” might be fairly similar to “yoga instructor” when using the Text measurement, because they both contain the relatively rare word “yoga”. However, if you compared these two strings using the String or ShortString measurements, they might be considered fairly dissimilar, because the string edit distance between them is large.

If you provide a sequence of example field values (i.e. a corpus), dedupe will learn these weights for you. For example:-

{
    'field': 'Product description',
    'type': 'Text',
    'corpus': [
        'this product is great',
        'this product is great and blue'
    ]
}

If you don’t want to adjust the measure to your data, just leave ‘corpus’ out of the variable definition entirely.

{'field': 'Product description', 'type': 'Text'}
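As a rough illustration of how weighting rare words changes the comparison, the sketch below computes a cosine similarity over word counts scaled by inverse document frequency learned from a small corpus. The function names, the smoothing choice, and the toy corpus are assumptions made for this example; dedupe's own weighting may differ in its details.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Learn a weight per word: words seen in fewer documents weigh more."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}


def weighted_cosine(text_1, text_2, idf):
    """Cosine similarity between two texts using the learned word weights."""
    # Words never seen in the corpus are treated as rare (an assumption).
    default = max(idf.values(), default=1.0)

    def vector(text):
        counts = Counter(text.lower().split())
        return {w: c * idf.get(w, default) for w, c in counts.items()}

    v1, v2 = vector(text_1), vector(text_2)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(
        sum(x * x for x in v2.values())
    )
    return dot / norm if norm else 0.0


corpus = ["math teacher", "piano teacher", "history teacher",
          "yoga teacher", "ski instructor"]
idf = idf_weights(corpus)
# Sharing the rare word "yoga" counts for more than sharing "teacher".
print(weighted_cosine("yoga teacher", "yoga instructor", idf))
print(weighted_cosine("yoga teacher", "math teacher", idf))
```

In this toy corpus "teacher" appears in four of five documents, so the first pair scores higher than the second even though both pairs share exactly one word.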

Custom Types

A Custom type field must specify the field it wants to compare, a type declaration of Custom, and a comparator declaration. The comparator must be a function that takes two field values and returns a number.

For example, a custom comparator:

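def same_or_not_comparator(field_1, field_2):
    # Only compare when both values are present (non-empty).
    if field_1 and field_2:
        if field_1 == field_2:
            return 0  # identical values: distance 0
        else:
            return 1  # different values: distance 1
    # If either value is empty, the function falls through and returns None.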

The corresponding variable definition:

{
    'field': 'Zip',
    'type': 'Custom',
    'comparator': same_or_not_comparator
}

Custom fields do not have any blocking rules associated with them. Since dedupe needs blocking rules, a data model that only contains Custom fields will raise an error.
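Any function of two field values that returns a number can serve as a comparator. As a further sketch, the hypothetical comparator below (its name and its matching rule are assumptions made for this example, not part of dedupe) treats two ZIP codes as the same when their first five characters agree, so a ZIP+4 value can match a plain five-digit one.

```python
def zip_prefix_comparator(field_1, field_2):
    """Return 0 when the five-character ZIP prefixes match, otherwise 1."""
    return 0 if field_1[:5] == field_2[:5] else 1


# The comparator is passed as a function object, not as a string.
zip_variable = {
    'field': 'Zip',
    'type': 'Custom',
    'comparator': zip_prefix_comparator,
}

print(zip_prefix_comparator('60611-1234', '60611'))  # 0
print(zip_prefix_comparator('60611', '60614'))       # 1
```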

LatLong

A LatLong type field must declare the name of a field and a type declaration of LatLong. LatLong fields are compared using the Haversine formula.

The values of a LatLong type field must be tuples of floats corresponding to a latitude and a longitude.

{'field': 'Location', 'type': 'LatLong'}
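For reference, the Haversine formula gives the great-circle distance between two points on a sphere. The sketch below is a standalone version; the function name, the kilometre units, and the mean Earth radius of 6371 km are choices made for this example and are not taken from dedupe.

```python
import math


def haversine_km(point_1, point_2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) tuples in degrees."""
    lat1, lon1 = map(math.radians, point_1)
    lat2, lon2 = map(math.radians, point_2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the valid range.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


chicago = (41.8781, -87.6298)
new_york = (40.7128, -74.0060)
print(haversine_km(chicago, new_york))
```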

Set

A Set type field is for comparing lists of elements, like keywords or client names. Set types are very similar to Text types. They use the same comparison function, and you can also let dedupe learn which terms are common or rare by providing a corpus. Within a record, the value of a Set type field has to be a hashable sequence, such as a tuple or a frozenset.

{
    'field': 'Co-authors',
    'type': 'Set',
    'corpus': [
        ('steve edwards',),
        ('steve edwards', 'steve jobs')
    ]
}

or

{'field': 'Co-authors', 'type': 'Set'}
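Because Set values must be hashable, records that store such a field as a list need converting first. The sketch below shows one way to do that with a hypothetical helper (`as_hashable_set_field` is not part of dedupe), and also shows that a one-element tuple needs a trailing comma. Mapping an empty list to None is an assumption that follows the missing data convention described later in this document.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
just_a_string = ('steve edwards')
one_author = ('steve edwards',)


def as_hashable_set_field(record, field):
    """Return a copy of record with record[field] converted to a tuple."""
    cleaned = dict(record)
    value = cleaned[field]
    # An empty or absent value becomes None rather than an empty tuple.
    cleaned[field] = tuple(value) if value else None
    return cleaned


record = {'Title': 'A paper', 'Co-authors': ['steve edwards', 'steve jobs']}
cleaned = as_hashable_set_field(record, 'Co-authors')
print(type(just_a_string).__name__)  # str
print(type(one_author).__name__)     # tuple
print(cleaned['Co-authors'])         # ('steve edwards', 'steve jobs')
```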

Interaction

An Interaction field multiplies the values of multiple variables. An Interaction variable is created with a type declaration of Interaction and an interaction variables declaration.

The interaction variables field must be a sequence of variable names of other fields you have defined in your variable definition.

Interactions are good when the effect of two predictors is not simply additive.

[
    {'field': 'Name', 'variable name': 'name', 'type': 'String'},
    {'field': 'Zip', 'variable name': 'zip', 'type': 'Custom',
     'comparator': same_or_not_comparator},
    {'type': 'Interaction', 'interaction variables': ['name', 'zip']}
]
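The multiplication itself is simple. As a sketch, assuming each named variable has already produced a numeric comparison value for a pair of records, the interaction value is the product of those numbers. The function name and the example values here are hypothetical.

```python
import math


def interaction_value(distances, interaction_variables):
    """Multiply the comparison values of the named variables."""
    return math.prod(distances[name] for name in interaction_variables)


# Hypothetical comparison values for one pair of records.
distances = {'name': 0.5, 'zip': 1}
print(interaction_value(distances, ['name', 'zip']))  # 0.5
```

With a 0/1 comparator on zip, the interaction term is zero whenever the zip codes agree, so the name distance only contributes through this term when the zip codes differ.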

Exact

Exact variables measure whether two fields are exactly the same or not.

{'field': 'city', 'type': 'Exact'}

Exists

Exists variables are useful if the presence or absence of a field tells you something meaningful about a pair of records. It differentiates between three different cases:

The field is missing in both records.

The field is missing in one of the records.

The field is present in both of the records.

{'field': 'first_name', 'type': 'Exists'}
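The three cases can be sketched as a small function. This is an illustration only: the function name and the string labels are made up for this example, and treating only None as missing is an assumption based on the missing data convention used in this document.

```python
def exists_case(field_1, field_2):
    """Classify a pair of field values by how many of them are present."""
    n_present = (field_1 is not None) + (field_2 is not None)
    return ('missing in both', 'missing in one', 'present in both')[n_present]


print(exists_case(None, None))      # missing in both
print(exists_case('Ada', None))     # missing in one
print(exists_case('Ada', 'Grace'))  # present in both
```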

Categorical

Categorical variables are useful when you are dealing with qualitatively different types of things. For example, you may have data on businesses and you find that taxi cab businesses tend to have very similar names but law firms don’t. Categorical variables would let you indicate whether two records are both taxi companies, both law firms, or one of each. This is also a good choice for fields that are booleans, e.g. “True” or “False”.

Dedupe would represent these three possibilities using two dummy variables:

taxi-taxi      0 0
lawyer-lawyer  1 0
taxi-lawyer    0 1

A categorical field declaration must include a list of all the different strings that you want to treat as different categories.

So if your data looks like this:-

'Name'          'Business Type'
AAA Taxi        taxi
AA1 Taxi        taxi
Hindelbert Esq  lawyer

You would create a definition such as:

{
    'field': 'Business Type',
    'type': 'Categorical',
    'categories' : ['taxi', 'lawyer']
}
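To show where the two dummy variables in the table above come from, here is a minimal sketch of one possible encoding. The function names and the ordering of the levels are assumptions chosen to reproduce that table; they are not taken from dedupe's source.

```python
from itertools import combinations


def make_pair_encoder(categories):
    """Build a function that encodes a pair of category values as dummies."""
    # Same-category pairs come first, followed by mixed pairs.
    levels = [frozenset([c]) for c in categories]
    levels += [frozenset(pair) for pair in combinations(categories, 2)]

    def encode(value_1, value_2):
        level = frozenset([value_1, value_2])
        if level not in levels:
            raise ValueError(f"unknown category in pair: {value_1!r}, {value_2!r}")
        # The first level is the baseline and is encoded as all zeros.
        return tuple(int(level == other) for other in levels[1:])

    return encode


encode = make_pair_encoder(['taxi', 'lawyer'])
print(encode('taxi', 'taxi'))      # (0, 0)
print(encode('lawyer', 'lawyer'))  # (1, 0)
print(encode('taxi', 'lawyer'))    # (0, 1)
```

One level always serves as the baseline, which is why three possibilities need only two dummy variables.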

Price

Price variables are useful for comparing positive, non-zero numbers like prices. The value of a Price field must be a positive float. If the value is 0 or negative, then an exception will be raised.

{'field': 'cost', 'type': 'Price'}

Optional Variables

These variables aren’t included in the core of dedupe, but are available to install separately if you want to use them.

In addition to the several variables below, you can find more optional variables on GitHub.

DateTime

DateTime variables are useful for comparing dates and timestamps. This variable can accept strings or Python datetime objects as inputs.

The DateTime variable definition accepts a few optional arguments that can help improve behavior if you know your field follows an unusual format:

fuzzy - Use fuzzy parsing to automatically extract dates from strings like “It happened on June 2nd, 2018” (default True)
dayfirst - Ambiguous dates should be parsed as dd/mm/yy (default False)
yearfirst - Ambiguous dates should be parsed as yy/mm/dd (default False)

Note that the DateTime variable defaults to mm/dd/yy for ambiguous dates. If both dayfirst and yearfirst are set to True, then dayfirst will take precedence.
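To see why these options matter, the sketch below reads the same ambiguous string under the three conventions. It uses the standard library's `datetime.strptime` with explicit formats purely for illustration; this is not the parser the DateTime variable uses.

```python
from datetime import datetime

ambiguous = '01/02/03'

month_first = datetime.strptime(ambiguous, '%m/%d/%y')  # the default reading
day_first = datetime.strptime(ambiguous, '%d/%m/%y')    # as with dayfirst
year_first = datetime.strptime(ambiguous, '%y/%m/%d')   # as with yearfirst

print(month_first.date())  # 2003-01-02
print(day_first.date())    # 2003-02-01
print(year_first.date())   # 2001-02-03
```

The three readings are different calendar days, so choosing the wrong convention changes the distance between otherwise identical records.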

For example, a sample DateTime variable definition, using the defaults:

{
    'field': 'time_of_sale',
    'type': 'DateTime',
    'fuzzy': True,
    'dayfirst': False,
    'yearfirst': False
}

If you’re happy with the defaults, you can simply define the field and type:

{'field': 'time_of_sale', 'type': 'DateTime'}

Install the dedupe-variable-datetime package for DateTime Type. For more info, see the GitHub Repository.

Address Type

An Address variable should be used for United States addresses. It uses the usaddress package to split apart an address string into components like address number, street name, and street type and compares component to component.

For example:-

{'field': 'address', 'type': 'Address'}

Install the dedupe-variable-address package for Address Type. For more info, see the GitHub Repository.

Name Type

A Name variable should be used for a field that contains American names of people, corporations, and households. It uses the probablepeople package to split a name string into components: given name, surname, and generational suffix for names of people, and abbreviation, company type, and legal form for corporations.

For example:-

{'field': 'name', 'type': 'Name'}

Install the dedupe-variable-name package for Name Type. For more info, see the GitHub Repository.

Fuzzy Category

A FuzzyCategorical variable should be used when you have categorical data that has variations.

Occupations are an example, where you may have ‘Attorney’, ‘Counsel’, and ‘Lawyer’. For this variable type, you need to supply a corpus of records that contain your focal record and other field types. This corpus should either be all the data you are trying to link or a representative sample.

For example:-

{
    'field': 'occupation',
    'type': 'FuzzyCategorical',
    'corpus': [
        {'name': 'Jim Doe', 'occupation': 'Attorney'},
        {'name': 'Jim Doe', 'occupation': 'Lawyer'}
    ]
}

Install the dedupe-variable-fuzzycategory package for the FuzzyCategorical Type. For more info, see the GitHub Repository.

Missing Data

If the value of a field is missing, that missing value should be represented as a None object. You should also use None in place of empty strings (e.g. '').

[
     {'Name': 'AA Taxi', 'Phone': '773.555.1124'},
     {'Name': 'AA Taxi', 'Phone': None},
     {'Name': None, 'Phone': '773-555-1123'}
]
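A small preprocessing step can enforce this convention before records are handed to dedupe. The helper below is a hypothetical sketch, not part of dedupe; treating whitespace-only strings as missing is an extra assumption made for this example.

```python
def blank_to_none(record):
    """Return a copy of record with empty strings replaced by None."""
    return {
        key: None if isinstance(value, str) and not value.strip() else value
        for key, value in record.items()
    }


raw = {'Name': 'AA Taxi', 'Phone': ''}
print(blank_to_none(raw))  # {'Name': 'AA Taxi', 'Phone': None}
```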

If you want to model this missing data for a field, you can set 'has missing' : True in the variable definition. This creates a new, additional field representing whether the data was present or not and zeros out the missing data.

If there is missing data, but you did not declare 'has missing' : True then the missing data will simply be zeroed out and no field will be created to account for missing data.
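The effect of the two settings can be sketched as follows. This is an illustration of the idea rather than dedupe's code: the function name is hypothetical, and it assumes that a missing comparison arrives as None or NaN and that the indicator is 1 when the data was present.

```python
import math


def augment_distance(distance, has_missing=False):
    """Zero out a missing distance and optionally add a presence indicator."""
    missing = distance is None or math.isnan(distance)
    value = 0.0 if missing else float(distance)
    if has_missing:
        # Extra column: 1.0 when the data was present, 0.0 when it was missing.
        return [value, 0.0 if missing else 1.0]
    return [value]


print(augment_distance(0.7, has_missing=True))   # [0.7, 1.0]
print(augment_distance(None, has_missing=True))  # [0.0, 0.0]
print(augment_distance(None))                    # [0.0]
```

Without the indicator, a missing comparison and a genuine distance of zero look identical to the model; the extra column is what lets it tell them apart.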

This approach is called ‘response augmented data’ and is described in Benjamin Marlin’s thesis “Missing Data Problems in Machine Learning”. Basically, this approach says that, even without looking at the value of the field comparisons, the pattern of observed and missing responses will affect the probability that a pair of records are a match.

This approach makes a few assumptions that are usually not completely true:

Whether a field is missing data is not associated with whether any other field is missing data.
The weighting of the observed differences in field A is the same regardless of whether field B is missing.

If you define an interaction with a field that you declared to have missing data, then 'has missing' : True will also be set for the Interaction field.

Longer example of a variable definition:

[
    {'field': 'name', 'variable name' : 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
    {'field': 'city', 'variable name' : 'city', 'type': 'String'},
    {'field': 'zip', 'type': 'Custom', 'comparator' : same_or_not_comparator},
    {'field': 'cuisine', 'type': 'String', 'has missing': True},
    {'type': 'Interaction', 'interaction variables' : ['name', 'city']}
]

Multiple Variables comparing same field

It is possible to define multiple variables that all compare the same field.

For example:-

[
    {'field': 'name', 'type': 'String'},
    {'field': 'name', 'type': 'Text'}
]

This will create two variables that both compare the ‘name’ field but in different ways.

Optional Edit Distance

For String, ShortString, Address, and Name fields, you can choose to use a conditional random field distance measure for strings. This measure can give you more accurate results but is much slower than the default edit distance.

{'field': 'name', 'type': 'String', 'crf': True}