Merging Data with Dirty Keys

8 Oct
  1. Remove stuff that is irrelevant to matching. It can be font style, spaces, articles (the, a, an) or something else.
  2. Standardize different versions of the same word. For e.g., Township and Twp. may be converted to twp, and Saint and St. to saint. Common misspellings can also be handled in a similar manner. But do all this with care. For example, St. may stand for state, in addition to Saint.
  3. Remove duplicates if many to one (or one to many or many to many) matches are not allowed. Note that our ability to detect duplicates will be limited by how dirty the data are within each dataset.
  4. What is to be matched may be one column but it may have multiple identifiers — match on those identifiers individually. This may provide additional leverage where these multiple identifiers are in different order. If order doesn’t matter — as in string + number = number + string, code it so.
  5. If data are dirty in similar ways, then use complete automation to coerce identifier to be same across datasets.
  6. If data are dirty in different ways (misspelled in different ways, abbreviated in different ways, etc.), produce an additional column that carries a numeric value of how close was the match. Producing similarity distances between strings, and making judgments based on those similarity distances, can be done using versions of distance measures between strings, for example — Levenshtein distance. But get baseline error rates using training data.

    One has two choices — semi or complete automation. If misclassification penalty is high, and one wants as many matches as one can get — then semi-automated solutions likely provide the best route. Resources can dictate what option one chooses.

Semi automation — Show best matches among which people can choose.

  1. Produce the list of possible matches via liberal criteria — so as not to miss (m)any matches. Even if matches are missed, they can be eventually done manually. So there is an optimization between number of matches to show versus number of possibilities that come up without a match.
  2. Arrange matches intelligently – for example string + number, followed by string, etc. Where matches more than 5, arrange alphabetically.
  3. If lots of data, think of using Mechanical Turk or Captcha.