Merging Data with Dirty Keys

8 Oct
  1. Remove stuff irrelevant to matching, e.g., font style, spaces, articles (the, a, an), etc.

  2. Standardize different versions of the same word. One common problem is abbreviations, e.g., Township and Twp., Saint and St., etc. Common misspellings can also be handled similarly. But do all this with care. For example, St. may stand for state and saint.

  3. Remove duplicates if many-to-one (or one-to-many or many-to-many) matches are not allowed. Note that our ability to detect duplicates will be limited by how dirty the data are within each dataset.

  4. What is to be matched may be one column but it may have multiple identifiers. Match each identifier separately if these identifiers can be in different order.

  5. If data are dirty in similar ways, then use complete automation to coerce the identifiers to be the same across datasets.

  6. If data are dirty in different ways (misspelled in different ways, abbreviated in different ways, etc.), produce an additional column that carries a numeric value of how close was the match. Producing similarity distances between strings, and making judgments based on those similarity distances, can be done using versions of distance measures between strings, for example, Levenshtein distance. But get baseline error rates using training data.

    One has two choices — semi-automation or complete automation. If the misclassification penalty is high, and one wants as many matches as possible, semi-automated solutions likely provide the best route. Resources can dictate what option one chooses.

Semi automation — Show best matches among which people can choose.

  1. Produce a list of possible matches via a liberal criterion so as not to miss many matches. Even if matches are missed, they can eventually be done manually. So there is a trade-off between the number of matches shown versus the number of possibilities that come up without a match.

  2. Arrange matches intelligently – for example, string + number, followed by string, etc. Where matches more than 5, arrange alphabetically.

  3. If you have lots of data, think about using Mechanical Turk or Captcha.