How to join datasets correctly is well understood. But we continue to do it badly. Here are notes on how to do fuzzy joins and long-to-wide transformations, starting with some general preprocessing advice.
Preprocessing Keys and Tables
- Remove stuff irrelevant to matching, e.g., font style, spaces, articles (the, a, an), etc.
- Standardize different versions of the same word. One common problem is abbreviations, e.g., Township and Twp., Saint and St., etc. Common misspellings can be handled similarly. But do all this with care; for example, St. may stand for both state and saint (see the sketch after this list).
- Remove duplicates if many-to-one (or one-to-many or many-to-many) matches are not allowed. Note that our ability to detect duplicates will be limited by how dirty the data are within each dataset.
- Often, a few recurring problems account for most missed matches. For instance, say you are matching on country, state, and city. Noise in how country names are spelled across datasets may explain most of the missed matches. Fix those first.
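To make this concrete, here is a minimal sketch of key preprocessing in pandas. The column name, article list, and abbreviation map are hypothetical placeholders; as noted above, any real abbreviation map needs care (e.g., "st." is ambiguous).

```python
import pandas as pd

# Hypothetical standardization tables; expand with care ("st." can also mean street/state).
ABBREVIATIONS = {"twp.": "township", "st.": "saint"}
ARTICLES = {"the", "a", "an"}

def clean_key(s: pd.Series) -> pd.Series:
    """Lowercase, drop articles, and expand abbreviations, token by token."""
    def clean_one(value: str) -> str:
        tokens = str(value).lower().split()
        tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in ARTICLES]
        return " ".join(tokens)
    return s.map(clean_one)

df = pd.DataFrame({"city": ["St. Paul", "The Hague", "Anytown Twp."]})
df["city_key"] = clean_key(df["city"])
# -> ["saint paul", "hague", "anytown township"]
```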
Fuzzy Join
Problems with joining data are mostly uniqueness problems. Fuzzy joins compound them because they implicitly do a one-to-many and a many-to-one join. After a fuzzy join with a reasonably generous distance threshold, one common workflow is to remove rows where the left table key is duplicated and rows where the right table key is duplicated. (You likely want to remove all rows with a duplicated key rather than keep the first row, which is the default behavior of many deduplication functions.) Another common tactic is to manually assess these cases, which also addresses the case where the closest match is not the right match. If manually reviewing all the rows isn't an option, code a small random sample and figure out the distance cutoff that gives a reasonable FP/FN tradeoff. (A sketch of this workflow follows the list below.) If you are building software for manual review:
- Arrange matches intelligently, e.g., matches on string + number first, then matches on string alone, etc. When there are more than five matches, alphabetical order works well as a default.
- If you have lots of data, consider Mechanical Turk or Captcha.
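Here is a minimal sketch of the fuzzy-join-then-deduplicate workflow described above, using pandas and the standard library's difflib. The toy tables, the cutoff of 0.6, and n=3 are illustrative assumptions, not recommendations; swap in your preferred string-distance library and a cutoff tuned on a coded sample.

```python
import difflib

import pandas as pd

left = pd.DataFrame({"city": ["saint paul", "minneapolis", "duluth"]})
right = pd.DataFrame({"city": ["st paul", "minneapolis", "duloth", "duluth mn"]})

# Collect every right-table key within the cutoff for each left-table key.
rows = []
for key in left["city"]:
    for match in difflib.get_close_matches(key, right["city"], n=3, cutoff=0.6):
        rows.append({"left_key": key, "right_key": match})
matches = pd.DataFrame(rows)

# Flag rows whose key repeats on either side; keep=False marks all copies,
# not just the later ones, which is usually what you want here.
ambiguous = (matches.duplicated("left_key", keep=False)
             | matches.duplicated("right_key", keep=False))
clean, review = matches[~ambiguous], matches[ambiguous]
# "duluth" matches both "duloth" and "duluth mn", so it lands in `review`.
```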
To improve the speed and tractability of fuzzy joins, there are three common tricks: 1. reduce the number of candidate strings by exact matching on some variables (which you can hand curate), 2. use multiprocessing with shared memory, and 3. eliminate exact matches up front by doing an inner join and removing the matched keys before fuzzy matching the rest.
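Here is a sketch of the first trick, sometimes called blocking: exact-match on one variable (state, in this hypothetical) so that each fuzzy lookup only scans the candidates within its block.

```python
import difflib

import pandas as pd

left = pd.DataFrame({"state": ["MN", "MN", "WI"],
                     "city": ["saint paul", "duluth", "madison"]})
right = pd.DataFrame({"state": ["MN", "MN", "WI"],
                      "city": ["st paul", "duloth", "madisen"]})

# Build one candidate list per state so each lookup scans a small block,
# not the whole right table.
right_by_state = {state: grp["city"].tolist()
                  for state, grp in right.groupby("state")}

rows = []
for _, row in left.iterrows():
    candidates = right_by_state.get(row["state"], [])
    for match in difflib.get_close_matches(row["city"], candidates, n=3, cutoff=0.6):
        rows.append({"state": row["state"],
                     "left_city": row["city"], "right_city": match})
matches = pd.DataFrame(rows)
```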
Long to Wide
Converting data from long to wide is the same as joining the underlying group datasets. Most software implementations, however, assume too much and do not provide a workflow that treats the operation as a join. In particular, many implementations assume that the "key" has no missing values or duplicates. (For time series data, the software may assume that all the dates exist.) It is useful to test the non-missingness and uniqueness of keys and to explicitly set a variable as the primary key. It is also good to think about what the imputed value of a column should be when the "key" (e.g., a year) is missing for some group. You could also limit the rows to those groups where all the years are present.
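A minimal sketch of this defensive workflow in pandas; the column names, values, and year range are hypothetical.

```python
import pandas as pd

long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "year":    [2000, 2001, 2000, 2001, 2002],
    "gdp":     [1.0, 1.1, 2.0, 2.1, 2.2],
})

# Test the key explicitly before pivoting: no missing values, no duplicates.
assert long_df[["country", "year"]].notna().all().all()
assert not long_df.duplicated(["country", "year"]).any()

# pivot() raises on duplicate (country, year) pairs, which is the failure
# mode we want from a join; pivot_table() would silently aggregate instead.
wide = long_df.pivot(index="country", columns="year", values="gdp")

# Make missing years explicit: country A lacks 2002, so it becomes NaN.
wide = wide.reindex(columns=range(2000, 2003))

# Or keep only the groups where all the years are present.
complete_only = wide.dropna()
```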