goji berries

Quality Data: Plumbing ML Data Pipelines

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

Store these data in JSON so that you can use this information to validate against. Produce the JSON for each update. You can flag when data are some s.d. above below last ingest.

Store all this metadata with the data. For e.g., you can extend the dataframe class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in notebook This is translation.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with what the method I highlight above, you should be.

Exit mobile version