Interview with InfoQ

26 Apr

I recently gave an interview to InfoQ about my paper (and associated open source software) on predicting the race and ethnicity of a person using the sequence of characters in a name.

Here a relevant excerpt:

InfoQ: Can you discuss how we can learn from names? What ML/DL algorithms can we use?

Gaurav Sood:  Learning more about a person from their name is no different from tackling any other supervised ML problem. It all starts with getting (or creating) a large labeled corpus. For instance, one key innovation in ethnicolr is the training data—we use voting registration files to get a large labeled corpus. In another project on learning from names, I scraped Google Image Search results to build the training data for inferring the gender from a name.

Once you have the data, find ways to exploit patterns in the data to learn a model. Some early ventures exploited the fact that names of different kinds of people began/ended differently. For instance, female names in India often end with an ‘a,’ and you can exploit that pattern to infer gender from Indian names. In ethnicolr, we generalize this intuition and use patterns in sequences of characters. (I am also working on exploiting sequences of sounds.) Like Ye et al., you could also rely on the fact that we correspond more frequently with co-ethnics and exploit email networks for building your models.

To exploit the patterns in the data, the full-range of DL/ML tools is available to you. Use what works best.