Statistics-ML dictionary

One of the most rewarding aspects of working on the UBC Master of Data Science program has been the close collaboration between my home department, computer science, and the statistics department here at UBC. The collaboration has also come with a challenge, though: the two communities often use different words to mean the same thing, or the same word to mean different things. A prime example is the word “bias”: it is hard to get an unbiased opinion on what this word means! Over the last couple of years I recorded these discrepancies in a “statistics / machine learning dictionary” and also expanded the scope of the document to include general terminology issues that arise in data science. The current version of the document is linked at the end of this post. As one highlight, I’ve found that all of the following terms can refer to the same thing: predictors, features, inputs, explanatory variables, regressors, covariates, and independent variables! In general, terminology is a tricky business because there aren’t always objective truths and yet people tend to feel quite passionate about it.

In the dictionary I did not attempt to delve into notation, but this too varies a lot across the disciplines. Linear regression forms a good example. In statistics, the features are typically represented by “X” and the coefficients by “β” (the Greek letter beta). In machine learning, we often use the letter “w” instead of “β”, which I imagine reflects our use of the word “weights” rather than “coefficients”. Thus, there are sometimes connections between terminology and notation. To make matters more exciting, the scientific computing community typically denotes the coefficients as “x” and the features as “A”. This reflects the community’s emphasis on linear regression as an optimization problem in which, as usual, we seek to find “x”.
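
To make the notation point concrete, here is a minimal sketch that solves one least-squares problem three times, once under each community's naming. It assumes NumPy is available. The toy data is made up for illustration, and the names “y” and “b” for the targets are my own assumption of what each community would typically write, since only the feature and coefficient symbols are discussed above.

```python
import numpy as np

# Hypothetical toy data: 5 examples, an intercept column plus one feature.
features = np.array([[1.0, 0.0],
                     [1.0, 1.0],
                     [1.0, 2.0],
                     [1.0, 3.0],
                     [1.0, 4.0]])
targets = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # exactly 1 + 2 * feature

# Statistics: features "X", coefficients "beta".
X, y = features, targets
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Machine learning: same features "X", but the coefficients are weights "w".
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Scientific computing: features "A", unknown "x" (targets named "b" by assumption).
A, b = features, targets
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# The three names refer to the same numbers.
same_answer = bool(np.allclose(beta, w) and np.allclose(w, x))
print(beta, w, x, same_answer)
```

The computation is identical in all three cases; only the variable names change, which is exactly what makes reading across the disciplines confusing at first.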

Another question I am frequently asked is how statistics and machine learning differ. It’s always a great moment when a student in my machine learning course asks, “isn’t this just statistics?” I have found that the two fields overlap significantly but often differ in emphasis and methodology. The exact differences aside, it’s become abundantly clear to me that both fields have a lot to contribute to data science, and I’m proud to be part of a data science program that brings them together. One of my goals over the next few years is to foster a student experience in which the line between “statistics courses” and “CS courses” becomes increasingly blurred, and the connections between them increasingly apparent.

Link to Statistics-ML dictionary: https://ubc-mds.github.io/resources_pages/terminology/

Mike Gelbart is an Instructor in the MDS program and the Department of Computer Science at UBC.