One of the most rewarding aspects of working on the MDS program has been the close collaboration between my home department, computer science, and the statistics department here at UBC. The collaboration has also come with a challenge, though: the two communities often use different words to mean the same thing, or the same word to mean different things. A prime example is the word “bias”: it is hard to get an unbiased opinion on what this word means! Over the last couple of years I have recorded these discrepancies in a “statistics / machine learning dictionary” and have also expanded the scope of the document to include general terminology issues that arise in data science. The current version of the document can be found here. In general, terminology is a tricky business because there aren’t always objective truths and yet people tend to feel quite passionate about it.
In the dictionary I did not attempt to delve into notation, but this too varies a lot across the disciplines. Linear regression is a good example. In statistics, the features are typically represented by “X” and the coefficients by “β” (the Greek letter beta). In machine learning, we often use the letter “w” instead of “β”, which reflects our use of the word “weights” rather than “coefficients”. Thus, terminology and notation are sometimes connected. To make matters more exciting, in the scientific computing community the coefficients are typically “x” and the features “A”. This reflects that community’s emphasis on linear regression as an optimization problem in which, as usual, we seek to find “x”.
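To make the notational differences concrete, here is a minimal sketch that solves one small least-squares problem three times, once under each naming convention. The data are synthetic and made up purely for illustration, ordinary least squares via NumPy's `lstsq` is assumed as the fitting method, and the names “y” and “b” for the targets are common conventions that I am assuming here rather than ones discussed above. The only point is that the three letters refer to the same vector of numbers.

```python
import numpy as np

# Synthetic data, made up for illustration: 50 examples, 3 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
true_coefficients = np.array([2.0, -1.0, 0.5])
targets = features @ true_coefficients + 0.1 * rng.normal(size=50)

# Statistics: y is modelled as X times beta; estimate the coefficients beta.
X, y = features, targets
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Machine learning: same model, but we "learn the weights" w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Scientific computing: minimize ||Ax - b||^2 over the unknown x.
A, b = features, targets
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# All three conventions describe the same problem and the same solution.
same_answer = np.allclose(beta_hat, w) and np.allclose(w, x)
print(same_answer)
```

The three calls are deliberately identical: only the variable names change, which is exactly the kind of difference that can make a familiar method look foreign when reading another community's literature.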
One other question I didn’t dare attempt to answer in the dictionary, but which I am frequently asked, concerns the difference between statistics and machine learning. I have found these two fields to have significant overlap, but often different emphases and methodologies. The exact differences aside, it has become abundantly clear to me that both fields have a lot to contribute to data science, and I’m proud to be part of a data science program that brings these fields together. Indeed, one of my goals over the next few years is to foster a student experience in which the line between “statistics courses” and “CS courses” becomes increasingly blurred, and the connections between them increasingly apparent.
Mike Gelbart is a Teaching Fellow in the MDS program and a Lecturer in the Department of Computer Science at UBC.