About this document
This document is intended to help students navigate the large amount of jargon, terminology, and acronyms encountered in the MDS program and beyond. There is also an accompanying blog post. Course numbers of the form DSCI XXX refer to relevant UBC MDS courses.
This section covers terms that have different meanings in different contexts, specifically statistics vs. machine learning (ML).
- in ML, regression refers to predicting continuous outputs given input features, and classification refers to predicting categorical outputs given input features.
- in statistics, both of the above tasks are referred to as regression.
- even within statistics this word has a lot of meanings. See also bias of an estimator.
- in ML, when we have a transformation of the form Wx+b (especially in linear models or neural networks) we refer to b as the “bias term”, or to the elements of b as the biases if b is a vector. For example, see this Stack Overflow post. In statistics we would call this the “intercept”.
- in both fields, we talk about the bias-variance tradeoff; see below.
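To make the ML sense of "bias term" concrete, here is a minimal sketch of the transformation Wx+b in plain Python. The function name `affine` and the numbers are made up for illustration.

```python
def affine(W, x, b):
    """Compute Wx + b for a matrix W (list of rows), a vector x, and a vector b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W = [[1.0, 2.0], [0.0, -1.0]]   # weights (illustrative values)
b = [0.5, 3.0]                  # "biases" in ML, "intercepts" in statistics
x = [2.0, 1.0]

y = affine(W, x, b)

# With x = 0 the output is exactly b, which is why statisticians say "intercept".
y_at_zero = affine(W, [0.0, 0.0], b)
```

Note that this b has nothing to do with the bias of an estimator; it is just the constant offset of the transformation.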
- in statistics, parameter is used to describe probability distributions, like “the gamma distribution has a shape parameter and a scale parameter”. Thus, a parametric model is a model using a parametric probability distribution.
- in ML, parameter refers to the components (usually numbers) that are getting learned in a system. A parametric model has a fixed number of parameters that is independent of the number of training examples, and typically doesn’t require the training examples to be stored in order to make predictions. An example would be linear regression, which involves one parameter per feature plus one more intercept parameter. On the other hand, k-nearest neighbours (KNN) would be an example of a nonparametric model as we don’t “distill” the training data into a fixed set of parameters. Another way to think about this is that, with KNN, the complexity of the model grows with the amount of training data.
- the differences above can cause confusion. For example, a statistician might say a linear support vector machine (SVM) is not a parametric classifier because it is not based on an underlying probabilistic model. And yet, in ML, a linear SVM is parametric because we’re learning one parameter per dimension (in the primal formulation) to represent a linear boundary, and thus the number of parameters is fixed.
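The ML distinction between parametric and nonparametric models can be sketched in a few lines. This is only an illustration with one feature and a 1-nearest-neighbour rule; the helper names `fit_linear` and `predict_1nn` and the data are hypothetical.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = w*x + b for a single feature; returns (w, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    w = covariance / variance
    b = y_mean - w * x_mean
    return w, b

def predict_1nn(train_xs, train_ys, x_new):
    """1-nearest-neighbour prediction: looks up the closest training example."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x_new))
    return train_ys[nearest]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1

# Parametric: the data are distilled into two numbers, however many examples we have.
w, b = fit_linear(xs, ys)
linear_pred = w * 1.4 + b   # xs and ys are no longer needed here

# Nonparametric: the training data must be kept around to make a prediction.
knn_pred = predict_1nn(xs, ys, 1.4)
```

Doubling the training set would leave the linear model with the same two parameters, while the KNN "model" would be twice as large.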
- in ML, this refers to terminating an optimization routine before reaching convergence, which may mitigate overfitting (see here).
- in statistics, this sometimes refers to stopping an experiment early, particularly as in early stopping of clinical trials.
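The ML meaning can be sketched as follows: run gradient descent on the training loss, monitor the loss on held-out data, and stop once it has not improved for a few steps. This is a toy sketch under stated assumptions (a one-weight model y = w*x, a "patience" rule, made-up data), not a description of any particular library's implementation.

```python
def mse(w, xs, ys):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def train_with_early_stopping(train, valid, lr=0.01, max_steps=1000, patience=3):
    """Gradient descent on the training loss, stopped when validation loss stalls."""
    w = 0.0
    best_w, best_val = w, mse(w, *valid)
    steps_without_improvement = 0
    for step in range(1, max_steps + 1):
        w -= lr * mse_grad(w, *train)
        val = mse(w, *valid)
        if val < best_val:
            best_w, best_val = w, val
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break   # terminate before the optimizer has converged
    return best_w, step

# Illustrative data: the training set prefers w = 3, the validation set prefers w = 2.
train = ([1.0, 2.0], [3.0, 6.0])
valid = ([1.0, 2.0], [2.0, 4.0])

best_w, stopped_at = train_with_early_stopping(train, valid)
```

Run to convergence, the optimizer would end up at the training optimum (w = 3); stopping early keeps the weight near the value that the held-out data prefer.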
- in statistics, this word evokes hypothesis testing.
- in ML, hypothesis is sometimes used to refer to a particular model or decision boundary from the hypothesis space. E.g., we select a linear decision boundary from the hypothesis space of all possible hyperplanes; more complicated models have a larger hypothesis space.
- in statistics, factor means categorical variable; factorial experiment means trying all possible combinations of two or more factors.
- spanning both statistics and ML, we have factor analysis and factor graphs, which do not use factor in the sense described above (or in the same way as each other, even…).
- (and in math, factor means one of several things being multiplied together.)
This is not a Stat vs. ML problem, but more of a Stat/ML vs. CS vs. math problem. For even more definitions, see the Wikipedia disambiguation page. Overall, this word ranks up there with bias as a hopelessly overloaded and confusing word.
- in statistics and ML, kernel methods refer to applications of the kernel trick, which is a computational speedup in cases where only the dot products of features are needed to make predictions. A common use case is kernel SVMs but they can be used in many other places.
- in density estimation and visualization, kernel density estimation involves placing a kernel-shaped “bump” in place of each point. This is related to kernel methods mentioned above, and in particular kernel regression.
- in computer science, the word kernel appears in the field of operating systems.
- as data scientists, we also see it in Jupyter notebooks: the “Python kernel”, etc. This is probably related to the CS meaning above?
- in linear algebra, kernel is another name for nullspace.
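Of the meanings above, kernel density estimation is the easiest to show in code. The sketch below places a Gaussian "bump" on each data point and averages them; the choice of a Gaussian kernel, the bandwidth value, and the data are assumptions made for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel-shaped 'bump'."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(points, x, bandwidth=1.0):
    """Kernel density estimate at x: the average of one bump centred on each point."""
    total = sum(gaussian_kernel((x - p) / bandwidth) for p in points)
    return total / (len(points) * bandwidth)

data = [1.0, 1.5, 4.0]   # illustrative data

density_near_cluster = kde(data, 1.2, bandwidth=0.5)   # close to two of the points
density_far_away = kde(data, 10.0, bandwidth=0.5)      # far from all of the points
```

The estimate is high where points cluster and close to zero far away from the data; a larger bandwidth gives wider bumps and a smoother estimate.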
- in statistics, it is standard to say “a sample of size N” when referring to data coming from a probability distribution.
- in machine learning, it is standard to say “N samples” to mean the same thing. E.g., “I will draw N samples from the exponential distribution”.
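In code the two phrasings describe the same object. A small sketch using Python's standard library (the seed, rate, and N are arbitrary choices):

```python
import random

random.seed(0)   # fixed seed so the sketch is reproducible
N = 5

# A statistician would call `draws` "a sample of size N";
# an ML practitioner would call it "N samples".
draws = [random.expovariate(1.0) for _ in range(N)]
```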
FDR: False Discovery Rate
MC: Monte Carlo
MCMC: Markov chain Monte Carlo
OOP: object-oriented programming
IID: independent and identically distributed
RV: random variable
CDF: cumulative distribution function
ROC: receiver operating characteristic (curve) – sort of like precision/recall
MAP: maximum a posteriori (DSCI 553), not to be confused with map of MapReduce (DSCI 513/525).
NLP: natural language processing
PCA: principal components analysis
EM: expectation maximization algorithm
Acronyms that have more than one meaning
S3: is an OOP paradigm in R (DSCI 524), and a storage service offered by Amazon (DSCI 525)
ML: Maximum Likelihood, Machine Learning
CI: Confidence Interval (DSCI 552), Continuous Integration (DSCI 524)
IP: Internet Protocol (as in IP address, DSCI 525), and Intellectual Property (as in software licenses, DSCI 524)
MDS: Master of Data Science, or Multi-Dimensional Scaling (DSCI 563).
Compare and contrast
underfitting refers to a model that is too simple to perform well even on the training data.
overfitting refers to a model that is too specific to the particular training set, potentially because the model is too complex.
bias-variance tradeoff vs. fundamental tradeoff of ML
bias-variance tradeoff refers to the tradeoff between the bias and the variance of an estimator (see above for more on the term bias).
fundamental tradeoff of ML refers to the tradeoff between training error and approximation error.
- Bias-variance refers to a specific model more in the abstract, whereas training/approximation error refer to a specific model trained on a specific training set.
- High bias is roughly the same idea as high training error (underfitting). It means that the model assumptions prevent us from achieving low training error.
- High variance is roughly the same idea as high approximation error (overfitting). It means that the model would change significantly if given different training data from the same distribution.
- following up on the ML definition of a parameter (see above), we also contrast this with a hyperparameter, which is a high-level decision that one makes before training the model. For example, “k” in k-nearest neighbours classification is a hyperparameter. Unlike parameters, hyperparameters are most often chosen by humans (although hyperparameter tuning is becoming automated to some extent as well).
A discriminative model directly models the probability of a given output given the inputs. On the other hand, with a generative model one starts from assumptions about how the data are generated (the “forward model”), and then performs inference about the model given the data (the “backward” step). The term generative model itself has some ambiguity: in some contexts it includes the prior distribution (see below) and in other contexts it does not.
This represents a divide in the field of statistics and is related to, but not the same as,
The Bayesian approach involves generative models and following Bayes’ Theorem. In this paradigm, one uses a generative model and attempts to compute a posterior distribution of the quantities of interest. See DSCI 553. The frequentist approach relates to hypothesis testing and does not involve prior distributions (which is both good and bad!). See DSCI 552.
In supervised learning you are given a set of input-output pairs and you try to model the relationship between them so that, given a new input, you can predict its output. In unsupervised learning there are no outputs. Your job is then to find some structure or pattern in the data. Sometimes it’s helpful to think of unsupervised learning as supervised learning except that you don’t know which features are the inputs and which are the output(s).
One (probably flawed) definition of learning is updating your parameters given your data. The term inference is perhaps slightly more specific, as it typically refers to inferring parameters of a model, either Bayesian or frequentist (see above). Here are some more thoughts on this.
training set vs. validation set vs. test set
There doesn’t seem to be a super strong consensus on what these terms mean. Most would agree that the training set is the data set that you train a model / learning algorithm on. A reasonable definition of validation set is that it’s a subset of your training data that you hold out to “validate” your model, i.e. assess how well it performs on data it did not see during training. Sometimes test set is used as a synonym of validation set; or sometimes it’s a second validation set that is only used once, after hyperparameter optimization is performed on the validation set; or sometimes it’s used to mean the feature values on which we’ll have to make predictions “in production”, in which case we’ll see the input values but never the true output values of the test set.
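Under the second of those usages (a test set that is held out separately and used only once), the three sets might be produced like this. This is a sketch only: the function name `three_way_split`, the 60/20/20 proportions, and the seed are arbitrary assumptions.

```python
import random

def three_way_split(examples, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the examples and split them into (train, validation, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    test = shuffled[:n_test]                     # looked at once, at the very end
    valid = shuffled[n_test:n_test + n_valid]    # used for hyperparameter tuning
    train = shuffled[n_test + n_valid:]          # used to fit the parameters
    return train, valid, test

train, valid, test = three_way_split(range(10))
```

Every example lands in exactly one of the three lists, so no data used for the final assessment has influenced training or tuning.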
These are the same thing when referring to a random variable. However, mean is sometimes used to refer to the sample mean. Therefore you can have the (sample) mean of the data you collected, but not the expected value. You can also discuss the mean of a bunch of numbers, like the mean (or average) of 1,2,3,4,5; but you wouldn’t say the expected value of these numbers.
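A fair six-sided die makes the distinction concrete: the expected value is computed from the distribution, while the sample mean is computed from data. The die, the seed, and the number of rolls are illustrative choices.

```python
import random

faces = [1, 2, 3, 4, 5, 6]

# Expected value: a property of the random variable (a fair die); no data needed.
expected_value = sum(face * (1 / 6) for face in faces)

# Sample mean: a property of the particular data we happened to collect.
rng = random.Random(0)
rolls = [rng.choice(faces) for _ in range(1000)]
sample_mean = sum(rolls) / len(rolls)
```

The sample mean will be close to the expected value of 3.5 but will change with every new set of rolls, whereas the expected value does not.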
length (of a vector)
dimension is unfortunately overloaded when it comes to vectors. From a math standpoint, we’re referring to the dimensionality of the space that the vector lives in, which is going to be the number of elements in the vector. For example, the vector x = [1,2,3] is a vector in 3-dimensional space because there are 3 coordinates. So we could say its dimension is 3. However, from a programming point of view, we would call x a 1-D array. In contrast, a “2-D array” is a rectangular structure with elements like x[2,3]; in other words, what we call a matrix in math. In Python/numpy, the programming dimensionality can be evaluated with x.ndim, and the math dimensionality with
length has its own problems too. In programming, the length of a vector is the number of elements; I also called this the size or the math-dimensionality above. However, in math the length may refer to the L2-norm of a vector; that is, the square root of the sum of the squares of its elements, computed using
In general, we have to be careful with these terms. If you say “x is 3-dimensional”, are you referring to
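The different notions can be checked side by side in numpy. Apart from `x.ndim`, which the text names, the calls below (`x.size`, `len(x)`, `np.linalg.norm`) are my own choice of standard numpy/Python functions for illustrating each meaning.

```python
import numpy as np

x = np.array([1, 2, 3])
A = np.array([[1, 2, 3], [4, 5, 6]])

programming_dim = x.ndim       # 1: x is a "1-D array"
math_dim = x.size              # 3: number of elements, so x lives in 3-dimensional space
programming_length = len(x)    # 3: "length" as the number of elements
math_length = np.linalg.norm(x)   # sqrt(1 + 4 + 9): "length" as the L2-norm

matrix_dim = A.ndim            # 2: a "2-D array", i.e. a matrix
matrix_shape = A.shape         # (2, 3)
```

So "3" can describe x in two of these senses while its programming dimensionality is 1 and its L2-norm is about 3.74.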
Type I error: false positive, meaning it was actually negative but you said positive
Type II error: false negative, meaning it was actually positive but you said negative
recall: true positive rate. In math: 1 - (# of type II errors)/(# of actual positives)
specificity: true negative rate. In math: 1 - (# of type I errors)/(# of actual negatives)
positive predictive value: 1 - (# of type I errors)/(# of reported positives)
Power, recall, and sensitivity are highly related. But we tend to think of power as a conditional probability (a property of a statistical test) whereas we tend to think of sensitivity/recall as counts (a property of the predictions).
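The definitions above translate directly into arithmetic on the four cells of a confusion matrix. A minimal sketch follows; the function name `error_metrics` and the counts are invented for illustration.

```python
def error_metrics(tp, fp, tn, fn):
    """Compute the rates above from counts of true/false positives/negatives.

    fp is the number of type I errors and fn is the number of type II errors.
    """
    actual_positives = tp + fn
    actual_negatives = tn + fp
    reported_positives = tp + fp
    return {
        "recall": 1 - fn / actual_positives,
        "specificity": 1 - fp / actual_negatives,
        "positive predictive value": 1 - fp / reported_positives,
    }

metrics = error_metrics(tp=40, fp=10, tn=45, fn=5)
```

Here recall is 40/45, specificity is 45/55, and positive predictive value is 40/50: recall and specificity divide by what was actually true, while positive predictive value divides by what was reported.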
Lasso regression means regression using L1 regularization.
Ridge regression means regression using L2 regularization.
This can be thought of in terms of regularization. As an example, using L2 regularization in regression “shrinks” the coefficients. But it’s best not to interpret “shrink” as “make smaller in magnitude”. In Bayesian terms, a regularizer is viewed as a prior distribution. You could have a prior that believes the weights are near some non-zero value, and thus the prior “shrinks your beliefs to that value”. Thus, shrinkage can be thought of as shrinking your uncertainty in the model parameters rather than literally making the weights smaller, although both become true when you are shrinking towards zero (which is almost always the case with regularization).
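A one-weight example shows both cases. Assuming a model y = w*x with no intercept and a penalty lam * (w - prior_mean)**2, setting the derivative of the penalized squared error to zero gives a closed form for w. The function name `ridge_1d`, the data, and the value of lam are illustrative.

```python
def ridge_1d(xs, ys, lam, prior_mean=0.0):
    """Minimise sum((y - w*x)**2) + lam * (w - prior_mean)**2 over a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy + lam * prior_mean) / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # the unregularized least-squares weight is exactly 2

w_ols = ridge_1d(xs, ys, lam=0.0)                           # 2.0
w_toward_zero = ridge_1d(xs, ys, lam=14.0)                  # 1.0, pulled towards 0
w_toward_five = ridge_1d(xs, ys, lam=14.0, prior_mean=5.0)  # 3.5, pulled towards 5
```

Shrinking towards zero makes the weight smaller in magnitude, but shrinking towards 5 makes it larger; in both cases the estimate moves towards what the regularizer (the prior) believes.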
In each equivalence class below, the terms tend to be used interchangeably (even if this isn’t a good idea!).
- regression coefficients = weights = parameters
- residual variance = noise variance
- inputs = predictors = features = explanatory variables = regressors = independent variables = covariates = attributes
- outputs = outcomes = targets = response variables = dependent variables. Also called a label if it’s categorical.
- Gaussian distribution = Normal distribution = bell curve
- sensitivity = recall (see Error types above)
- training = learning = fitting (sort of; depending on context)
- ordinary least squares (OLS) = linear least squares = least squares
- softmax classification = softmax regression = multinomial logit regression
- kriging = Gaussian process regression
- Bayesian network = Bayes net = directed graphical model = DAG graphical model
- Markov random field = Markov network = undirected graphical model
- one-hot encoding = one-of-k encoding = dummy variables (sort of)
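The "(sort of)" on the last item can be illustrated with a small sketch. One-hot encoding uses one column per category; dummy variables, as commonly used in regression, drop one category as a baseline. The helper `one_hot` and the colour data are hypothetical, and which column gets dropped is an arbitrary choice here.

```python
def one_hot(values):
    """Return the sorted categories and a one-hot row for each value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, encoded = one_hot(["red", "green", "red", "blue"])

# Dummy coding: drop the first column ("blue") and treat it as the baseline,
# so that the remaining columns are not collinear with an intercept.
dummies = [row[1:] for row in encoded]
```

With three categories, one-hot encoding gives three columns and dummy coding gives two; the baseline category is the row of all zeros.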
This is not really a terminology issue, but note a connection between collaborative filtering in recommender systems (ML) and data imputation (statistics). Both are about estimating missing values. But imputation is more of a means to an end whereas in collaborative filtering finding those values is often the end itself.