Important links#

Course learning outcomes#

This course is about identifying underlying structure in data. We will talk about clustering, data representation (e.g., dimensionality reduction and word embeddings), and recommendation systems.

By the end of the course, students are expected to be able to:

Explain the unsupervised paradigm.
Explain the intuition behind clustering and use appropriate clustering algorithms for applications such as image clustering and document clustering.
Interpret the results obtained after applying clustering.
Explain the intuition behind dimensionality reduction.
Broadly explain and use linear dimensionality reduction techniques such as PCA, LSA, and NMF.
Explain the intuition behind the word2vec model for creating word embeddings.
Train your own word embeddings and use pre-trained word embeddings.
Explain the recommender systems problem.
Broadly explain and use two common approaches to recommender systems: collaborative filtering and content-based filtering.
Explain consequences of using recommender systems.

Deliverables#

The following deliverables will determine your course grade:

Assessment	Weight	Where to submit
Lab Assignment 1	12%	Gradescope
Lab Assignment 2	12%	Gradescope
Lab Assignment 3	12%	Gradescope
Lab Assignment 4	12%	Gradescope
Class participation	2%	iClicker Cloud
Quiz 1	25%	PrairieLearn
Quiz 2	25%	PrairieLearn

See Calendar for the due dates.
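
As a rough illustration of how the weights in the table combine, here is a minimal sketch of a weighted-average calculation. The function name `course_grade`, the example scores, and the choice to count a missing assessment as 0 are assumptions made for illustration; they are not official grading rules.

```python
# Weights copied from the deliverables table, as fractions of the final grade.
WEIGHTS = {
    "Lab Assignment 1": 0.12,
    "Lab Assignment 2": 0.12,
    "Lab Assignment 3": 0.12,
    "Lab Assignment 4": 0.12,
    "Class participation": 0.02,
    "Quiz 1": 0.25,
    "Quiz 2": 0.25,
}


def course_grade(scores):
    """Return the weighted course grade (0-100) from per-assessment percentages.

    Assumption: an assessment missing from `scores` counts as 0.
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


# Hypothetical scores, for illustration only.
example_scores = {name: 80.0 for name in WEIGHTS}
example_scores["Quiz 2"] = 90.0
print(round(course_grade(example_scores), 2))
```

Because the weights sum to 100%, a 10-point improvement on a single quiz moves the course grade by 2.5 points, while the same improvement on a single lab moves it by 1.2 points.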

Lectures#

Format#

This class will follow a semi-flipped classroom format. For four of the eight lectures, you will be required to watch a few pre-recorded videos (~30 to ~50 min long) before the lecture. All videos are available on YouTube and are linked in the Lecture Schedule below. During lectures, I’ll summarize the content of the videos, but I’ll assume that you understand the basic concepts they cover, so that we can focus on more advanced material, iClicker exercises, discussions, demos, and class activities. Downloading the appropriate datasets provided below and putting them under your local lectures/data directory is optional but highly recommended, as is running the lecture Jupyter notebooks on your own and experimenting with the code.

Tentative Lecture Schedule#

This course occurs during Block 5 in the 2021/22 school year.

Lecture	Topic	Assigned videos	Resources and optional readings
0	Course Information
1	K-Means and intro to GMMs	📹 Videos: 14.1, 14.2, 14.3	`sklearn` clustering documentation; “Spaghetti Sauce” talk by Malcolm Gladwell; Visualizing-k-means-clustering; Visualizing K-Means algorithm with D3.js; Clustering with Scikit with GIFs
2	DBSCAN and Hierarchical Clustering	📹 Videos: 15.1, 15.2, 15.3	Comparison of sklearn clustering algorithms; DBSCAN Visualization; Clustering with Scikit with GIFs
3	Dimensionality Reduction Intro	📹 Videos: 17.1, 17.2, 17.3	PCA visualization; Introduction to Machine Learning with Python book, Chapter 3; Mike’s PCA video from CPSC 340; StatQuest PCA video
4	More PCA, LSA, NMF, Autoencoders	No videos
5	Word Vectors, Word Embeddings	📹 Videos: 18.1, 18.2, 18.3	Word2Vec papers: Distributed representations of words and phrases and their compositionality; Efficient estimation of word representations in vector space; word2vec Explained; Debiasing Word Embeddings
6	Topic modeling	No videos	Dave Blei video lecture, paper
7	Recommender Systems I	No videos	Collaborative filtering for recommendation systems in Python, by N. Hug; How Netflix’s Recommendations System Works
8	Recommender Systems II	No videos	SVDfeature

Datasets#

Here is the list of Kaggle datasets we’ll use in the lectures.

A small subset of 200 Bird Species with 11,788 Images (available here)
A tiny subset of Food-101 (available here)
Credit Card Dataset for Clustering
Countries of the World
Airline Sentiment
Jester 1.7M jokes ratings dataset
Amazon ratings data

If you want to be extra prepared, you may want to download these datasets in advance and save them under the lectures/data directory in your local copy of the repository.

Labs#

During labs, you will be given time to work on your own or in groups. There will be a lot of opportunity for discussion and getting help during lab sessions.

Installation#

We are providing you with a conda environment file which is available here. You can download this file and create a conda environment for the course and activate it as follows.

conda env create -f env-dsci-563.yml
conda activate 563

We’ve only attempted to install this environment file on a few machines, and you may encounter issues with certain packages from the yml file when executing the commands above. This is not uncommon and may suggest that the specified package version is not yet available for your operating system via conda. When this occurs, you can work around it with the following steps:

Modify the local version of the yml file to remove the line containing that package.
Create the environment without that package.
Activate the environment and install the package manually either with conda install or pip install in the environment.
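
The first step above can be done by hand in a text editor. As a minimal sketch, assuming the failing package appears on its own dependency line in the yml file, a small helper could also write a trimmed copy of the file. The function name `drop_package`, the output filename, and the package name in the usage comment are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def drop_package(env_file, package, out_file):
    """Write a copy of a conda environment file without the given package's line.

    Matches dependency entries such as "- name", "- name=1.2", or "- name>=1.2".
    Returns the number of lines that were removed.
    """
    kept, removed = [], 0
    for line in Path(env_file).read_text().splitlines():
        entry = line.strip()
        # Dependency entries start with "-"; everything else is kept as-is.
        name = entry[1:].strip() if entry.startswith("-") else ""
        # Cut off any version constraint that follows the package name.
        for sep in ("=", "<", ">", "!", " "):
            name = name.split(sep)[0]
        if name == package:
            removed += 1
        else:
            kept.append(line)
    Path(out_file).write_text("\n".join(kept) + "\n")
    return removed


# Hypothetical usage (the package and output file names are placeholders):
# drop_package("env-dsci-563.yml", "some-package", "env-dsci-563-local.yml")
```

You would then create the environment from the trimmed copy with conda env create -f, activate it, and install the omitted package manually with conda install or pip install.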

Note that this is not a complete list of the packages we’ll be using in the course and there might be a few packages you will be installing using conda install later in the course. But this is a good enough list to get you started.

Course communication#

We all are here to help you learn and succeed in the course and the program. Here is how we’ll be communicating with each other during the course.

Clarifications on the lecture notes or lab questions#

If any clarification is needed on the lecture material or lab questions, I’ll post a message on our course channel and tag you. It is your responsibility to read the messages in which you are tagged. (I know that you have many things to keep track of. You do not have to read every message, but please read carefully any message in which you are tagged.)

Questions on lecture material or labs#

If you have questions about the lecture material or lab questions, please post them on the course Slack channel rather than sending a direct message to me or the TAs. Doing so has two advantages:

You’ll get a quicker response.
Your classmates will benefit from the discussion.

When you ask your question on the course channel, please avoid tagging the instructor unless the question is specific to the instructor (e.g., if you notice a mistake in the lecture notes). If you tag a specific person, other teaching team members and your colleagues are discouraged from responding, which lowers the response rate on the channel.

Please use a consistent convention when you ask questions on Slack so that others, or future you, can search for them easily. For example, if you want to ask a question on Exercise 2.3 from Lab 1, start your post with the label lab1-ex2.3. If you have a question on lecture 2 material, start your post with the label lecture2. Once the question is answered or solved, you can add a “(solved)” tag before the label (e.g., (solved) lab1-ex2.3). Do not delete your post even if you figure out the answer on your own. The question and the discussion can still be beneficial to others.

Reference Material#

Books#

A Course in Machine Learning (CIML) by Hal Daumé III (also relevant for DSCI 572, 573, 575, 563)
Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas C. Mueller and Sarah Guido
The Elements of Statistical Learning (ESL)
Machine Learning: A Probabilistic Perspective (ML:APP)
Learning From Data (LFD)
Artificial Intelligence: A Modern Approach (AI:AMA)
An Introduction to Statistical Learning

Linear algebra review#

There are a bunch of suggestions here. We particularly recommend essence of linear algebra (YouTube series) and Immersive linear algebra (interactive e-book).
Introduction to Linear Algebra for Applied Machine Learning with Python

Online courses#

Mike’s CPSC 340
Machine Learning (Andrew Ng’s famous Coursera course)
Foundations of Machine Learning online course from Bloomberg.
Machine Learning Exercises In Python, Part 1 (translation of Andrew Ng’s course to Python, also relevant for DSCI 561, 572, 563)

Policies#

Please see the general MDS policies.
