DSCI 521: Computing Platforms for Data Science
How to install, maintain, and use the data scientific software “stack”. The Unix operating system, integrated development environments, and problem solving strategies.
Course Learning Objectives
- Use the Unix command line to navigate their computer’s filesystem.
- Define and distinguish between absolute file paths and relative file paths.
- Effectively use local and remote version control software (e.g., Git and GitHub) to organize projects and manage file versions.
- Create, edit and run reproducible literate Python and R code documents (e.g., reports and presentations) using Jupyter and RMarkdown.
- Write and edit Markdown and in-line LaTeX syntax within literate code documents.
- Define and correctly use a project working directory.
- Diagnose and troubleshoot programming and development environment problems, and explain how such problems can be avoided.
Lesson Learning Objectives and Readings
Lesson 0
- Launch JupyterLab
- Use the Notebook interface inside JupyterLab
- Know your way around the JupyterLab user interface
- Launch RStudio
- Use the .R script and RMarkdown .Rmd interface inside RStudio
- Know your way around the RStudio user interface
Lesson 1
- Recognize the directory hierarchy as it is commonly represented in diagrams, paths, and file explorer software.
- Distinguish common operators and representations of the different filesystem elements typically used in Bash.
- Explore the filesystem using Bash commands such as ls, pwd, and cd.
- Translate an absolute path into a relative path and vice versa.
- Use command-line arguments to produce alternative outputs of commands.
- Create, edit, move, and delete files and folders using the command line and VS Code.
- The Unix Shell: Navigating Files and Directories or The shell
- Introduce yourself to Git
- Connect to GitHub
- Jupyter Notebook Tutorial: The Definitive Guide
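The translation between absolute and relative paths named in the Lesson 1 objectives can be sketched in Python. This is a minimal illustration, not course material: the function names `to_relative` and `to_absolute` and the example paths are hypothetical, and `posixpath` is used so the result follows Unix path rules on any operating system.

```python
import posixpath


def to_relative(absolute: str, working_dir: str) -> str:
    """Express an absolute path relative to a working directory."""
    return posixpath.relpath(absolute, start=working_dir)


def to_absolute(relative: str, working_dir: str) -> str:
    """Resolve a relative path against a working directory."""
    # normpath collapses the ".." components after joining.
    return posixpath.normpath(posixpath.join(working_dir, relative))


# Hypothetical paths, chosen only for illustration.
cwd = "/home/student/dsci521"
print(to_relative("/home/student/data/raw.csv", cwd))  # ../data/raw.csv
print(to_absolute("../data/raw.csv", cwd))             # /home/student/data/raw.csv
```

The relative form only makes sense together with a working directory, which is why both helpers take one as an explicit argument.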
Lesson 2
- Implement SSH authentication
- Differentiate between the use of GitHub as a remote hosting service for version control and Git as a version control system.
- Create a Git repository.
- Implement Git’s clone, add, status, commit, pull, and push operations on the command line and their equivalent use in VS Code.
- Understand what using the staging area implies in a Git workflow.
- Recognize the commit as the primary building block for storing a project version characterized by an attached message and a hash that serves as a unique identifier.
- Excuse me, do you have a moment to talk about version control? by Jenny Bryan
- Comparing commits across time
- Resolving a merge conflict using the command line
- nbdime – diffing and merging of Jupyter Notebooks
Lesson 3
- Explore the Git history via git log in the terminal and GitHub.
- Compare commits using git diff in the terminal and GitHub.
- Solve merge conflicts at the command line and in VS Code.
- Save transitory changes with git stash.
- Avoid pushing specific local files by including a .gitignore file.
- Differentiate among the ways to restore your project history (git reset --hard/--soft, git revert) when working on an older version of a project.
- Comparing commits across time
- Resolving a merge conflict using the command line
- nbdime – diffing and merging of Jupyter Notebooks
Lesson 4
- Use the quarto terminal command to create different Quarto projects
- Create and edit a Quarto website
- Discover how GitHub can be used to serve static websites
- Modify a GitHub repository to publish a website
Lesson 5
- Create RProjects in RStudio, using the here package to define robust file paths.
- Detect the basic components of a dynamic document in Jupyter Notebooks and in R Markdown.
- Explain markdown usage in relation to dynamic documents.
- Differentiate between code chunks and code cells in RMarkdown and Jupyter Notebooks.
- Select appropriate code chunk options for RMarkdown.
- Use semantic line breaks in version-controlled files.
- Specify metadata in the YAML header block.
- Using RStudio Projects
- R Markdown Cheat Sheet
- R Markdown home page
- R Markdown: The Definitive Guide
- R Markdown code chunk options
- What they forgot to teach you about R
- Connect RStudio to Git and GitHub
- R-cubed rostools workshop
Lesson 6
- Understand how Quarto extends the functionality of R Markdown documents.
- Explore different data science products to communicate your results: slides, blogs and books.
- Create slides using Jupyter Notebook and Quarto slides with reveal.js.
- Create a Jupyter Book and a Quarto book.
- Create a Quarto Blog.
- Share rendered HTML files publicly via GitHub Pages.
Lesson 7
- Understand what a computational environment is and how it can ensure the reproducibility of a project.
- Differentiate Python, Anaconda, Miniconda, Conda, and pip.
- Manage packages and environments in Python using Conda.
- Manage packages and environments in R using renv.
Lesson 8
- Understand the basic syntax and functionality of regular expressions (regex) for pattern matching.
- Explore the use of special characters, ranges, and anchors in regex to match specific patterns within text.
- Apply regex to search, extract, and manipulate data in various formats using practical examples.
- Use regular expressions to navigate and organize files within the filesystem.
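The Lesson 8 objectives on anchors, ranges, and applying regex to files can be illustrated with a short Python sketch. The filenames, the naming scheme, and the helper `parse_lab_files` are invented for this example and are not taken from the course materials.

```python
import re

# Hypothetical filenames, invented for this example.
filenames = ["lab1_report.ipynb", "lab2_report.Rmd", "notes.txt", "lab10_data.csv"]

# ^ and $ anchor the match to the whole name, \d+ matches one or more digits,
# \w+ matches the label, and (ipynb|Rmd|csv) lists the accepted extensions.
pattern = re.compile(r"^lab(\d+)_(\w+)\.(ipynb|Rmd|csv)$")


def parse_lab_files(names):
    """Return sorted (lab number, label, extension) tuples for matching names."""
    results = []
    for name in names:
        match = pattern.match(name)
        if match:
            results.append((int(match.group(1)), match.group(2), match.group(3)))
    return sorted(results)


print(parse_lab_files(filenames))
# [(1, 'report', 'ipynb'), (2, 'report', 'Rmd'), (10, 'data', 'csv')]
```

Converting the captured lab number to an integer makes lab 10 sort after lab 2, which a plain alphabetical sort of the filenames would not do.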
Attributions
Materials were inspired, re-used and re-mixed from the following sources:
- Software Carpentry, specifically the Unix Shell and Git lessons
- Happy Git and GitHub for the useR by Jenny Bryan and the STAT 545 TAs
- Data 8: The Foundations of Data Science, specifically Lab 01
- Data Carpentry Reproducible Science Workshop
License
The UBC Master of Data Science DSCI 521: Computing Platforms for Data Science course materials here are licensed under the Creative Commons Attribution 2.5 Canada License (CC BY 2.5 CA). If re-using/re-mixing please provide attribution and link to this webpage.