DSCI 521: Computing Platforms for Data Science

How to install, maintain, and use the data science software “stack”. Covers the Unix operating system, integrated development environments, and problem-solving strategies.

Course Learning Objectives

  • Use the Unix command line to navigate a computer’s filesystem.
  • Define and distinguish between absolute file paths and relative file paths.
  • Effectively use local and remote version control software (e.g., Git and GitHub) to organize projects and manage file versions.
  • Create, edit and run reproducible literate Python and R code documents (e.g., reports and presentations) using Jupyter and RMarkdown.
  • Write and edit Markdown and in-line LaTeX syntax within literate code documents.
  • Define and correctly use a project working directory.
  • Diagnose and troubleshoot programming and development environment problems, and explain how such problems can be avoided.

Lesson Learning Objectives and Readings

Lesson 0

  1. Launch JupyterLab

  2. Use the Notebook interface inside JupyterLab

  3. Know your way around the JupyterLab user interface

  4. Launch RStudio

  5. Use the .R script and RMarkdown .Rmd interface inside RStudio

  6. Know your way around the RStudio user interface

Lesson 1

  1. Recognize the directory hierarchy as it is commonly represented in diagrams, paths, and file explorer software.
  2. Distinguish common operators and representations of the different filesystem elements typically used in Bash.
  3. Explore the filesystem using Bash commands such as ls, pwd, and cd.
  4. Translate an absolute path into a relative path and vice versa.
  5. Use command-line arguments to produce alternative outputs of commands.
  6. Create, edit, move, and delete files and folders using the command line and VS Code.
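
As a small illustration of objective 4, the sketch below converts between absolute and relative paths using Python's standard library. The directory and file names are hypothetical and chosen only for the example; POSIX-style paths are assumed so the result does not depend on the machine running it.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical paths, used only for illustration.
working_dir = PurePosixPath("/home/student/dsci521")
absolute = PurePosixPath("/home/student/dsci521/data/raw.csv")

# Absolute -> relative: express the target relative to the working directory.
relative = absolute.relative_to(working_dir)
print(relative)

# Relative -> absolute: join the relative path back onto the working directory.
rebuilt = working_dir / relative
print(rebuilt)

# ".." refers to the parent directory; normpath resolves it textually.
sibling = posixpath.normpath(str(working_dir / "../dsci522/notes.md"))
print(sibling)
```

The same relative path points to a different file whenever the working directory changes, which is why the starting directory always has to be known when reading one.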

Lesson 2

  1. Implement SSH authentication
  2. Differentiate between the use of GitHub as a remote hosting service for version control and Git as a version control system.
  3. Create a Git repository.
  4. Implement Git’s clone, add, status, commit, pull, and push operations on the command line and their equivalent use in VS Code.
  5. Understand what using the staging area implies in a Git workflow.
  6. Recognize the commit as the primary building block for storing a project version characterized by an attached message and a hash that serves as a unique identifier.
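
To give a feel for how a hash can act as a unique identifier (objective 6), here is a rough sketch of how Git derives the ID of a stored file, assuming the default SHA-1 object format. It hashes a file "blob" rather than a full commit, because a commit object also records a tree, parent commits, author details, and the message. The helper name `git_blob_id` and the sample contents are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a file with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


first = git_blob_id(b"draft one\n")
second = git_blob_id(b"draft two\n")

print(first)   # 40 hexadecimal characters
print(second)  # a different ID, because the content differs
print(first == git_blob_id(b"draft one\n"))  # identical content, identical ID
```

Because the ID is computed from the content itself, any change to the content produces a different identifier.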

Lesson 3

  1. Explore the Git history via git log in the terminal and GitHub.
  2. Compare commits using git diff in the terminal and GitHub.
  3. Solve merge conflicts at the command line and in VS Code.
  4. Save transitory changes with git stash.
  5. Avoid pushing specific local files by adding a .gitignore file.
  6. Differentiate among the ways to restore your project history (git reset --hard/--soft, git revert) when working on an older version of a project.

Lesson 4

  1. Use the quarto terminal command to create different Quarto projects
  2. Create and edit a Quarto website
  3. Discover how GitHub can be used to serve static websites
  4. Modify a GitHub repository to publish a website

Lesson 5

  1. Create R projects in RStudio and use the here package to define robust file paths.
  2. Detect the basic components of a dynamic document in Jupyter Notebooks and in R Markdown.
  3. Explain markdown usage in relation to dynamic documents.
  4. Differentiate between code chunks and code cells in RMarkdown and Jupyter Notebooks.
  5. Select appropriate code chunk options for RMarkdown.
  6. Use semantic line breaks in version-controlled files.
  7. Specify metadata in the YAML header block.

Lesson 6

  1. Understand how Quarto extends the functionality of R Markdown documents.
  2. Explore different data science products to communicate your results: slides, blogs and books.
  3. Create slides using Jupyter Notebook and Quarto slides with reveal.js.
  4. Create a Jupyter Book and a Quarto book.
  5. Create a Quarto blog.
  6. Share rendered HTML files publicly via GitHub Pages.

Lesson 7

  1. Understand what a computational environment is and how it can ensure the reproducibility of a project
  2. Differentiate between Python, Anaconda, Miniconda, Conda, and pip
  3. Manage packages and environments in Python using Conda
  4. Manage packages and environments in R using renv

Lesson 8

  1. Understand the basic syntax and functionality of regular expressions (regex) for pattern matching.
  2. Explore the use of special characters, ranges, and anchors in regex to match specific patterns within text.
  3. Apply regex to search, extract, and manipulate data in various formats using practical examples.
  4. Use regular expressions to navigate and organize files within the filesystem.
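
A minimal sketch of the ideas in this lesson, using Python's built-in `re` module: anchors, a character class, capture groups, and a substitution applied to file names. The file names and the `labN_` naming scheme are hypothetical and serve only as an example.

```python
import re

# Hypothetical file names, used only for illustration.
filenames = ["lab1_report.ipynb", "lab2_report.Rmd", "lab10_notes.txt", "README.md"]

# ^ and $ anchor the pattern to the whole name, \d+ matches one or more digits,
# and the parentheses capture the lab number, the label, and the extension.
pattern = re.compile(r"^lab(\d+)_(\w+)\.(ipynb|Rmd)$")

matches = {}
for name in filenames:
    m = pattern.match(name)
    if m:
        matches[name] = (int(m.group(1)), m.group(2), m.group(3))

print(matches)

# Substitution: zero-pad the lab number so that names sort in numeric order.
padded = [
    re.sub(r"^lab(\d+)_", lambda m: f"lab{int(m.group(1)):02d}_", name)
    for name in filenames
]
print(padded)
```

Names that do not match the pattern, such as README.md, pass through the substitution unchanged.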

Attributions

Materials were inspired by, re-used, and re-mixed from the following sources:

  • Software Carpentry, specifically the Unix Shell and Git lessons
  • Happy Git and GitHub for the useR by Jenny Bryan and the STAT 545 TAs
  • Data 8: The Foundations of Data Science, specifically Lab 01
  • Data Carpentry Reproducible Science Workshop

License

The UBC Master of Data Science DSCI 521: Computing Platforms for Data Science course materials here are licensed under the Creative Commons Attribution 2.5 Canada License (CC BY 2.5 CA). If re-using/re-mixing please provide attribution and link to this webpage.