Short Description

Interactive vs. scripted/unattended analyses and how to move fluidly between them. Reproducibility through automation and dynamic, literate documents. The use of version control and file organization to enhance machine- and human-readability.

Learning Outcomes

By the end of the course, students are expected to be able to:

  1. Analyze data interactively using a read-eval-print loop (REPL); write scripts for non-interactive use; use tools and work styles to create fluidity between these two modes (e.g., RStudio IDE, IPython).
  2. Produce dynamic reports that integrate narrative, code, data, numerical results, and visual results; create reproducible reports and workflows (e.g., R Markdown, Project Jupyter).
  3. Manage projects by designing workflows for self-documentation, reproducibility, and collaboration; organize files with appropriate naming conventions; manage paths and dependencies.
  4. Use version control software (e.g., Git), including distributed version control and remote servers (e.g., GitHub, Bitbucket).
  5. Automate data science workflows (e.g., using Make, Galaxy).

Prerequisites

  • DSCI 511 (Programming for Data Science)
  • DSCI 521 (Computing Platforms for Data Science)

Reference Material

TBD

Instructor (2016-2017)

Note: information on this page is preliminary and subject to change.