Organization of Data Science projects and some useful tools

Learning outcomes

  1. Understand the basic syntax and functionality of regular expressions (regex) for pattern matching.
  2. Explore the use of special characters, ranges, and anchors in regex to match specific patterns within text.
  3. Apply regex to search, extract, and manipulate data in various formats using practical examples.
  4. Use regular expressions to navigate and organize files within the filesystem.

Activity 1

Imagine you are starting a new data science project. Based on what you know about project organization, how would you structure your project directory? Consider where you would place important files like raw data, processed data, code, reports, and environment settings.

Take a moment to think through these questions:

- What folders would you create, and why?
- Where would you place key files like your `.csv` data, analysis notebooks, and README?
- How would you structure the project to ensure it is easy to maintain and reproducible?

Jot down your ideas.

After reflecting on your answers, compare your project structure with your partner’s. Identify similarities and differences between your approaches. What organizational choices did you both make? Were there any unique folder structures or file placements?

Project organization

Broadly speaking you will find two different ways of organizing a project: as a data analysis project and code project. Neither of this options is rigid enough to can give you here a exhaustive list of folder types and files you would expect on this projects. You will spend some time exploring different examples. However, we can work with a general structure:

project/
├── data/              *.csv
   ├── processed/
   └── raw/
├── reports/           *.ipynb *.Rmd
├── src/               *.py *.R
├── doc/               *.md
├── README.md
└── environment.yaml (or renv.lock)

data If your original data file is stored as part of the repository and should never be overridden. That is why the original files are stored in a sub-folder called raw.

reports The documentation is the biggest difference between data analysis projects and software projects. Would you expect a report on a software project? In general notebooks and other literate programming documents (that prioritize narrative and storytelling) are the favorite options to share data analysis results [1]. You will create your first data science project in DSCI 522: Statistical Inference and Computation. Software projects are mostly focused in the development of tools and in this cases other types of documentation (docstrings, vignettes, readtheDocs, JupyterBooks) are the favorite source of documentation for this kind of projects. We will learn more about them in DSCI 524: Collaborative Software Development.

src is where the source code for your project will be. If you are not creating a package, you may use the analysis directory. In package development, you will have a src/project_name directory in Python, and just the R directory (instead of src) in R. You will learn more about project hierarchy when making packages in DSCI 524.

doc Documentation. You can also use this directory for a GitHub pages enabled website.

README.md The README file is probably the first file a potential user/contributor will read.

environment Specifying the computational environment will help you to ensure the reproducibility of your analysis.

Activity 2

Which set of file(name)s do you want at 3 a.m. before a deadline?

A. `Joe’s Filenames Use Spaces and Punctuation.xlsx`
B. `figure 1.png`
C. `JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt`
D. `2014-06-08_abstract-for-sla.docx`
E. `myabstract.docx`

Filenames

In our first lecture, we learned about bash and talked about the downsides of making your files and folder names “too pretty” with capitalizations and spaces. Here we will discuss better ways to name our files and folders.

The below section comes from Jenny Bryan’s Naming things

Names matter

cheers

What works, what doesn’t?

NO

myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt

YES

2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_talk-scatterplot-length-vs-interest.png
fig02_talk-histogram-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt

Three principles for (file) names

  1. Machine readable
  2. Human readable
  3. Plays well with default ordering

Machine readable

awesome_names
  • Regular expression and globbing friendly
    • Avoid spaces, punctuation, accented characters, case sensitivity
  • Easy to compute on
    • Deliberate use of delimiters

Globbing

Excerpt of complete file listing:

plasmid_names

Example of globbing to narrow file listing:

plasmid_names

Same using Mac OS Finder search facilities

plasmid_mac_os_search

Same using regex in R

plasmid_regex

Punctuation

Deliberate use of "-" and "_" allows recovery of meta-data from the filenames:

  • "_" underscore used to delimit units of meta-data I want later
  • "-" hyphen used to delimit words so my eyes don’t bleed

plasmid_delimiters

plasmid_delimiters_code

This happens to be R but also possible in the shell, Python, etc.

Recap: machine readable

  • Easy to search for files later

  • Easy to narrow file lists based on names

  • Easy to extract info from file names, e.g. by splitting

  • New to regular expressions and globbing? be kind to yourself and avoid

    • Spaces in file names
    • Punctuation
    • Accented characters
    • Different files named foo and Foo

Human readable

  • Name contains info on content

  • Connects to concept of a slug from semantic URLs

Which set of file(name)s do you want at 3 a.m. before a deadline?

human_readable_not_options

Embrace the slug

slug filenames

slug

Recap: Human readable

Easy to figure out what the heck something is, based on its name

Plays well with default ordering

  • Put something numeric first
  • Use the ISO 8601 standard for dates
  • Left pad other numbers with zeros

Chronological order:

chronological_order

Logical order: Put something numeric first

logical_order

Dates

Use the ISO 8601 standard for dates: YYYY-MM-DD

chronological_order

iso_psa

Comprehensive map of all countries in the world that use the MM-DD-YYYY format

map_mmddyyy


From https://twitter.com/donohoe/status/597876118688026624.

Left pad other numbers with zeros

logical_order


If you don’t left pad, you get this:

10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R

which is just sad :(

Recap: Plays well with default ordering

  • Put something numeric first
  • Use the ISO 8601 standard for dates
  • Left pad other numbers with zeros

Recap

Three principles for (file) names

  1. Machine readable
  2. Human readable
  3. Plays well with default ordering

Pros: - Easy to implement NOW - Payoffs accumulate as your skills evolve and projects get more complex

Go forth and use awesome file names :)

chronological_order


logical_order

Activity 3

Which of the following items describe three important rules for naming files on a computer? Select all correct answers.

A. File name parts should be easy to extract programmatically
B. Humans would understand something about file contents by looking at their names
C. Files with appropriate names would be nicely organized by default
D. File names should always use lower-case letters
E. Human understanding of file names should be prioritized over machine understanding

Activity 4

Which of the following are GOOD date formats to use in a filename?

A. 2020-01-26T0233
B. 20200126T0233
C. 2020-01-26

References

[1] Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A. (2018, April). The story in the notebook: Exploratory data science using a literate programming tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1-11).