Organization of Data Science projects and some useful tools
Learning outcomes
- Understand the basic syntax and functionality of regular expressions (regex) for pattern matching.
- Explore the use of special characters, ranges, and anchors in regex to match specific patterns within text.
- Apply regex to search, extract, and manipulate data in various formats using practical examples.
- Use regular expressions to navigate and organize files within the filesystem.
Activity 1
Imagine you are starting a new data science project. Based on what you know about project organization, how would you structure your project directory? Consider where you would place important files like raw data, processed data, code, reports, and environment settings.
Take a moment to think through these questions:
- What folders would you create, and why?
- Where would you place key files like your `.csv` data, analysis notebooks, and README?
- How would you structure the project to ensure it is easy to maintain and reproducible?
Jot down your ideas.
After reflecting on your answers, compare your project structure with your partner’s. Identify similarities and differences between your approaches. What organizational choices did you both make? Were there any unique folder structures or file placements?
Project organization
Broadly speaking, you will find two different ways of organizing a project: as a data analysis project or as a code project. Neither option is rigid enough for us to give an exhaustive list of the folders and files you should expect in these projects, so you will spend some time exploring different examples. However, we can work with a general structure:
project/
├── data/ *.csv
│ ├── processed/
│ └── raw/
├── reports/ *.ipynb *.Rmd
├── src/ *.py *.R
├── doc/ *.md
├── README.md
└── environment.yaml (or renv.lock)
data Your original data files are stored as part of the repository and should never be overwritten. That is why the original files are kept in a sub-folder called raw.
reports The documentation is the biggest difference between data analysis projects and software projects. Would you expect a report on a software project? In general, notebooks and other literate programming documents (which prioritize narrative and storytelling) are the preferred way to share data analysis results [1]. You will create your first data science project in DSCI 522: Statistical Inference and Computation. Software projects focus mostly on the development of tools, and in these cases other types of documentation (docstrings, vignettes, Read the Docs, Jupyter Books) are preferred. We will learn more about them in DSCI 524: Collaborative Software Development.
src is where the source code for your project will be. If you are not creating a package, you may use an analysis directory instead. In package development, you will have a src/project_name directory in Python, and just the R directory (instead of src) in R. You will learn more about project hierarchy when making packages in DSCI 524.
doc Documentation. You can also use this directory for a GitHub pages enabled website.
README.md The README file is probably the first file a potential user/contributor will read.
environment Specifying the computational environment will help you to ensure the reproducibility of your analysis.
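The general structure above can be created by hand, but a short script makes it repeatable. The following is a minimal sketch in Python, assuming the folder and file names from the tree above; the function name `create_project_skeleton` is hypothetical, and the example builds the skeleton in a temporary directory so that nothing existing is touched.

```python
from pathlib import Path
import tempfile


def create_project_skeleton(root):
    """Create the folders and empty placeholder files from the layout above."""
    root = Path(root)
    # Sub-folders are created together with their parents (e.g. data/ for data/raw/).
    for folder in ["data/raw", "data/processed", "reports", "src", "doc"]:
        (root / folder).mkdir(parents=True, exist_ok=True)
    # touch() creates an empty file and leaves an existing one in place.
    for filename in ["README.md", "environment.yaml"]:
        (root / filename).touch(exist_ok=True)
    return root


# Build the skeleton inside a throwaway directory (an assumption for this demo).
project = create_project_skeleton(Path(tempfile.mkdtemp()) / "project")
created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
print(created)
```

For an R project you would swap `environment.yaml` for `renv.lock`, which is normally generated by renv rather than created empty.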
Activity 2
Which set of file(name)s do you want at 3 a.m. before a deadline?
A. `Joe’s Filenames Use Spaces and Punctuation.xlsx`
B. `figure 1.png`
C. `JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt`
D. `2014-06-08_abstract-for-sla.docx`
E. `myabstract.docx`
Filenames
In our first lecture, we learned about bash and talked about the downsides of making your files and folder names “too pretty” with capitalizations and spaces. Here we will discuss better ways to name our files and folders.
The section below comes from Jenny Bryan’s Naming things:
- slides: https://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf
- Source code: https://github.com/jennybc/organization-and-naming
Names matter
What works, what doesn’t?
NO
myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt
YES
2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_talk-scatterplot-length-vs-interest.png
fig02_talk-histogram-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt
Three principles for (file) names
- Machine readable
- Human readable
- Plays well with default ordering
Machine readable
- Regular expression and globbing friendly
- Avoid spaces, punctuation, accented characters, case sensitivity
- Easy to compute on
- Deliberate use of delimiters
Globbing
Excerpt of complete file listing:
Example of globbing to narrow file listing:
Same using Mac OS Finder search facilities
Same using regex in R
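The same kind of narrowing can be sketched in Python. The file names below are hypothetical and only illustrate the idea; `fnmatch` applies shell-style glob patterns to a list of names, and `re` applies a regular expression.

```python
import fnmatch
import re

# Hypothetical file listing with consistent, delimiter-based names.
files = [
    "2013-06-26_assay-a_plate-1_A01.csv",
    "2013-06-26_assay-a_plate-1_A02.csv",
    "2013-06-26_assay-b_plate-1_A01.csv",
    "2014-02-26_assay-a_plate-2_B01.csv",
]

# Globbing: keep every file whose name contains the "assay-a" unit.
glob_matches = fnmatch.filter(files, "*_assay-a_*")

# Regex: keep files from 2013 whose last unit is "A" followed by two digits.
pattern = re.compile(r"^2013-.*_A\d{2}\.csv$")
regex_matches = [name for name in files if pattern.search(name)]

print(glob_matches)   # the three assay-a files
print(regex_matches)  # the three 2013 files ending in A01 or A02
```

Both filters stay short because every name follows the same pattern; names with spaces or inconsistent case would need more defensive patterns.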
Punctuation
Deliberate use of "-" and "_" allows recovery of meta-data from the filenames:
- "_" underscore used to delimit units of meta-data I want later
- "-" hyphen used to delimit words so my eyes don’t bleed
This happens to be R but also possible in the shell, Python, etc.
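As a minimal Python sketch of the same idea, the function below splits a file name on "_" to recover the units of meta-data and then on "-" to recover the words. The function name `parse_filename` and the returned keys are hypothetical; the file name is one of the "YES" examples above, which has a date unit followed by a slug.

```python
from pathlib import Path


def parse_filename(filename):
    """Split a name of the form <date>_<word>-<word>-... into its parts."""
    path = Path(filename)
    # "_" separates the units of meta-data; split only once, at the first "_".
    date, slug = path.stem.split("_", maxsplit=1)
    # "-" separates the words inside the slug.
    return {"date": date, "words": slug.split("-"), "extension": path.suffix}


info = parse_filename("2014-06-08_abstract-for-sla.docx")
print(info)
# {'date': '2014-06-08', 'words': ['abstract', 'for', 'sla'], 'extension': '.docx'}
```

This only works because the delimiters are used deliberately: if "_" also appeared inside the slug, the split would no longer line up with the meta-data units.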
Recap: machine readable
Easy to search for files later
Easy to narrow file lists based on names
Easy to extract info from file names, e.g. by splitting
New to regular expressions and globbing? Be kind to yourself and avoid:
- Spaces in file names
- Punctuation
- Accented characters
- Different files named foo and Foo
Human readable
Name contains info on content
Connects to concept of a slug from semantic URLs
Which set of file(name)s do you want at 3 a.m. before a deadline?
Embrace the slug
Recap: Human readable
Easy to figure out what the heck something is, based on its name
Plays well with default ordering
- Put something numeric first
- Use the ISO 8601 standard for dates
- Left pad other numbers with zeros
Chronological order:
Logical order: Put something numeric first
Dates
Use the ISO 8601 standard for dates: YYYY-MM-DD
Comprehensive map of all countries in the world that use the MM-DD-YYYY format
From https://twitter.com/donohoe/status/597876118688026624.
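The reason YYYY-MM-DD plays well with default ordering is that sorting the text alphabetically is the same as sorting the dates chronologically. The sketch below illustrates this with three arbitrary dates (chosen only for the demonstration) and compares ISO 8601 strings with MM-DD-YYYY strings.

```python
from datetime import date

# Three arbitrary dates, deliberately out of order.
dates = [date(2014, 6, 8), date(1986, 1, 28), date(2013, 12, 1)]

iso_strings = [d.isoformat() for d in dates]           # YYYY-MM-DD
us_strings = [d.strftime("%m-%d-%Y") for d in dates]   # MM-DD-YYYY

# Alphabetical sort of ISO strings is also chronological.
print(sorted(iso_strings))  # ['1986-01-28', '2013-12-01', '2014-06-08']

# Alphabetical sort of MM-DD-YYYY strings groups by month, not by time.
print(sorted(us_strings))   # ['01-28-1986', '06-08-2014', '12-01-2013']
```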
Left pad other numbers with zeros
If you don’t left pad, you get this:
10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R
which is just sad :(
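A short Python sketch shows both the problem and the fix, using the three file names above. The helper `left_pad` is hypothetical and assumes every name starts with a number followed by "_".

```python
names = [
    "10_final-figs-for-publication.R",
    "1_data-cleaning.R",
    "2_fit-model.R",
]


def left_pad(name, width=2):
    """Zero-pad the leading number of a name such as '1_data-cleaning.R'."""
    number, rest = name.split("_", maxsplit=1)
    return f"{int(number):0{width}d}_{rest}"


# Default (alphabetical) ordering puts step 10 before step 1.
print(sorted(names))

# After padding, alphabetical ordering matches the intended order.
padded = sorted(left_pad(name) for name in names)
print(padded)
# ['01_data-cleaning.R', '02_fit-model.R', '10_final-figs-for-publication.R']
```

Pick the width from the largest number you expect: two digits cover up to 99 files, three digits up to 999.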
Recap: Plays well with default ordering
- Put something numeric first
- Use the ISO 8601 standard for dates
- Left pad other numbers with zeros
Recap
Three principles for (file) names
- Machine readable
- Human readable
- Plays well with default ordering
Pros:
- Easy to implement NOW
- Payoffs accumulate as your skills evolve and projects get more complex
Go forth and use awesome file names :)
Activity 3
Which of the following items describe three important rules for naming files on a computer? Select all correct answers.
A. File name parts should be easy to extract programmatically
B. Humans would understand something about file contents by looking at their names
C. Files with appropriate names would be nicely organized by default
D. File names should always use lower-case letters
E. Human understanding of file names should be prioritized over machine understanding
Activity 4
Which of the following are GOOD date formats to use in a filename?
A. 2020-01-26T0233
B. 20200126T0233
C. 2020-01-26
References
[1] Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A. (2018, April). The story in the notebook: Exploratory data science using a literate programming tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1-11).