Programming in Python for Data Science – Good Function Design Choices

How to write good functions

What makes a function useful?

Is a function more useful when it does more operations?

Does adding parameters make your functions more or less useful?

These are all questions we need to think about when writing functions.

1. Avoid “hard coding.”

Hard coding is the practice of embedding values directly in your code instead of storing them in variables.

def squares_a_list(numerical_list):
    new_squared_list = list()
    
    for number in numerical_list:
        new_squared_list.append(number ** 2)
    
    return new_squared_list

def exponent_a_list(numerical_list, exponent):
    new_exponent_list = list()
    
    for number in numerical_list:
        new_exponent_list.append(number ** exponent)
    
    return new_exponent_list

Hard-coding values into our code makes it less flexible.

Inflexible code can lead you to write more functions and/or violate the DRY (Don’t Repeat Yourself) principle.

This, in turn, can reduce readability and make the code harder to maintain. In short, hard coding is a breeding ground for bugs.

Remember our function squares_a_list()?

In this function, we hard-coded the value 2 when we calculated number ** 2.

There are a couple of approaches to improving the situation. One is to assign 2 to a variable in the function before doing the calculation. That way, if you need to reuse that number later on, you can just refer to the variable, and if you need to change the 2 to a 3, you only need to change it in one place. Another benefit is that the variable name acts as a little bit of documentation.
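As a minimal sketch of this first approach, the version below names the exponent inside the function. The variable name exponent is an illustrative choice, not something fixed by the original function.

```python
def squares_a_list(numerical_list):
    # Name the value once; changing 2 to 3 later means editing only this line.
    exponent = 2
    new_squared_list = list()

    for number in numerical_list:
        new_squared_list.append(number ** exponent)

    return new_squared_list


print(squares_a_list([1, 2, 3]))  # [1, 4, 9]
```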

The other approach is to turn the value into an argument like we did when we made exponent_a_list().

This new function now gives us more flexibility with our code.

If we now encounter a situation where we need to raise each element to a different exponent, such as 4 or 0, we can do so without writing new code and potentially introducing a new error.

We reduce our long term workload.
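For instance, the same function can be called with different exponents. The input list here is made up for illustration.

```python
def exponent_a_list(numerical_list, exponent):
    new_exponent_list = list()

    for number in numerical_list:
        new_exponent_list.append(number ** exponent)

    return new_exponent_list


numbers = [1, 2, 3]
print(exponent_a_list(numbers, 4))  # [1, 16, 81]
print(exponent_a_list(numbers, 0))  # [1, 1, 1]
```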

The first approach, assigning the value to a variable, makes the code more maintainable, but it doesn’t give the function caller any flexibility. Which approach you choose depends on how you expect your function to be used.

2. Less is more

import altair as alt
import pandas as pd

def load_filter_and_average(file, grouping_column, plotting_column):
    # Read the data, average every numeric column per group, then plot.
    df = pd.read_csv("data/" + file)
    source = df.groupby(grouping_column).mean(numeric_only=True).reset_index()
    chart = alt.Chart(source, width=500, height=300).mark_bar().encode(
        x=alt.X(grouping_column),
        y=alt.Y(plotting_column)
    )
    return chart

bad_idea = load_filter_and_average('cereal.csv', 'mfr', 'rating')
bad_idea

Although a one-stop-shop function that does everything you want may seem useful, it limits your ability to reuse the code inside it.

Ideally, functions should serve a single purpose.

For example, let’s say we have a function that reads in a CSV file, finds the mean of each group in a column, and plots a specified variable.

Although this may seem nice, we may want to break it up into multiple smaller functions. For example, what if we don’t want the plot? Perhaps the plot is something we wanted only once, and now we are committed to it every time we call the function.

Another problem with this function is that the means are computed but never returned; only the chart is. Thus, we have no way of accessing the statistics for further use in our code (we would have to repeat ourselves and rewrite the calculation).

def grouped_means(df, grouping_column):
    grouped_mean = df.groupby(grouping_column).mean(numeric_only=True).reset_index()
    return grouped_mean

cereal = pd.read_csv('data/cereal.csv')  # load the data once, outside the function
cereal_mfr = grouped_means(cereal, 'mfr')
cereal_mfr

	mfr	calories	protein	fat	...	shelf	weight	cups	rating
0	A	100.000000	4.000000	1.000000	...	2.000000	1.000000	1.000000	54.850917
1	G	111.363636	2.318182	1.363636	...	2.136364	1.049091	0.875000	34.485852
2	K	108.695652	2.652174	0.608696	...	2.347826	1.077826	0.796087	44.038462
...	...	...	...	...	...	...	...	...	...
4	P	108.888889	2.444444	0.888889	...	2.444444	1.064444	0.714444	41.705744
5	Q	95.000000	2.625000	1.750000	...	2.375000	0.875000	0.823750	42.915990
6	R	115.000000	2.500000	1.250000	...	2.000000	1.000000	0.871250	41.542997

7 rows × 14 columns

def plot_mean(df, grouping_column, plotting_column):
    # Plot only; the caller supplies an already-aggregated dataframe.
    chart = alt.Chart(df, width=500, height=300).mark_bar().encode(
        x=alt.X(grouping_column),
        y=alt.Y(plotting_column)
    )
    return chart

plot1 = plot_mean(cereal_mfr, 'mfr', 'rating')
plot1
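The benefit of the split can be sketched without pandas or Altair. In the standard-library version below, the function names, the text-based "plot", and the data are all hypothetical stand-ins. The point is only that the intermediate means can be reused without producing a plot.

```python
from collections import defaultdict


def grouped_means_from_rows(rows, grouping_column, value_column):
    """Return {group: mean of value_column} for a list of dict rows."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[grouping_column]].append(row[value_column])
    return {group: sum(values) / len(values)
            for group, values in values_by_group.items()}


def text_bars(means):
    """Return a crude text bar chart, one line per group."""
    return [f"{group} {'#' * round(mean)}" for group, mean in sorted(means.items())]


# Made-up rows, not taken from cereal.csv.
rows = [
    {"mfr": "A", "rating": 4.0},
    {"mfr": "B", "rating": 2.0},
    {"mfr": "B", "rating": 4.0},
]

means = grouped_means_from_rows(rows, "mfr", "rating")  # reusable on its own
best_group = max(means, key=means.get)                  # no plot required
bars = text_bars(means)                                 # plot only when wanted
```

Because each function does one job, the means can feed a comparison, a plot, or both, and neither step has to be rewritten.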

3. Return a single object

def load_filter_and_average(file, grouping_column, plotting_column):
    df = pd.read_csv("data/" + file)
    source = df.groupby(grouping_column).mean(numeric_only=True).reset_index()
    chart = alt.Chart(source, width=500, height=300).mark_bar().encode(
        x=alt.X(grouping_column),
        y=alt.Y(plotting_column)
    )
    # Returning two objects hands the caller a tuple.
    return chart, source

another_bad_idea = load_filter_and_average('cereal.csv', 'mfr', 'rating')
another_bad_idea

(alt.Chart(...),
    mfr    calories   protein  ...    weight      cups     rating
 0    A  100.000000  4.000000  ...  1.000000  1.000000  54.850917
 1    G  111.363636  2.318182  ...  1.049091  0.875000  34.485852
 2    K  108.695652  2.652174  ...  1.077826  0.796087  44.038462
 ..  ..         ...       ...  ...       ...       ...        ...
 4    P  108.888889  2.444444  ...  1.064444  0.714444  41.705744
 5    Q   95.000000  2.625000  ...  0.875000  0.823750  42.915990
 6    R  115.000000  2.500000  ...  1.000000  0.871250  41.542997
 
 [7 rows x 14 columns])

another_bad_idea[0]
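Returning two objects forces every caller to remember the tuple’s order and index into it, as another_bad_idea[0] does above. A minimal plain-Python sketch of the contrast, with hypothetical function names and made-up numbers:

```python
def mean_and_label(values):
    # Returns a tuple: callers must know which position holds what.
    mean = sum(values) / len(values)
    return mean, f"mean = {mean:.1f}"


def mean_of(values):
    # Returns a single object that can be used directly.
    return sum(values) / len(values)


def label_mean(mean):
    # Also returns a single object, built from the result of mean_of().
    return f"mean = {mean:.1f}"


both = mean_and_label([2, 4, 6])
print(both[0])                          # 4.0, reached by position
print(mean_of([2, 4, 6]) + 1)           # 5.0, no indexing needed
print(label_mean(mean_of([2, 4, 6])))   # mean = 4.0
```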

4. Keep global variables in their global environment

def grouped_means(df, grouping_column):
    grouped_mean = df.groupby(grouping_column).mean(numeric_only=True).reset_index()
    return grouped_mean

cereal = pd.read_csv('data/cereal.csv')

def bad_grouped_means(grouping_column):
    grouped_mean = cereal.groupby(grouping_column).mean(numeric_only=True).reset_index()
    return grouped_mean

bad_cereal_grouping = bad_grouped_means('mfr')
bad_cereal_grouping.head(3)

	mfr	calories	protein	...	weight	cups	rating
0	A	100.000000	4.000000	...	1.000000	1.000000	54.850917
1	G	111.363636	2.318182	...	1.049091	0.875000	34.485852
2	K	108.695652	2.652174	...	1.077826	0.796087	44.038462

3 rows × 14 columns

cereal = "let's change it to a string" 
bad_cereal_grouping = bad_grouped_means('mfr')

AttributeError: 'str' object has no attribute 'groupby'

Detailed traceback: 
  File "<string>", line 1, in <module>
  File "<string>", line 2, in bad_grouped_means
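The same failure can be reproduced without pandas. In this sketch, the names scores, bad_total and total are hypothetical. The function that reads a global breaks when the global is reassigned, while the version that takes its data as an argument is unaffected.

```python
scores = [1, 2, 3]


def bad_total():
    # Depends on whatever the global name `scores` refers to at call time.
    return sum(scores)


def total(values):
    # Depends only on its argument.
    return sum(values)


original_scores = scores
scores = "let's change it to a string"

try:
    bad_total()
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

print(outcome)                 # TypeError
print(total(original_scores))  # 6
```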