Methodology

To investigate how substance use on Reddit has changed over the pandemic, we first introduce the dataset and then conduct an exploratory data analysis (EDA) on several Reddit datasets, as described below.

Data set

The datasets were obtained from the Reddit Mental Health Dataset, which was processed and organised by Low et al. [LRT+20]. For each of 15 subreddits, they provide two CSV files, one per period (pre- and post-pandemic).

Each observation is a Reddit user’s post - a message written on a specific subreddit - which has been processed to extract features that are common in natural language processing.

Feature details

Feature	Description
author	author of the Reddit post
date	date of the Reddit post
post	raw post text
automated_readability_index	a readability metric for English text which measures the understandability of a text
coleman_liau_index	a readability metric for English text which measures the understandability of a text
flesch_kincaid_grade_level	a readability metric for English text which measures how difficult a piece of text is to understand
flesch_reading_ease	a readability metric for English text, computed as \(206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)\)
gulpease_index	a readability metric based on the length of words, the number of words, and the length of sentences
gunning_fog_index	a readability metric for English text which measures the understandability of a text
lix	a readability metric for English text which measures how difficult a piece of text is to understand
smog_index	a readability metric that measures how many years of education the average person needs to have to understand a text.
wiener_sachtextformel	a readability metric which measures the understandability of a text
n_chars	number of characters in post
n_long_words	number of long words in post
n_monosyllable_words	number of monosyllabic words in post
n_polysyllable_words	number of polysyllabic words in post
n_syllables	number of syllables in post
n_unique_words	number of unique words in post
n_words	number of words in post
sent_neg	negative sentiment score of post
sent_neu	neutral sentiment score of post
sent_pos	positive sentiment score of post
economic_stress_total	count of mentions of economic stress in post
isolation_total	count of mentions of isolation in post
substance_use_total	count of mentions of substance use in post
guns_total	count of mentions of guns in post
domestic_stress_total	count of mentions of domestic stress in post
suicidality_total	count of mentions of suicide in post
punctuation	count of punctuation in post
LIWC-based metrics	Linguistic Inquiry and Word Count - a metric derived from the degree to which various categories of words are used in a text
TF-IDF-based metrics	Term frequency–inverse document frequency - a statistic that reflects how important a word is to a piece of text within a collection of texts
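As an illustration of how one of the readability metrics in the table is computed, the following is a minimal sketch of the flesch_reading_ease formula. The function name and the example counts are our own assumptions for illustration; the dataset ships this score precomputed, and the sentence count it relies on is not one of the listed columns.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch reading ease from raw counts, following the formula in the table."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Invented counts for a hypothetical post: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
print(round(score, 3))
```

Longer sentences and longer words both lower the score, so higher values correspond to text that is easier to read.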

The extracted feature groups are as follows (n is the number of columns):

LIWC (n=62);
sentiment analysis (n=4);
basic word and syllable counts (n=8);
punctuation (n=1);
readability metrics (n=9);
term frequency–inverse document frequency (TF-IDF) n-grams (n=256-1024) to capture words and phrases that characterize specific posts;
manually built lexicons about suicidality (n=1), economic stress (n=1), isolation (n=1), substance use (n=1), domestic stress (n=1), and guns (n=1).
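To give an idea of what a lexicon-based count such as substance_use_total represents, here is a minimal sketch. The three-term lexicon, the function name, and the whole-word matching rule are all assumptions made for illustration; they are not the lexicon or the matching procedure used to build the dataset.

```python
import re

# Hypothetical mini-lexicon, for illustration only; not the lexicon behind the dataset.
TOY_LEXICON = ["alcohol", "drunk", "withdrawal"]


def count_lexicon_mentions(text, lexicon):
    """Count whole-word, case-insensitive occurrences of any lexicon term in text."""
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in lexicon) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


example_post = "Alcohol again. I was drunk, then withdrawal hit; no alcohol today."
example_total = count_lexicon_mentions(example_post, TOY_LEXICON)
print(example_total)
```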

In addition to these extracted features, each observation includes:

author (Reddit user name)
date
post

Data cleaning/transformation

To answer the question “How has substance use increased over the pandemic?”, we select substance_use_total as the target feature and clean the data to focus exclusively on it.

The Python programming language [pyp] and the Pandas library [pdt20] were used to perform the data cleaning process.

Data Cleaning Steps

Combine the pre- and post-pandemic datasets into one dataset.
Filter the dataset to keep only the columns of interest: subreddit, author, date, post, and substance_use_total.
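The two steps above can be sketched with Pandas as follows. This is not the code of our cleaning script; the toy rows, the subreddit name, and the variable names are invented for illustration, and in practice the two frames would come from the CSV files.

```python
import pandas as pd

# Toy stand-ins for one pre-pandemic and one post-pandemic file (invented values).
pre_df = pd.DataFrame({
    "subreddit": ["example_subreddit", "example_subreddit"],
    "author": ["user_a", "user_b"],
    "date": ["2019-03-01", "2019-03-02"],
    "post": ["first post", "second post"],
    "substance_use_total": [2, 0],
    "n_words": [2, 2],
})
post_df = pd.DataFrame({
    "subreddit": ["example_subreddit"],
    "author": ["user_c"],
    "date": ["2020-03-15"],
    "post": ["third post"],
    "substance_use_total": [1],
    "n_words": [2],
})

COLUMNS_OF_INTEREST = ["subreddit", "author", "date", "post", "substance_use_total"]

# Step 1: stack the two periods into a single frame with a fresh index.
combined = pd.concat([pre_df, post_df], ignore_index=True)

# Step 2: keep only the columns needed for the analysis.
cleaned = combined[COLUMNS_OF_INTEREST].copy()
print(cleaned)
```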

Feature Engineering

We add a new feature, period, which indicates the timeframe of each post (before or after the pandemic). This makes the two timeframes easier to distinguish and compare in later parts of this report.
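A minimal sketch of this labelling step is shown below. The helper name add_period_label, the label values "pre" and "post", and the toy data are assumptions for illustration rather than the exact contents of our script.

```python
import pandas as pd


def add_period_label(frames_by_period):
    """Tag every row with the period of the frame it came from, then stack the frames."""
    labelled = [df.assign(period=period) for period, df in frames_by_period.items()]
    return pd.concat(labelled, ignore_index=True)


# Invented toy data standing in for the pre- and post-pandemic files.
pre_df = pd.DataFrame({"post": ["a", "b"], "substance_use_total": [1, 0]})
post_df = pd.DataFrame({"post": ["c"], "substance_use_total": [3]})

data = add_period_label({"pre": pre_df, "post": post_df})

# With the label in place, comparing the two timeframes is a one-line group-by.
mean_by_period = data.groupby("period")["substance_use_total"].mean()
print(mean_by_period)
```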

Room for improvement

Our process_raw.py script currently wraps file loading in a try-except block so that it runs to completion. The block lets the script load every data file while skipping files named .DS_Store, which the macOS Finder creates automatically in directories it has browsed.
Given more time, we would enhance process_raw.py so that it no longer relies on the try-except block.
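One possible direction, sketched under the assumption that the raw data sits in a single directory of CSV files, is to select the files to load by extension so that .DS_Store is never opened in the first place. The helper name list_csv_files and the file names below are hypothetical and do not come from process_raw.py.

```python
import tempfile
from pathlib import Path


def list_csv_files(data_dir):
    """Return the CSV files in data_dir, sorted by name; other entries are ignored."""
    return sorted(
        path
        for path in Path(data_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".csv"
    )


# Demonstration on a throwaway directory containing a stray .DS_Store file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a_pre.csv", "a_post.csv", ".DS_Store"]:
        (Path(tmp) / name).write_text("")
    found = [path.name for path in list_csv_files(tmp)]

print(found)
```

Filtering up front keeps genuine read errors visible, whereas a broad try-except can hide them along with the .DS_Store failures.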
