Methodology
To investigate how substance use on Reddit has changed over the pandemic, we first introduce the dataset and then conduct an exploratory data analysis (EDA) on several Reddit datasets, as described below.
Data set
The datasets were obtained from the Reddit Mental Health Dataset, a dataset processed and organised by Low et al. [LRT+20]. They provide two CSV files, one for each period (pre- and post-pandemic), for each of 15 different subreddits.
Each observation is a Reddit user’s post - a message written on a specific subreddit - which has been processed to extract features that are common in natural language processing.
Feature details
| Feature | Description |
|---|---|
| author | author of the Reddit post |
| date | date of the Reddit post |
| post | raw post text |
| automated_readability_index | a readability metric for English text which measures the understandability of a text |
| coleman_liau_index | a readability metric for English text which measures the understandability of a text |
| flesch_kincaid_grade_level | a readability metric for English text which measures how difficult a piece of text is to understand |
| flesch_reading_ease | \(206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)\) |
| gulpease_index | a readability metric based on the length of words, the number of words, and the length of sentences |
| gunning_fog_index | a readability metric for English text which measures the understandability of a text |
| lix | a readability metric for English text which measures how difficult a piece of text is to understand |
| smog_index | a readability metric that measures how many years of education the average person needs to have to understand a text |
| wiener_sachtextformel | a readability metric which measures the understandability of a text |
| n_chars | number of characters in post |
| n_long_words | number of long words in post |
| n_monosyllable_words | number of monosyllabic words in post |
| n_polysyllable_words | number of polysyllabic words in post |
| n_syllables | number of syllables in post |
| n_unique_words | number of unique words in post |
| n_words | number of words in post |
| sent_neg | negative sentiment score of post |
| sent_neu | neutral sentiment score of post |
| sent_pos | positive sentiment score of post |
| economic_stress_total | count of mentions of economic stress in post |
| isolation_total | count of mentions of isolation in post |
| substance_use_total | count of mentions of substance abuse in post |
| guns_total | count of mentions of guns in post |
| domestic_stress_total | count of mentions of domestic stress in post |
| suicidality_total | count of mentions of suicide in post |
| punctuation | count of punctuation in post |
| LIWC-based metrics | Linguistic Inquiry and Word Count - a metric derived from the degree to which various categories of words are used in a text |
| TF-IDF-based metrics | Term frequency–inverse document frequency - a statistic that tries to reflect how important a word is in a piece of text |
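As a worked instance of the flesch_reading_ease formula in the table, the following minimal sketch evaluates it from raw counts. The function name, its parameters, and the example counts are illustrative assumptions; the values in the dataset were computed upstream by Low et al., and this sketch is not claimed to reproduce their tooling.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Evaluate the Flesch reading ease formula from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("total_words and total_sentences must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical post: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
```

Higher scores correspond to text that is easier to read; longer sentences and longer words both pull the score down.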
The feature extractions are as follows (n is the number of columns):

- LIWC (n=62);
- sentiment analysis (n=4);
- basic word and syllable counts (n=8);
- punctuation (n=1);
- readability metrics (n=9);
- term frequency–inverse document frequency (TF-IDF) ngrams (256-1024) to capture words and phrases that characterize specific posts;
- manually built lexicons about suicidality (n=1), economic stress (n=1), isolation (n=1), substance use (n=1), domestic stress (n=1), and guns (n=1).
Alongside these features, each observation also includes:

- author (Reddit user name)
- date
- post
Data cleaning/transformation
To answer the question “How has substance use increased over the pandemic?”, we select the feature `substance_use_total` as the target feature and clean the data to focus exclusively on it.
The Python programming language [pyp] and the Pandas library [pdt20] were used to perform the data cleaning process.
Data Cleaning Steps
1. Combine the `pre` and `post` datasets into one dataset.
2. Filter the dataset to keep only the columns of interest: `subreddit`, `author`, `date`, `post`, and `substance_use_total`.
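The two cleaning steps can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the two small DataFrames are hypothetical stand-ins for one subreddit's pre- and post-pandemic CSV files, and `combine_and_filter` is an illustrative helper name.

```python
import pandas as pd

# Columns of interest named in the cleaning steps.
COLUMNS = ["subreddit", "author", "date", "post", "substance_use_total"]

# Hypothetical stand-ins for the loaded pre- and post-pandemic CSV files.
pre = pd.DataFrame({
    "subreddit": ["addiction"],
    "author": ["user_a"],
    "date": ["2019/01/05"],
    "post": ["example text"],
    "n_words": [2],                # an extra feature that will be dropped
    "substance_use_total": [1],
})
post = pd.DataFrame({
    "subreddit": ["addiction"],
    "author": ["user_b"],
    "date": ["2020/03/20"],
    "post": ["another example"],
    "n_words": [2],
    "substance_use_total": [3],
})


def combine_and_filter(frames, columns=COLUMNS):
    """Stack the frames row-wise, then keep only the requested columns."""
    combined = pd.concat(frames, ignore_index=True)
    return combined[columns]


cleaned = combine_and_filter([pre, post])
```

Selecting columns after concatenation keeps the filter in one place, at the cost of briefly holding all feature columns in memory.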
Feature Engineering
We add a new feature, `period`, that indicates the timeframe of each post (before or after the pandemic). This makes the two timeframes explicit in the data and easier to compare in later parts of this report.
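One way to add such a `period` column is to label each dataset before the two are concatenated. The sketch below assumes this ordering and the label values "pre" and "post"; the helper `add_period` and the sample values are hypothetical, and the original script may derive the column differently.

```python
import pandas as pd


def add_period(df, period):
    """Return a copy of df with a constant `period` column."""
    return df.assign(period=period)


# Hypothetical sample values for illustration only.
pre = pd.DataFrame({"substance_use_total": [0, 2]})
post = pd.DataFrame({"substance_use_total": [1, 3]})

labelled = pd.concat(
    [add_period(pre, "pre"), add_period(post, "post")],
    ignore_index=True,
)

# With the label in place, the two timeframes can be compared directly.
mean_by_period = labelled.groupby("period")["substance_use_total"].mean()
```

Labelling before concatenation avoids having to infer the period from the post date afterwards.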
Room for improvement
In our `process_raw.py` script, we used a `try-except` block to make the script runnable. The block lets the script load every file in the data directory except `.DS_Store` files, which macOS Finder creates automatically in browsed directories.

Given more time, we would improve the `process_raw.py` script so that it no longer relies on the `try-except` block.
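A possible way to avoid the `try-except` block is to select only CSV files when listing the data directory, so that `.DS_Store` is never passed to the loader. The sketch below is an assumption about how this could look, not the contents of `process_raw.py`; the helper `list_csv_files` and the file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def list_csv_files(data_dir):
    """Return the CSV files directly inside data_dir, sorted by path.

    Matching on the .csv suffix and skipping hidden files means that
    Finder's .DS_Store files are never selected.
    """
    return sorted(
        path
        for path in Path(data_dir).glob("*.csv")
        if path.is_file() and not path.name.startswith(".")
    )


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("example_pre.csv", "example_post.csv", ".DS_Store"):
        (Path(tmp) / name).write_text("")
    found = [path.name for path in list_csv_files(tmp)]
```

Each returned path could then be passed to `pandas.read_csv` without any exception handling for unreadable files.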