Results

For each subreddit-specific dataset of processed data, we computed the Wilcoxon rank-sum statistic (also referred to as the Mann-Whitney-Wilcoxon rank-sum test) to compare the median number of references to substance abuse per reddit-post in the ‘pre-COVID’ and ‘post-COVID’ datasets. For this test we use the ranksums function from SciPy's stats package [VGO+20].

According to the SciPy documentation:

We can test the hypothesis that two independent unequal-sized samples are drawn from the same distribution with computing the Wilcoxon rank-sum statistic.

And according to this iSixSigma article:

The Mann-Whitney test compares the medians from two populations and works when the Y variable is continuous, discrete-ordinal or discrete-count, and the X variable is discrete with two attributes.

This test may not fit our use case exactly, for reasons we have not been able to identify. At this point, however, we have no better guidance for choosing a test more suited to detecting a statistically significant difference in medians between these two datasets.
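To make the statistic concrete, the following is a minimal sketch of how a rank-sum z-statistic and its two-sided p-value can be computed with the normal approximation. It is illustrative only: the helper names average_ranks and rank_sum_z are hypothetical, no tie correction is applied to the variance, and the analysis itself relies on SciPy's ranksums rather than on this code.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_z(x, y):
    """Rank-sum z-statistic for x versus y and its two-sided p-value."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])              # sum of ranks of the first sample
    expected = n1 * (n1 + n2 + 1) / 2         # mean of rank_sum_x under H0
    std = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # std. dev. under H0, no tie correction
    z = (rank_sum_x - expected) / std
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical toy samples, not data from this study.
z, p = rank_sum_z([1, 2, 3], [4, 5, 6])
print(f"z = {z:.4f}, p = {p:.4f}")
```

With this definition, a negative statistic means that values in the first sample tend to rank lower than values in the second, and a positive statistic means the opposite.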

We set up our hypotheses for the Wilcoxon rank-sum test as follows:

\(H_0:\) median number of references to substance abuse per reddit-post is the same in subreddit-specific ‘pre-COVID’ and ‘post-COVID’ datasets.

\(H_a:\) median number of references to substance abuse per reddit-post is not the same in subreddit-specific ‘pre-COVID’ and ‘post-COVID’ datasets.
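For a single subreddit, a test of these hypotheses might be run as in the sketch below. The two lists of per-post counts are hypothetical placeholders rather than our data, and the variable names are assumptions made for illustration; only the call to scipy.stats.ranksums reflects the method named above.

```python
from scipy.stats import ranksums

# Hypothetical per-post counts of substance-abuse references for one subreddit.
pre_covid_counts = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3]
post_covid_counts = [1, 0, 2, 1, 0, 4, 2, 1, 0, 2, 3, 1]

# Two-sided test by default; the samples may differ in size.
result = ranksums(pre_covid_counts, post_covid_counts)
print(f"test_statistic = {result.statistic:.6f}, p_value = {result.pvalue:.6e}")

ALPHA = 0.05
reject_h0 = result.pvalue < ALPHA  # True means H_0 is rejected at level ALPHA
print("reject H_0" if reject_h0 else "fail to reject H_0")
```

The sign of the statistic depends on the order of the arguments: swapping the two samples flips the sign but leaves the two-sided p-value unchanged.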

    subreddit_topic  test_statistic       p_value
0     bipolarreddit       -0.692095  4.888776e-01
1       EDAnonymous        1.922717  5.451560e-02
2     socialanxiety        0.990075  3.221376e-01
3        alcoholism       -1.407757  1.592032e-01
4            lonely       -2.041229  4.122808e-02
5     healthanxiety        0.184029  8.539905e-01
6              ptsd       -0.551019  5.816208e-01
7      suicidewatch        1.643746  1.002287e-01
8         addiction        0.481856  6.299085e-01
9               bpd       -0.454428  6.495209e-01
10           autism        1.358547  1.742903e-01
11    schizophrenia        0.246796  8.050664e-01
12             adhd        2.955800  3.118593e-03
13       depression        1.693505  9.035944e-02
14          anxiety        6.492918  8.418971e-11

Given a standard significance threshold of \(\alpha = 0.05\), the conclusions we may be able to draw from these results are:

  • r/adhd, r/lonely, and r/anxiety (each with a p-value below 0.05) saw a statistically significant difference in the median number of references to substance abuse per reddit-post when comparing the ‘pre-COVID’ and ‘post-COVID’ datasets.

  • the remaining subreddits tested showed no statistically significant difference in the median number of references to substance abuse per reddit-post when comparing the ‘pre-COVID’ and ‘post-COVID’ datasets.
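The threshold can also be applied mechanically to the p-values reported in the table above. The sketch below copies those p-values into a dictionary (the names p_values and significant are ours, chosen for illustration) and lists the subreddits whose p-value falls below \(\alpha\).

```python
ALPHA = 0.05

# p-values copied from the results table, keyed by subreddit_topic.
p_values = {
    "bipolarreddit": 4.888776e-01,
    "EDAnonymous": 5.451560e-02,
    "socialanxiety": 3.221376e-01,
    "alcoholism": 1.592032e-01,
    "lonely": 4.122808e-02,
    "healthanxiety": 8.539905e-01,
    "ptsd": 5.816208e-01,
    "suicidewatch": 1.002287e-01,
    "addiction": 6.299085e-01,
    "bpd": 6.495209e-01,
    "autism": 1.742903e-01,
    "schizophrenia": 2.467960e-01 if False else 8.050664e-01,
    "adhd": 3.118593e-03,
    "depression": 9.035944e-02,
    "anxiety": 8.418971e-11,
}

# Subreddits for which H_0 is rejected at the chosen significance level.
significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("significant:", significant)
print("not significant:", not_significant)
```

Note that r/EDAnonymous, with a p-value of about 0.0545, falls just above the threshold and is therefore counted among the non-significant results.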