Results
For each subreddit-specific dataset of processed data, we computed the Wilcoxon rank-sum statistic (also referred to as the Mann-Whitney-Wilcoxon rank-sum statistic) to compare the median number of references to substance abuse per Reddit post between the ‘pre-COVID’ and ‘post-COVID’ datasets. For this test we use the ranksums function from SciPy's stats package [VGO+20].
According to the SciPy documentation:
We can test the hypothesis that two independent unequal-sized samples are drawn from the same distribution with computing the Wilcoxon rank-sum statistic.
And according to this iSixSigma article:
The Mann-Whitney test compares the medians from two populations and works when the Y variable is continuous, discrete-ordinal or discrete-count, and the X variable is discrete with two attributes.
This test may not be an exact fit for our use case. However, we have no better guidance at this point on choosing a more suitable test for detecting a statistically significant difference in medians between these two datasets.
For the Wilcoxon rank-sum test, we set up our hypotheses as follows:
\(H_0:\) the median number of references to substance abuse per Reddit post is the same in the subreddit-specific ‘pre-COVID’ and ‘post-COVID’ datasets.
\(H_a:\) the median number of references to substance abuse per Reddit post is not the same in the subreddit-specific ‘pre-COVID’ and ‘post-COVID’ datasets.
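The test described above can be sketched as follows. This is a minimal illustration, not the exact code used for the results: the helper name compare_pre_post and the synthetic Poisson-distributed counts are assumptions made for the example, and only scipy.stats.ranksums comes from the text.

```python
import numpy as np
from scipy.stats import ranksums


def compare_pre_post(pre_counts, post_counts, alpha=0.05):
    """Two-sided Wilcoxon rank-sum test on two independent samples.

    pre_counts / post_counts: per-post counts of substance-abuse references
    (hypothetical inputs; the samples may differ in size).
    """
    statistic, p_value = ranksums(pre_counts, post_counts)
    return {
        "test_statistic": float(statistic),
        "p_value": float(p_value),
        "reject_h0": bool(p_value < alpha),
    }


# Synthetic, illustrative data only (not the study's data).
rng = np.random.default_rng(0)
pre = rng.poisson(lam=1.0, size=200)
post = rng.poisson(lam=1.0, size=300)

print(compare_pre_post(pre, post))
```

With ranksums(x, y), a negative statistic indicates that values in the first sample tend to rank lower than those in the second, so the sign in the results depends on the order in which the two datasets are passed.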
| | subreddit_topic | test_statistic | p_value |
|---|---|---|---|
| 0 | bipolarreddit | -0.692095 | 4.888776e-01 |
| 1 | EDAnonymous | 1.922717 | 5.451560e-02 |
| 2 | socialanxiety | 0.990075 | 3.221376e-01 |
| 3 | alcoholism | -1.407757 | 1.592032e-01 |
| 4 | lonely | -2.041229 | 4.122808e-02 |
| 5 | healthanxiety | 0.184029 | 8.539905e-01 |
| 6 | ptsd | -0.551019 | 5.816208e-01 |
| 7 | suicidewatch | 1.643746 | 1.002287e-01 |
| 8 | addiction | 0.481856 | 6.299085e-01 |
| 9 | bpd | -0.454428 | 6.495209e-01 |
| 10 | autism | 1.358547 | 1.742903e-01 |
| 11 | schizophrenia | 0.246796 | 8.050664e-01 |
| 12 | adhd | 2.955800 | 3.118593e-03 |
| 13 | depression | 1.693505 | 9.035944e-02 |
| 14 | anxiety | 6.492918 | 8.418971e-11 |
With a standard significance threshold of \(\alpha = 0.05\), we can draw the following conclusions from these results:
r/adhd, r/lonely, and r/anxiety showed a statistically significant difference in the median number of references to substance abuse per Reddit post between the ‘pre-COVID’ and ‘post-COVID’ datasets (p ≈ 0.0031, 0.041, and 8.4e-11, respectively).
The remaining 12 subreddits showed no statistically significant difference in the median number of references to substance abuse per Reddit post between the ‘pre-COVID’ and ‘post-COVID’ datasets.
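As a quick check on the threshold step, the reported p-values can be compared against \(\alpha\) directly. The sketch below copies the p-values from the table above; the variable names are illustrative assumptions.

```python
ALPHA = 0.05  # significance threshold stated in the text

# p-values copied from the results table above
p_values = {
    "bipolarreddit": 4.888776e-01,
    "EDAnonymous": 5.451560e-02,
    "socialanxiety": 3.221376e-01,
    "alcoholism": 1.592032e-01,
    "lonely": 4.122808e-02,
    "healthanxiety": 8.539905e-01,
    "ptsd": 5.816208e-01,
    "suicidewatch": 1.002287e-01,
    "addiction": 6.299085e-01,
    "bpd": 6.495209e-01,
    "autism": 1.742903e-01,
    "schizophrenia": 8.050664e-01,
    "adhd": 3.118593e-03,
    "depression": 9.035944e-02,
    "anxiety": 8.418971e-11,
}

# Subreddits whose p-value falls below the threshold
significant = sorted(name for name, p in p_values.items() if p < ALPHA)
print(significant)  # ['adhd', 'anxiety', 'lonely']
```

r/EDAnonymous (p ≈ 0.0545) falls just above the threshold and is therefore not counted as significant.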