Summary

In this analysis project, we asked whether PM2.5 air pollution levels in Beijing, China improved between 2013 and 2017, based on hourly PM2.5 data collected from twelve testing stations in Beijing. To do so, we performed a one-tailed permutation test for a difference in medians between two intervals, time_A (March 2013 - February 2015) and time_B (March 2015 - February 2017). The result reflects only the air quality trend in Beijing from 2013 to 2017, and it does not provide statistically significant evidence of a decrease in PM2.5 particulate measurements.

Introduction

The objective of this project is to answer the following inferential question: Do PM2.5 measurements collected in Beijing, China from 2013 to 2017 show any sign of improvement?

Beijing, China has long struggled with poor air quality, a result of the country’s rapid industrialization and its heavy reliance on coal for electricity generation, as well as its growing and increasingly urban middle class (Wang and Hao 2012). In September 2021, the World Health Organization revised its air quality guidelines to more restrictive levels, following increasingly clear evidence of a causal relationship between poor air quality and harmful health consequences for the affected, mainly urban, communities (WHO 2021).

We analyse the Beijing Air Quality data set, donated to the UC Irvine Machine Learning Repository in 2019 (accessible via URL), which comprises hourly measurements of six air pollutants (PM2.5, PM10, SO2, NO2, CO, O3) and six meteorological variables, spanning 2013 to 2017 across twelve data-collecting stations in the Beijing metropolitan area. While the structure of this data set makes it suitable for multivariate time-series regression analysis, we focus our analysis solely on the readings of PM2.5, a form of fine particulate matter that is considered especially harmful for its ability to penetrate deep into the lungs and cause long-lasting damage to the respiratory system (Xing et al. 2016).

The report for our exploratory data analysis can be found here.

Methodology

We combine the data from the twelve data collection stations across Beijing into one data frame and split it into the following two time frames:

  • time_A: March 2013 - February 2015

  • time_B: March 2015 - February 2017

We chose to divide our data set into two equal time intervals in order to see whether there is a general change in PM2.5 levels between the two periods. We considered other ways of dividing the data set, such as by year or according to the timing of government policies and their implementation. We decided against these because we recognise that there is a time lag before such policies produce tangible change.
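As a minimal sketch of this split, the function below assigns a reading to time_A or time_B from its date. It uses the interval boundaries stated in the Summary. The function name `assign_interval` and the constant names are hypothetical and are not taken from the project's own code.

```python
from datetime import date

# Interval boundaries as stated in the report:
# time_A = March 2013 - February 2015, time_B = March 2015 - February 2017.
START = date(2013, 3, 1)
SPLIT = date(2015, 3, 1)
END = date(2017, 3, 1)  # exclusive upper bound


def assign_interval(year: int, month: int, day: int = 1) -> str:
    """Return 'time_A' or 'time_B' for a reading taken on the given date."""
    d = date(year, month, day)
    if not (START <= d < END):
        raise ValueError(f"{d} is outside the study period")
    return "time_A" if d < SPLIT else "time_B"


print(assign_interval(2015, 2, 28))  # last day of time_A
print(assign_interval(2015, 3, 1))   # first day of time_B
```

Using a single cut-off date keeps both intervals exactly two years long, which matches the equal-interval design described above.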

Through Exploratory Data Analysis, we identified that our data is heavily skewed to the right. As such, we determined that it is best to analyse the median of each time interval, rather than estimates such as the mean, which are more sensitive to extreme values.
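To illustrate why the median suits right-skewed data, the small example below compares the mean and the median of a handful of made-up readings. The values are purely illustrative and are not taken from the Beijing data set.

```python
from statistics import mean, median

# Illustrative right-skewed readings (made-up values, not from the data set):
# most are moderate, one is an extreme pollution episode.
readings = [10, 20, 30, 40, 500]

skew_mean = mean(readings)      # pulled upwards by the single extreme value
skew_median = median(readings)  # stays at the centre of the typical readings

print(skew_mean, skew_median)
```

Here the mean is 120 while the median is 30, so a single extreme reading shifts the mean far from the typical values but leaves the median unchanged.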

Figure 2. Both time_A and time_B distributions are right-skewed

We answer the main question of this project using the following methodology. We assume that the data points in both samples are independent and identically distributed (i.i.d.), given that the hourly readings are spread over a multi-year time span. Under this assumption, we perform a one-tailed hypothesis test that compares the PM2.5 measurements in two equal-length time intervals (time_A and time_B) to determine whether there is statistical evidence of an improvement in PM2.5 measurements in Beijing between 2013 and 2017.

Hypothesis Testing

We use a permutation test for a difference in medians to answer the main question, using a significance level of \(\alpha\) = 0.05.

  • Null Hypothesis (\(H_0\)): The median PM2.5 value in Beijing in time_A is less than or equal to the median PM2.5 value in time_B (\(Q_{A}(0.5) \leq Q_{B}(0.5)\))

  • Alternative Hypothesis (\(H_A\)): The median PM2.5 value in Beijing in time_A is greater than the median PM2.5 value in time_B (\(Q_{A}(0.5) \gt Q_{B}(0.5)\))
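The test described above can be sketched in plain Python as follows. This is an illustrative sketch, not the project's actual implementation: the function name `permutation_test_median`, the number of repetitions, the seed, and the synthetic samples are all assumptions made for the example.

```python
import random
from statistics import median


def permutation_test_median(sample_a, sample_b, reps=2000, seed=0):
    """One-sided permutation test of H_A: median(A) > median(B).

    Returns the observed difference median(A) - median(B) and the p-value,
    i.e. the share of shuffled differences at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = median(sample_a) - median(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_large = 0
    for _ in range(reps):
        # Shuffling the pooled values breaks any link between value and interval.
        rng.shuffle(pooled)
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / reps


# Synthetic samples (not real PM2.5 data): A lies entirely below B.
low = list(range(1, 51))
high = list(range(101, 151))
obs_diff, p_value = permutation_test_median(low, high)
print(obs_diff, p_value)
```

In this synthetic case the observed difference is the smallest value any reshuffling can produce, so every permuted difference is at least as large and the p-value is exactly 1. More generally, a one-sided p-value close to 1 indicates that the observed difference points in the direction opposite to \(H_A\).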

Results and Discussion

Figure 3. Violin Plot showing that there is a difference in median PM2.5 value in time_A and time_B

After conducting the permutation test on the difference in medians, we obtain a p-value of 1, which is greater than the significance level \(\alpha\) = 0.05. The violin plot highlights how close the two median values are: the orange line in the plot indicates the estimate for each time interval (time_A and time_B).

These results answer our main question as follows: we do not have enough statistical evidence to reject the null hypothesis, \(H_0\). Hence, there is not enough statistical evidence to suggest that there has been a decrease in median PM2.5 measurement between time_A and time_B in Beijing, China. In other words, the data are consistent with median PM2.5 measurements having remained unchanged or increased between these two time intervals, so air quality has at best not improved.

A limitation of our data set is that a proportion of the PM2.5 values is missing. As indicated in the EDA, 8739 of the 420768 rows in the air_data data set have a missing PM2.5 value, which is 2.08% of our data. We cannot be certain how much this affects the test statistic and, ultimately, the p-value. However, it might have reduced the statistical power of our study or biased our estimates, which could undermine the validity of our conclusion.
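The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below recomputes the missing share and compares the row count with the number of hourly slots in the study period. It assumes one row per station per hour from 1 March 2013 up to, but not including, 1 March 2017; that assumption is ours and is not stated in the report.

```python
from datetime import datetime

# Figures reported in the text.
missing_rows = 8739
total_rows = 420768
n_stations = 12

# Assumption: one row per station per hour, from 1 March 2013
# up to (but not including) 1 March 2017.
period = datetime(2017, 3, 1) - datetime(2013, 3, 1)
hours = int(period.total_seconds() // 3600)
expected_rows = hours * n_stations

missing_pct = 100 * missing_rows / total_rows
print(expected_rows, round(missing_pct, 2))
```

Under that assumption the expected row count matches the reported 420768 rows, and the missing share rounds to the reported 2.08%.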

One possible extension of our research would be to obtain data from 2018 onwards. This would allow us to conduct the hypothesis test over a longer time period and see whether there is a broader change in PM2.5 levels.

Other extensions of our research could include: (1) splitting the 12 stations into two groups based on their relative distance from the city centre, to determine whether there is a difference in PM2.5 levels between the metropolis and the outskirts; (2) finding data from other cities in China, to expand our question to whether PM2.5 levels improved across China between 2013 and 2017.

References

Wang, Shuxiao, and Jiming Hao. 2012. “Air Quality Management in China: Issues, Challenges, and Options.” Journal of Environmental Sciences 24 (1): 2–13.
WHO. 2021. “WHO Global Air Quality Guidelines: Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide: Executive Summary.”
Xing, Yu-Fei, Yue-Hua Xu, Min-Hua Shi, and Yi-Xin Lian. 2016. “The Impact of PM2.5 on the Human Respiratory System.” Journal of Thoracic Disease 8 (1): E69.