This is a data analysis project for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.
After the state proclaimed the restoration of indoor dining during the COVID-19 era, hundreds of new restaurants opened throughout New York City (Eater NY, 2020). Now that government restrictions are being lifted and people are returning to dining out as the hospitality sector reopens, the general safety of restaurants has taken on renewed importance given the current state of COVID-19. The standards that health inspectors use for grading will probably need to be revised, because health rules have become more stringent in order to curb the pandemic. The overall process used for health inspections, which may vary by state, is as follows:
(Source: SmartSense, 2018)
As data scientists, we are curious about how we can evaluate and predict a restaurant’s general level of quality so that we can recommend restaurants that are safe to dine in, by classifying each restaurant as “good” or “poor” (in our case, Grade A vs. Grade B/C). Since we have access to restaurant data for New York City, we concentrate our analysis on predicting whether restaurants in specific NYC locations are graded Good or Poor, with plans to eventually expand to other metropolitan regions. We believe our work could be useful to residents of and visitors to New York City, serving as a one-stop resource for people who want to dine out without having to worry about quality.
Research question:
Can we predict the grade for a restaurant (Grade A or F) given different metrics describing its health violations during a routine inspection?
Besides this main research question, we would also like to address some interesting sub-questions, given below:
The data set we are using for restaurant grading, DOHMH New York City Restaurant Inspection Results, is sourced from the NYC OpenData Portal. It was obtained from Thomas Mock’s tidytuesday repository (Mock, 2022). The original data set can be found here.
Summary -
The data includes all violation citations from restaurant inspections conducted in New York City from 2012 to 2018. Each row represents a restaurant that has undergone a health inspection and contains information about the establishment (restaurant name, phone number, location (borough, building number, street, zip code), and cuisine type) as well as details about the inspection itself (date, violation code and description, whether any violations were cited, whether they were critical, etc.). Restaurants may receive an official grade of A, B, or C; alternatively, they may receive a Z or P for an evaluation that is still pending. A complete dictionary of the data can be found here.
We performed exploratory data analysis on the restaurant data set and noticed that there were roughly 300,000 inspections in total, of which only 151,451 had a value filled in for the grade column that we are interested in.
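As a rough illustration, these counts (and the class counts summarized in Table 1.1 below) could be reproduced with a short pandas check. This is a sketch only; the file path and the `grade` column name are assumptions about the cleaned tidytuesday data, not the project’s actual script.

```python
import pandas as pd

# Hypothetical path to the tidytuesday NYC inspection results.
inspections = pd.read_csv("data/nyc_restaurants.csv")

# How many inspections have an official grade recorded?
n_total = len(inspections)
n_graded = inspections["grade"].notna().sum()
print(f"{n_graded:,} of {n_total:,} inspections have a grade recorded")

# Class balance among the graded inspections.
print(inspections["grade"].value_counts(normalize=True))
```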
Table 1.1 Counts of inspections in the training data by class.
As we can see from the above table, there is a significant class imbalance: 79.8% of the inspections are graded as A. Hence, we have chosen to approach our research question as a binary classification problem, where the outcome determines whether a restaurant should be graded A (Pass) or F (Fail, combining the B and C grades) based on the standards that are set. We have excluded restaurants with a pending grade from training and will instead treat them as deployment data on which to predict grades using our model.
We performed the rest of our analysis on the training data, after splitting the initial data set so that 75% of the data goes to the training set and the remaining 25% is held out to validate how the model performs on restaurants it has not seen, based on the inspection features that we have.
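A minimal sketch of the target construction and split described above, assuming the cleaned data frame from the previous snippet; the pending-grade codes, random seed, and exact filtering shown here are illustrative assumptions rather than the project’s actual script.

```python
from sklearn.model_selection import train_test_split

# Keep only inspections with a final grade; pending grades (e.g. "Z", "P") and
# missing values are set aside as deployment data.
graded = inspections[inspections["grade"].isin(["A", "B", "C"])].copy()

# Collapse B and C into a single fail class "F" for the binary problem.
graded["grade"] = graded["grade"].replace({"B": "F", "C": "F"})

# 75% / 25% train/test split, stratified to preserve the class imbalance.
train_df, test_df = train_test_split(
    graded, test_size=0.25, stratify=graded["grade"], random_state=522
)
```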
One important point to note is that when we grouped the restaurants by the camis feature, we saw that many restaurants were inspected more than once, and we are not sure whether distinct restaurants share the same name or whether some restaurants changed their names between 2012 and 2018. Since we could not account for this issue while modelling, we have added it to the limitations.
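A quick way to see this duplication is sketched below; the `camis` and `inspection_date` column names are assumptions based on the data dictionary, and this is not the project’s exact check.

```python
# Number of distinct inspection dates per establishment id (camis).
repeat_counts = train_df.groupby("camis")["inspection_date"].nunique()
print((repeat_counts > 1).sum(), "establishments appear in more than one inspection")
```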
Fig 1:
Figure 1. Distribution of Scores by Grade
From the above plot, we can see that Grade F restaurants are associated with higher scores on average than Grade A restaurants, even though some of them have low scores. Higher scores correspond to more critical health violations, but we cannot generalize, as there is no hard cut-off score below which a restaurant is always graded A.
Fig 2:
Figure 2. Severity of Violations
The above plot suggests that Grade F restaurants receive proportionally more critical violation flags than Grade A restaurants, but it is interesting that even Grade A restaurants have had some critical violations. It will be intriguing to see whether our model can determine if the seriousness of a violation actually matters for grading, because it is unclear what the threshold for a “major” violation is.
Fig 3:
Figure 3. Number of inspections conducted by NYC Borough
Since all of the boroughs have a majority of Grade A restaurants, we should be able to dine in any neighborhood of NYC. The majority of inspections took place in Manhattan, which also has the highest concentration of restaurants receiving a Grade F among the boroughs.
The complete EDA including the above figures and tables can be found here.
Table 2.1. Mean train and validation scores from each model.
Hyperparameter tuning results for logistic regression -
Table 2.3. Mean train and cross-validation scores (5-fold) for balanced logistic regression, optimizing F1 score.
Using random search with cross-validation (scikit-learn’s RandomizedSearchCV), we optimized the hyperparameters of the balanced logistic regression model to the following values (a sketch of the tuning setup is shown after the list):
C - 0.024947
max_features - 130
max_categories - 47
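The sketch below shows how such a search could be set up with scikit-learn. The feature lists, column names, preprocessing choices (scaling the score, one-hot encoding borough and cuisine, a bag-of-words representation of the violation description), and search ranges are all assumptions for illustration; only the tuned parameter names (C, max_features, max_categories) come from our results above.

```python
from scipy.stats import loguniform, randint
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocess numeric, categorical, and free-text features separately
# (assumed feature lists; requires scikit-learn >= 1.1 for max_categories).
preprocessor = make_column_transformer(
    (StandardScaler(), ["score"]),
    (OneHotEncoder(handle_unknown="infrequent_if_exist"), ["boro", "cuisine_description"]),
    (CountVectorizer(stop_words="english"), "violation_description"),
)

pipe = make_pipeline(
    preprocessor,
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Random search over C, the vectorizer's max_features, and the encoder's max_categories.
param_dist = {
    "logisticregression__C": loguniform(1e-3, 1e3),
    "columntransformer__countvectorizer__max_features": randint(10, 1000),
    "columntransformer__onehotencoder__max_categories": randint(5, 60),
}

search = RandomizedSearchCV(
    pipe, param_dist, n_iter=50, scoring="f1", cv=5, n_jobs=-1, random_state=522
)
search.fit(train_df.drop(columns=["grade"]), train_df["grade"] == "F")
print(search.best_params_, search.best_score_)
```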
Train/validation scores from the best model -
After performing cross-validation on the training set with the optimized hyperparameters for our model, we obtained a high F1 score of 0.975. Both the precision and recall scores on the validation set are high, indicating that the model is accurate in its predictions of whether or not a restaurant will receive an F grade.
Table 2.4. Mean and standard deviation of train and validation scores for the balanced logistic regression model.
Classification report from the best model on the test set -
Table 2.5. Classification report on the test set
Confusion matrices from the best model on train and test set
Figure 4. Confusion matrices from the best model on train and test set
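A sketch of how the report and confusion matrices could be produced from the fitted search object in the tuning sketch above (the column names and label mapping are the same assumptions as before):

```python
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

best_model = search.best_estimator_  # from the RandomizedSearchCV sketch above

X_train, y_train = train_df.drop(columns=["grade"]), train_df["grade"] == "F"
X_test, y_test = test_df.drop(columns=["grade"]), test_df["grade"] == "F"

# Per-class precision, recall, and F1 on the test set (cf. Table 2.5).
print(classification_report(y_test, best_model.predict(X_test), target_names=["A", "F"]))

# Confusion matrices on the train and test sets (cf. Figure 4).
ConfusionMatrixDisplay.from_estimator(best_model, X_train, y_train, display_labels=["A", "F"])
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test, display_labels=["A", "F"])
```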
PR curve from test set -
The PR curve below shows that, at our chosen threshold (after balancing), we obtain both high precision and high recall. Moving the threshold away from this point trades one off against the other, and we may no longer classify restaurants correctly into the Grade F class.
Figure 5. PR curve from test set
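For reference, such a curve can be generated directly from the fitted pipeline; this is a sketch reusing the test split defined above.

```python
from sklearn.metrics import PrecisionRecallDisplay

# Precision-recall curve on the test set (cf. Figure 5), using the predicted
# probability of the positive ("F") class.
PrecisionRecallDisplay.from_estimator(best_model, X_test, y_test, name="balanced logreg")
```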
ROC curve from test set -
The ROC curve plots the true positive rate against the false positive rate at different thresholds. From this graph we find that the area under the curve (AUC) is 1.00, the maximum possible value, which indicates that the model perfectly separates the two classes on this test set.
Figure 6. ROC curve from test set
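The corresponding ROC plot and AUC can be obtained the same way; again a sketch reusing the objects defined above.

```python
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# ROC curve on the test set (cf. Figure 6) and the corresponding AUC.
RocCurveDisplay.from_estimator(best_model, X_test, y_test, name="balanced logreg")
auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.2f}")
```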
NOTE on the high F1 score -
We are aware that the F1, precision, and recall scores of our model on the train, validation, and test sets are quite high. This may be because there are underlying linear relationships between the features and the target.
In our data analysis, we are making the following assumptions -
The EDA shows, with the help of the camis feature, that many of the restaurants have undergone inspections more than once, but it is unclear whether some restaurants share the same name or whether some restaurants changed their names between 2012 and 2018. Unfortunately, we were unable to partition the data by this feature to ensure that restaurants reviewed more than once do not appear in both the training and validation/test sets. Since we are not sure whether the models are in fact learning the features or the specific restaurant examples, there might be some discrepancies in the prediction results. We also had to downsample our training data set in order to reduce the training time, as we are short of computational resources.
In future work, we would like to handle repeated inspections by grouping on the camis feature and to perform predictions using the new model for other cities.