6.1. Exercises
Text Data Questions
X = [ "Take me to the river",
"Drop me in the water",
"Push me in the river",
" dip me in the water"]
Text Data True or False
CountVectorizer with Disaster Tweets
Instructions:
Running a coding exercise for the first time could take a bit of time for everything to load. Be patient, it could take a few minutes.
When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.
Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented them so that the line won't execute and you can test your code after each step.
We are going to bring in a new dataset for you to practice on.
This dataset contains a text column containing tweets associated with disaster keywords and a target column denoting whether a tweet is about a real disaster (1) or not (0). (Source)
In this question, we are going to explore how changing the value of max_features affects our training and cross-validation scores.
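As a small illustration of what `max_features` controls, the sketch below reuses the four short lines from the text data question above instead of the tweets (the leading space in the last line is dropped). The variable names `corpus`, `full` and `small` are made up for this example. With `max_features=3`, `CountVectorizer` keeps only the three most frequent words across the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus borrowed from the text data question (not the tweets dataset).
corpus = [
    "Take me to the river",
    "Drop me in the water",
    "Push me in the river",
    "dip me in the water",
]

# No limit: every distinct (lowercased) word becomes a column.
full = CountVectorizer()
full_counts = full.fit_transform(corpus)
print(full_counts.shape)  # (4, 10)

# max_features=3: keep only the 3 words with the highest total counts.
small = CountVectorizer(max_features=3)
small_counts = small.fit_transform(corpus)
print(small_counts.shape)        # (4, 3)
print(sorted(small.vocabulary_))  # ['in', 'me', 'the']
```

Fewer features means a smaller, coarser representation of each document, which is why the value is worth tuning rather than guessing.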
Tasks:
- Split the dataset into the feature table `X` and the target value `y`. `X` will be the single column `text` from the dataset, whereas `target` will be your `y`.
- Split your data into your training and testing data using a test size of 20% and a random state of 7.
- Make a pipeline with `CountVectorizer` as the first step and `SVC()` as the second. Name the pipeline `pipe`.
- Perform `RandomizedSearchCV` using the parameters specified in `param_grid` and name the search `tweet_search`.
- Don't forget to fit your search.
- What is the best `max_features` value? Save it in an object named `tweet_feats`.
- What is the best score? Save it in an object named `tweet_val_score`.
- Score the optimal model on the test set and save it in an object named `tweet_test_score`.
NOTE: This may take a few minutes to produce an output. Please be patient.
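The overall shape of this workflow can be sketched on made-up data. Everything in the block below is an assumption for illustration: the sentences and labels are invented stand-ins for the tweets, and the `param_grid` values, `n_iter` and `cv` are hypothetical choices, not the ones provided in the exercise. The step name `countvectorizer` is the one `make_pipeline` generates automatically.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data standing in for the tweets (1 = real disaster, 0 = not).
real = ["fire in the forest", "flood hits the town", "earthquake shakes city",
        "storm destroys homes", "wildfire spreads fast"]
not_real = ["my mixtape is fire", "flood of emails today", "this party is a disaster",
            "storm of applause", "earthquake of laughter"]
X = real * 2 + not_real * 2
y = [1] * 10 + [0] * 10

# 20% test split with a fixed random state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# CountVectorizer first, SVC second.
pipe = make_pipeline(CountVectorizer(), SVC())

# Hypothetical search space; the exercise supplies its own param_grid.
param_grid = {"countvectorizer__max_features": [2, 5, 10]}

tweet_search = RandomizedSearchCV(
    pipe, param_grid, n_iter=3, cv=3, random_state=7
)
tweet_search.fit(X_train, y_train)

tweet_feats = tweet_search.best_params_["countvectorizer__max_features"]
tweet_val_score = tweet_search.best_score_           # mean cross-validation score
tweet_test_score = tweet_search.score(X_test, y_test)  # accuracy on held-out data
```

On data this small the scores themselves mean little; the point is the order of the steps and where each named object comes from.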