6.1. Exercises

Text Data Questions

X = [ "Take me to the river",
    "Drop me in the water",
    "Push me in the river",
    " dip me in the water"]

Text Data True or False

CountVectorizer with Disaster Tweets

Instructions:
Running a coding exercise for the first time may take a while for everything to load. Be patient; it can take a few minutes.

When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.

Make sure you remove the hash (#) symbols in the coding portions of this question. We have commented out these lines so that they won’t execute, allowing you to test your code after each step.

We are going to bring in a new dataset for you to practice on.

This dataset has a text column containing tweets associated with disaster keywords and a target column denoting whether a tweet is about a real disaster (1) or not (0). (Source)

In this question, we are going to explore how changing the value of max_features affects our training and cross-validation scores.
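
The exact param_grid is provided in the exercise scaffold and is not shown here; purely as an assumption for illustration, it might look something like the following, with the countvectorizer__ prefix targeting the CountVectorizer step of the pipeline:

# Hypothetical example only -- the exercise supplies the actual param_grid.
param_grid = {
    "countvectorizer__max_features": [100, 500, 1000, 2000, 5000]
}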

Tasks:

  • Split the dataset into the feature table X and the target value y. X will be the single column text from the dataset, whereas the target column will be your y.
  • Split your data into your training and testing data using a test size of 20% and a random state of 7.
  • Make a pipeline with CountVectorizer as the first step and SVC() as the second. Name the pipeline pipe.
  • Perform RandomizedSearchCV using the parameters specified in param_grid and name the search tweet_search.
  • Don’t forget to fit your grid search.
  • What is the best max_features value? Save it in an object named tweet_feats.
  • What is the best score? Save it in an object named tweet_val_score.
  • Score the optimal model on the test set and save it in an object named tweet_test_score.

NOTE: This may take a few minutes to produce an output. Please be patient.

Hint 1
  • Are you splitting using train_test_split()?
  • Are you using make_pipeline(CountVectorizer(), SVC())?
  • Are you using RandomizedSearchCV() and calling pipe and param_grid as the first 2 arguments?
  • Are you naming the randomized grid search tweet_search?
  • Are you fitting tweet_search?
  • Are you using tweet_search.best_params_['countvectorizer__max_features'] to get the optimal number of features?
  • Are you using tweet_search.best_score_ to get the best validation score?
  • Are you using tweet_search.score(X_test, y_test) to get the test score?
Fully worked solution:
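
The graded solution is hidden in the course platform; below is a sketch of one possible solution. It assumes the dataset has already been loaded into a DataFrame named tweets_df with columns text and target, and that param_grid is defined along the lines sketched earlier; the n_iter, cv, and n_jobs settings are our own choices, not values specified by the exercise.

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Assumed: the exercise scaffold has loaded the tweets into tweets_df.
# Split the dataset into the feature table X (the text column) and target y.
X = tweets_df["text"]
y = tweets_df["target"]

# 80/20 train/test split with a fixed random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# Pipeline: bag-of-words features followed by a support vector classifier.
pipe = make_pipeline(CountVectorizer(), SVC())

# Randomized search over the max_features values specified in param_grid,
# passing pipe and param_grid as the first two arguments.
tweet_search = RandomizedSearchCV(pipe, param_grid, n_iter=5, cv=5,
                                  n_jobs=-1, random_state=7,
                                  return_train_score=True)
tweet_search.fit(X_train, y_train)

# Best max_features value, best cross-validation score, and test score.
tweet_feats = tweet_search.best_params_["countvectorizer__max_features"]
tweet_val_score = tweet_search.best_score_
tweet_test_score = tweet_search.score(X_test, y_test)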