Introduction to remove_stop_words
Background
During Natural Language Processing (NLP), data scientists may want to:
- extract additional characteristics from text data
- engineer their own features using these additional characteristics
- train their models with their engineered features for better model results
The remove_stop_words function from the textfeatureinfor package is one of four functions that aid in this text data extraction process. Specifically, remove_stop_words makes it easy to remove stopwords from a piece of text.
This document provides detailed instructions on how to apply the remove_stop_words function to text data for quick access to a new feature that can be added to various machine learning models.
You can start by loading the textfeatureinfor package:
library(textfeatureinfor)
Quick example with one piece of text
To understand the basics of the remove_stop_words function, we can take a look at its interaction with a simple text object.
text <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
remove_stop_words(text)
#> [1] "infinite" "universe" "human" "stupidity" "universe"
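From this basic example, we can see that only five tokens in this quote, attributed to Albert Einstein, are not stopwords. Note that "universe" appears twice in the result because repeated words are kept, and that the returned words are lowercase with punctuation removed.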
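To make the idea concrete, here is a minimal base R sketch of stopword removal. It is not the package's implementation: the helper name sketch_remove_stop_words and the tiny toy_stopwords list are hypothetical and were chosen only so that this one quote can be filtered by hand. The real function presumably relies on a much larger stopword list and may tokenize differently.

```r
# Hypothetical stopword list for illustration only; the list used by
# textfeatureinfor is not shown in this document and is likely much larger.
toy_stopwords <- c("two", "things", "are", "the", "and",
                   "i'm", "not", "sure", "about")

# Hypothetical helper, NOT part of textfeatureinfor.
sketch_remove_stop_words <- function(text, stopwords = toy_stopwords) {
  # Lowercase so that matching against the stopword list is case-insensitive.
  lowered <- tolower(text)
  # Split on any run of characters that is not a letter or an apostrophe.
  tokens <- strsplit(lowered, "[^a-z']+")[[1]]
  # Drop empty strings that a leading separator would leave behind.
  tokens <- tokens[nzchar(tokens)]
  # Keep only the tokens that are not in the stopword list.
  tokens[!tokens %in% stopwords]
}

text <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
sketch_result <- sketch_remove_stop_words(text)
sketch_result
```

With this toy list, the sketch returns the same five tokens as the call above: "infinite", "universe", "human", "stupidity", "universe". Any other input would need a fuller stopword list to give sensible results.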
Example with a dataframe
Now that we understand how to use this function, let's apply remove_stop_words to an entire data set. This is the more practical use of the function, since feature extraction needs to be done on entire data frames in order to be useful for machine learning models.
Let's first create a toy data set with five examples:
toy_data <- data.frame(messages = c(
  "UBC MDS is an amazing program!",
  "I don't like documentaries.",
  "I am impressed by what machine learning is capable of doing.",
  "NLP is a great way of processing text data.",
  "Isn't that AMAZING? Only two more blocks to go!!"
))
toy_data
#> messages
#> 1 UBC MDS is an amazing program!
#> 2 I don't like documentaries.
#> 3 I am impressed by what machine learning is capable of doing.
#> 4 NLP is a great way of processing text data.
#> 5 Isn't that AMAZING? Only two more blocks to go!!
Now, we will use the remove_stop_words function to extract the non-stopwords from the toy data and add them as a new column in the toy data frame:
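# rowwise() and mutate() come from dplyr; unnest() comes from tidyr
library(dplyr)
library(tidyr)
# apply remove_stop_words to each row, then expand to one row per non-stopword
toy_data <- toy_data |>
  rowwise() |>
  mutate(non_stopwords = list(remove_stop_words(messages))) |>
  unnest(cols = c(non_stopwords))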
toy_data
#> # A tibble: 14 × 2
#> messages non_stopwords
#> <chr> <chr>
#> 1 UBC MDS is an amazing program! ubc
#> 2 UBC MDS is an amazing program! mds
#> 3 UBC MDS is an amazing program! amazing
#> 4 UBC MDS is an amazing program! program
#> 5 I don't like documentaries. documentaries
#> 6 I am impressed by what machine learning is capable of doing. impressed
#> 7 I am impressed by what machine learning is capable of doing. machine
#> 8 I am impressed by what machine learning is capable of doing. learning
#> 9 I am impressed by what machine learning is capable of doing. capable
#> 10 NLP is a great way of processing text data. nlp
#> 11 NLP is a great way of processing text data. processing
#> 12 NLP is a great way of processing text data. data
#> 13 Isn't that AMAZING? Only two more blocks to go!! amazing
#> 14 Isn't that AMAZING? Only two more blocks to go!! blocks
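Now we have access to the non-stopwords for each example message in the toy data frame. Because of the unnest() step, the result is in long format: each of the 14 rows pairs a message with one of its non-stopwords, so a message appears once for every non-stopword it contains.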
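As one possible next step, the long-format result can be collapsed into a single numeric feature per message, such as the number of non-stopwords. The base R sketch below is only an illustration: it rebuilds the printed result by hand so that it runs without the package, and the names long_data, features, and n_non_stopwords are hypothetical.

```r
# Rebuild the long-format result printed above (values copied from that output).
msgs <- c(
  "UBC MDS is an amazing program!",
  "I don't like documentaries.",
  "I am impressed by what machine learning is capable of doing.",
  "NLP is a great way of processing text data.",
  "Isn't that AMAZING? Only two more blocks to go!!"
)
long_data <- data.frame(
  messages = rep(msgs, times = c(4, 1, 4, 3, 2)),
  non_stopwords = c(
    "ubc", "mds", "amazing", "program",
    "documentaries",
    "impressed", "machine", "learning", "capable",
    "nlp", "processing", "data",
    "amazing", "blocks"
  )
)

# Count rows per message; the factor levels keep the original message order.
counts <- table(factor(long_data$messages, levels = unique(long_data$messages)))

# One row per message with its non-stopword count as a numeric feature.
features <- data.frame(
  messages = names(counts),
  n_non_stopwords = as.integer(counts)
)
features
```

For the toy data this gives counts of 4, 1, 4, 3, and 2. A count like this is only one option; the non-stopwords could equally be kept as a list column or passed to a bag-of-words encoder, depending on the model.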