Introduction to remove_stop_words • textfeatureinfor

Background

During Natural Language Processing (NLP), data scientists may want:

extract additional characteristics from text data
engineer their own features using these additional characteristics
train their models with their engineered features for better model results

The remove_stop_words function from the textfeatureinfor package is one of four functions that aids in this text data extraction process. Specifically, remove_stop_words makes it easy to remove stopwords from a piece of text.

This document provides detailed instructions on how to apply the remove_stop_words function on text data for quick access to a new feature that can be added into various machine learning models.

You can start by loading the textfeatureinfor package:

library(textfeatureinfor)

Quick example with one piece of text

To understand the basics of the remove_stop_words function, we can take a look at its interaction with a simple text object.

text <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."

remove_stop_words(text)
#> [1] "infinite"  "universe"  "human"     "stupidity" "universe"

From this basic example, we can see that only 5 words are not stopwords in this quote from Albert Einstein.

Example with a dataframe

Now that we understand how to use this function, let’s apply remove_stop_words to an entire data set. This is more practical aspect of this function considering that feature extraction will need to be done for entire data-frames in order to be useful for machine learning models.

Let’s first create a toy data-set with 5 examples:

toy_data <- data.frame(messages = c(
  "UBC MDS is an amazing program!",
  "I don't like documentaries.",
  "I am impressed by what machine learning is capable of doing.",
  "NLP is a great way of processing text data.",
  "Isn't that AMAZING? Only two more blocks to go!!"
  ))
toy_data
#>                                                       messages
#> 1                               UBC MDS is an amazing program!
#> 2                                  I don't like documentaries.
#> 3 I am impressed by what machine learning is capable of doing.
#> 4                  NLP is a great way of processing text data.
#> 5             Isn't that AMAZING? Only two more blocks to go!!

Now, we will leverage the remove_stop_words function to extract non-stopwords from the toy data and add it as a new column in the toy data frame:

# applies remove_stop_words function to each row in the data frame
toy_data <- toy_data |>
    rowwise() |>
    mutate(non_stopwords = list(remove_stop_words(messages))) |>
    unnest(cols = c(non_stopwords))
toy_data
#> # A tibble: 14 × 2
#>    messages                                                     non_stopwords
#>    <chr>                                                        <chr>        
#>  1 UBC MDS is an amazing program!                               ubc          
#>  2 UBC MDS is an amazing program!                               mds          
#>  3 UBC MDS is an amazing program!                               amazing      
#>  4 UBC MDS is an amazing program!                               program      
#>  5 I don't like documentaries.                                  documentaries
#>  6 I am impressed by what machine learning is capable of doing. impressed    
#>  7 I am impressed by what machine learning is capable of doing. machine      
#>  8 I am impressed by what machine learning is capable of doing. learning     
#>  9 I am impressed by what machine learning is capable of doing. capable      
#> 10 NLP is a great way of processing text data.                  nlp          
#> 11 NLP is a great way of processing text data.                  processing   
#> 12 NLP is a great way of processing text data.                  data         
#> 13 Isn't that AMAZING? Only two more blocks to go!!             amazing      
#> 14 Isn't that AMAZING? Only two more blocks to go!!             blocks

Now we have access to non-stopwords for each example message in the toy data frame.

Summary

The remove_stop_words function is simple, but has broad implications. It allows for easier extraction of useful text information in one line of code. This feature is useful for natural language processing and may prove useful when building machine learning models.