Skip to contents

Background

During Natural Language Processing (NLP), data scientists may want:

  • extract additional characteristics from text data
  • engineer their own features using these additional characteristics
  • train their models with their engineered features for better model results

The avg_word_len function from the textfeatureinfor package is one of four functions that aids in this text data extraction process. Specifically, avg_word_len makes it easy to extract the average length of words inside of a piece of text.

This document provides detailed instructions on how to apply the avg_word_len function on text data for quick access to a new feature that can be added into various machine learning models.

You can start by loading the textfeatureinfor package:

library(textfeatureinfor)

Quick example with one piece of text

To understand the basics of the avg_word_len function, we can take a look at its interaction with a simple text object.

text <- "Star wars is my favourite movie!"

round(avg_word_len(text), 2) # the output of avg_word_len is rounded to 2 decimals
#> [1] 4.33

From this basic example, we can see that the average length of words in this piece of text is 4.33. Note that the exclamation point, ‘!’, is not counted as a character towards the average word length calculation. Moreover, the following is a list of punctuation marks that are disregarded during average word length calculations:

punc <- c(',','!','"','#','$','%','&',"'",'(',')','*','+','-','.',
            '/',':',';','<','=','>','?','@','[',']','^','_','`',
            '{','|','}','~')

Example with a dataframe

Now that we understand how to use this function, let’s apply avg_word_len to an entire data set. This is more practical aspect of this function considering that feature extraction will need to be done for entire data-frames in order to be useful for machine learning models.

Let’s first create a toy data-set with 5 examples:

toy_data <- data.frame(messages = c(
  "Hello World",
  "Star wars is my favourite movie",
  "Never have I ever",
  "What a wonderful world",
  "You should watch Spiderman: no way home"
  ))
toy_data
#>                                  messages
#> 1                             Hello World
#> 2         Star wars is my favourite movie
#> 3                       Never have I ever
#> 4                  What a wonderful world
#> 5 You should watch Spiderman: no way home

Now, we will leverage the avg_word_len function to extract the average word lengths from the toy data and add it as a new column in the toy data frame

# applies avg_word_len function to each row in the data frame
toy_data["average_word_lengths"] <- apply(toy_data["messages"], 
                                          1,
                                          avg_word_len) 

toy_data["average_word_lengths"] <- round(toy_data["average_word_lengths"], 2)
toy_data
#>                                  messages average_word_lengths
#> 1                             Hello World                 5.00
#> 2         Star wars is my favourite movie                 4.33
#> 3                       Never have I ever                 3.50
#> 4                  What a wonderful world                 4.75
#> 5 You should watch Spiderman: no way home                 4.57

Now we have access to the average length of the words for each example message in the toy data frame.

Summary

 

The avg_word_len function is simple, but has broad implications. It allows for easier extraction of useful text information in one line of code. This feature is useful for natural language processing and may prove useful when building machine learning models.