Introduction to avg_word_len
avg_word_len.Rmd
Background
During Natural Language Processing (NLP), data scientists may want:
- extract additional characteristics from text data
- engineer their own features using these additional characteristics
- train their models with their engineered features for better model results
The avg_word_len
function from the textfeatureinfor
package is one of four functions that aids in this text data extraction process. Specifically, avg_word_len
makes it easy to extract the average length of words inside of a piece of text.
This document provides detailed instructions on how to apply the avg_word_len
function on text data for quick access to a new feature that can be added into various machine learning models.
You can start by loading the textfeatureinfor
package:
library(textfeatureinfor)
Quick example with one piece of text
To understand the basics of the avg_word_len
function, we can take a look at its interaction with a simple text object.
text <- "Star wars is my favourite movie!"
round(avg_word_len(text), 2) # the output of avg_word_len is rounded to 2 decimals
#> [1] 4.33
From this basic example, we can see that the average length of words in this piece of text is 4.33
. Note that the exclamation point, ‘!’, is not counted as a character towards the average word length calculation. Moreover, the following is a list of punctuation marks that are disregarded during average word length calculations:
punc <- c(',','!','"','#','$','%','&',"'",'(',')','*','+','-','.',
'/',':',';','<','=','>','?','@','[',']','^','_','`',
'{','|','}','~')
Example with a dataframe
Now that we understand how to use this function, let’s apply avg_word_len
to an entire data set. This is more practical aspect of this function considering that feature extraction will need to be done for entire data-frames in order to be useful for machine learning models.
Let’s first create a toy data-set with 5 examples:
toy_data <- data.frame(messages = c(
"Hello World",
"Star wars is my favourite movie",
"Never have I ever",
"What a wonderful world",
"You should watch Spiderman: no way home"
))
toy_data
#> messages
#> 1 Hello World
#> 2 Star wars is my favourite movie
#> 3 Never have I ever
#> 4 What a wonderful world
#> 5 You should watch Spiderman: no way home
Now, we will leverage the avg_word_len
function to extract the average word lengths from the toy data and add it as a new column in the toy data frame
# applies avg_word_len function to each row in the data frame
toy_data["average_word_lengths"] <- apply(toy_data["messages"],
1,
avg_word_len)
toy_data["average_word_lengths"] <- round(toy_data["average_word_lengths"], 2)
toy_data
#> messages average_word_lengths
#> 1 Hello World 5.00
#> 2 Star wars is my favourite movie 4.33
#> 3 Never have I ever 3.50
#> 4 What a wonderful world 4.75
#> 5 You should watch Spiderman: no way home 4.57
Now we have access to the average length of the words for each example message in the toy data frame.