Background

During Natural Language Processing (NLP), data scientists may want to:

  • extract additional characteristics from text data
  • engineer their own features using these additional characteristics
  • train their models with their engineered features for better model results

The textfeatureinfor package provides four functions, one of which is perc_cap_words. Here we will explore the usage of this function.

Let us first load the textfeatureinfor package:

library(textfeatureinfor)

We can pass a piece of text to this function, and it will calculate the percentage of fully capitalised words in that text. More specifically, it first isolates and counts the words in the string. Second, it counts the fully capitalised words. Finally, it expresses the number of fully capitalised words as a percentage of the total number of words.
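The three steps above can be sketched in base R. This is only an illustrative approximation: the name perc_cap_words_sketch is hypothetical, and the whitespace tokenisation and punctuation handling are assumptions, so the packaged perc_cap_words may behave differently on edge cases.

```r
# Hypothetical stand-in for illustration; NOT the package's implementation.
perc_cap_words_sketch <- function(text) {
  # Step 1: split on whitespace and drop empty tokens
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) {
    return(0)
  }

  # Assumption: punctuation is ignored, so "GREAT!" still counts as capitalised
  cleaned <- gsub("[[:punct:]]", "", words)

  # Step 2: a word is fully capitalised if it contains a letter
  # and is unchanged by conversion to upper case
  is_cap <- grepl("[[:alpha:]]", cleaned) & cleaned == toupper(cleaned)

  # Step 3: share of fully capitalised words, as a percentage of all words
  100 * sum(is_cap) / length(words)
}

perc_cap_words_sketch("TODAY is a GREAT day to be alive!")
```

Under these assumptions a single-letter word such as "I" also counts as fully capitalised.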

Quick example for one piece of text

To understand the basics of the perc_cap_words function, we can take a look at its interaction with a simple text object.

text <- "TODAY is a GREAT day to be alive!"

perc_cap_words(text)
#> [1] 25

From this basic example, we can see that the function returns 25: the text contains 8 words, 2 of which (TODAY and GREAT) are fully capitalised, so 25% of the words are fully capitalised.

Example with a dataframe

Now that we understand how to use this function, let’s apply perc_cap_words to an entire data set. This is the more practical use of the function, since feature extraction needs to be done for entire data frames in order to be useful for machine learning models.

Let’s first create a toy data set with 5 examples:

toy_data <- data.frame(lines = c(
  "TODAY is a GREAT day to be alive!",
  "I can smell the flowers and see flocks of birds overhead.",
  "The weather forecast shows that its going to SNOW tomorrow",
  "So I will make the BEST USE of TODAY",
  "to take a LONG LONG run around the lake near me"
  ))

toy_data
#>                                                        lines
#> 1                          TODAY is a GREAT day to be alive!
#> 2  I can smell the flowers and see flocks of birds overhead.
#> 3 The weather forecast shows that its going to SNOW tomorrow
#> 4                       So I will make the BEST USE of TODAY
#> 5            to take a LONG LONG run around the lake near me

Now we will use the perc_cap_words function to extract the percentage of fully capitalised words from each line of the toy data and add it as a new column in the toy data frame:

# applies perc_cap_words function to each row in the data frame
toy_data["percentage_of_cap_words"] <- apply(toy_data["lines"], 
                                             1,
                                             perc_cap_words) 

# rounds the new column to two decimal places
toy_data["percentage_of_cap_words"] <- round(toy_data["percentage_of_cap_words"], 2)

toy_data
#>                                                        lines
#> 1                          TODAY is a GREAT day to be alive!
#> 2  I can smell the flowers and see flocks of birds overhead.
#> 3 The weather forecast shows that its going to SNOW tomorrow
#> 4                       So I will make the BEST USE of TODAY
#> 5            to take a LONG LONG run around the lake near me
#>   percentage_of_cap_words
#> 1                   25.00
#> 2                    9.09
#> 3                   10.00
#> 4                   44.44
#> 5                   18.18

Now we have access to the percentage of fully capitalised words for each example message in the toy data frame.
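As a small sketch of how such a column could feed into feature engineering, the snippet below derives a binary flag from the percentages shown in the output above. The values are copied from that output so the snippet runs without the package. The cut-off of 20 and the names toy_features, cap_threshold and high_cap_usage are hypothetical, chosen only for illustration.

```r
# Values copied from the percentage_of_cap_words column shown above
toy_features <- data.frame(
  percentage_of_cap_words = c(25.00, 9.09, 10.00, 44.44, 18.18)
)

# Hypothetical cut-off, chosen only for illustration
cap_threshold <- 20

# TRUE where more than 20% of the words are fully capitalised
toy_features$high_cap_usage <-
  toy_features$percentage_of_cap_words > cap_threshold

toy_features
```

In practice a suitable threshold, or whether to keep the raw percentage instead, would depend on the data set and the model being trained.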

Summary

This function can be useful for sentiment analysis in Natural Language Processing (NLP), where the use of fully capitalised words is often taken as a signal of increased excitement, anticipation, or stress.