Prepares a tibble by web scraping the raw text from a vector of URLs, ready to be used as input to a machine learning model

duster(urls, test = FALSE)

Arguments

urls

: character vector of target URLs

test

: (Optional) set to TRUE when running tests. Defaults to FALSE.

Value

df : A tibble with one row per URL, containing the webpage identifier (id), the raw URL (raw_url), and the raw text from the webpage with extra line breaks removed (raw_text)
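
To make the shape of these values more concrete, here is a minimal base R sketch of how an identifier and a cleaned text string of this kind could be produced. It is an illustration under stated assumptions, not duster's actual implementation: the helpers make_id and squash_breaks are hypothetical names, and the rule "host name up to the first dot, ignoring a leading www." is only a guess that happens to match the identifiers in the example output.

```r
# Hypothetical helpers, not part of the package: they only mimic the
# shape of the identifier and cleaned text described above.

# Derive a short identifier from a URL,
# e.g. "https://www.cnn.com/world" -> "cnn".
make_id <- function(urls) {
  # Drop the scheme and an optional "www." prefix, keep the host name.
  host <- sub("^https?://(www\\.)?([^/]+).*$", "\\2", urls)
  # Keep only the part of the host before the first dot.
  sub("\\..*$", "", host)
}

# Collapse runs of line breaks (and the whitespace around them)
# into a single space, then trim both ends.
squash_breaks <- function(text) {
  trimws(gsub("\\s*[\r\n]+\\s*", " ", text))
}

urls <- c("https://www.cnn.com/world", "https://www.cbc.ca/news/world")
make_id(urls)
squash_breaks("World news\n\n  breaking news\r\nvideo")
```

Note that this simple rule would return the subdomain for hosts such as news.example.com, so it should be read as a sketch of the idea rather than a general-purpose URL parser.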

Examples

duster(c('https://www.cnn.com/world', 'https://www.foxnews.com/world', 'https://www.cbc.ca/news/world'))
#> Test passed 🥇
#> Test passed 🎊
#> Test passed 🎉
#> # A tibble: 3 × 3
#>   id      raw_url                       raw_text                                
#>   <chr>   <chr>                         <chr>                                   
#> 1 cnn     https://www.cnn.com/world     "CNN values your feedback 1. How releva…
#> 2 foxnews https://www.foxnews.com/world "Fox News U.S. Politics Media Opinion B…
#> 3 cbc     https://www.cbc.ca/news/world "window.__INITIAL_STATE__ = {\"trending…