duster.Rd
Prepares a tibble by web scraping raw text from a list of URLs, ready to be used as input to a machine learning model
duster(urls, test = FALSE)
urls : character vector of target URLs
test : (Optional) set to TRUE when running tests; defaults to FALSE
df : A tibble with one row per URL, containing the webpage identifier (id), the raw URL (raw_url), and the raw text from the webpage with extra line breaks removed (raw_text)
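The shape of the returned value can be illustrated with a minimal base R sketch. It is not the package's implementation: `url_id()`, `squash_breaks()`, and the sample page text are hypothetical names and data introduced here, and a plain data frame stands in for the tibble so that the sketch needs no extra packages. It assumes the identifier is the host name with any leading `www.` and everything from the first remaining dot removed, which matches the identifiers shown in the example output.

```r
# Hypothetical helper (not part of the documented API): derive a short
# identifier from a URL, e.g. "https://www.cnn.com/world" -> "cnn".
url_id <- function(url) {
  host <- sub("^https?://([^/]+).*$", "\\1", url)  # keep only the host name
  host <- sub("^www\\.", "", host)                 # drop a leading "www."
  sub("\\..*$", "", host)                          # drop everything from the first dot
}

# Hypothetical helper: collapse each run of line breaks into a single space.
squash_breaks <- function(text) {
  gsub("[\r\n]+", " ", text)
}

urls <- c("https://www.cnn.com/world",
          "https://www.foxnews.com/world",
          "https://www.cbc.ca/news/world")

# Placeholder text standing in for scraped page content (assumption:
# real text would come from an HTTP request and HTML parsing step).
pages <- c("World news\n\nheadlines", "Fox News\r\nWorld", "CBC\nNews")

# A base data frame with the same three columns as the documented tibble.
df <- data.frame(id = url_id(urls),
                 raw_url = urls,
                 raw_text = squash_breaks(pages),
                 stringsAsFactors = FALSE)
```

Both helpers are vectorised, so they work on the whole `urls` vector at once. Real pages may need more cleaning than this, such as removing script contents or repeated whitespace.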
duster(c('https://www.cnn.com/world', 'https://www.foxnews.com/world', 'https://www.cbc.ca/news/world'))
#> Test passed 🥇
#> Test passed 🎊
#> Test passed 🎉
#> # A tibble: 3 × 3
#> id raw_url raw_text
#> <chr> <chr> <chr>
#> 1 cnn https://www.cnn.com/world "CNN values your feedback 1. How releva…
#> 2 foxnews https://www.foxnews.com/world "Fox News U.S. Politics Media Opinion B…
#> 3 cbc https://www.cbc.ca/news/world "window.__INITIAL_STATE__ = {\"trending…
"url raw_text
id
cnn1 https://www.cnn.com/world World news - breaking news, video, headlines ...
foxnews1 https://www.foxnews.com/world World | Fox NewsFox News U.S.PoliticsMediaOp...
cbc1 https://www.cbc.ca/news/world World - CBC NewsContentSkip to Main ContentAcc..."