Example Usage

The BrokkR package takes a list of URLs for webpages of interest and creates a data frame with a Bag of Words representation that can later be fed into a machine learning model of the user's choice. Alternatively, users can produce a data frame containing only the raw text of the target webpages and apply their own text representation to it.

This notebook gives an example of how to use BrokkR in a project.

In this example we start with a few top universities in Canada and walk through the package's four functions:

Function        Input                    Output
create_id()     a list of URLs           a list of unique URL IDs
brok_scrape()   a list of URLs           a list of scraped raw text
duster()        a list of URLs           the outputs of create_id() and brok_scrape(), combined
bow()           the output of duster()   the input with a bag of words appended

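List of URLs

This example uses the home pages of four universities: the University of Toronto, the University of British Columbia, McGill University and Queen's University. The URLs themselves are defined under Example input.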

Imports

library(BrokkR)

Example input

Based on the universities listed above, the following character vector of URLs is the input for create_id(), brok_scrape() and duster():

urls <- c('https://www.utoronto.ca/',
          'https://www.ubc.ca/',
          'https://www.mcgill.ca/',
          'https://www.queensu.ca/')

create_id()

Creates a unique ID for each URL in a list.

url_ids <- create_id(urls)
url_ids
#> [1] "utoronto" "ubc"      "mcgill"   "queensu"
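
The IDs above look like the site name taken from each URL. As a rough illustration of that idea only, here is a minimal base R sketch. The helper sketch_id() is hypothetical and is not part of BrokkR; how create_id() actually derives its IDs is not shown in this notebook.

```r
# Hypothetical helper, not part of BrokkR: derive a short ID from a URL.
sketch_id <- function(urls) {
  # Drop the scheme ("http://" or "https://") and an optional "www." prefix
  host <- sub("^https?://(www\\.)?", "", urls)
  # Keep only the text before the first dot
  sub("\\..*$", "", host)
}

sketch_id(c("https://www.utoronto.ca/", "https://www.ubc.ca/"))
#> [1] "utoronto" "ubc"
```

A rule this simple would not guarantee uniqueness (two URLs on the same site would collide), so it should be read as an illustration rather than a replacement for create_id().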

brok_scrape()

Returns a list containing the raw text of each webpage in the input.

list_rawtext <- brok_scrape(urls)
#> Test passed 🥳
#> Test passed 😸
#> Test passed 😀
#> Test passed 🥳
list_rawtext[[1]]
#> [1] "Skip to main content\nEmail\nQuercus\nAcorn\nJump To\nNews & Media\nAbout U of T\nGive To U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\nSearch\nEmail\nQuercus\nAcorn\nFuture Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\nNews & Media\nAbout U of T\nGive to U of T\nAcademics\nResearch & Innovation\nUniversity Life\nLibraries\nPrograms of Study\nA to Z\nFuture Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\nWhat can we help you with?\n<a href = \"/news/u-t-centre-india-launches-mumbai-partnership-tata-trusts?utm_source=UofTHome&amp;utm_medium=WebsiteBanner&amp;utm_content=UofTCentreIndiaLaunches\"> U of T News\nU of T Centre in India launches in Mumbai in partnership with Tata Trusts\n[node:body]\nMore\n\nCampus Status\nok\n\n \n\nYour guide to the U of T community\n\nVisit UTogether\nLatest news\nFebruary 3, 2023\nCreating 'Julia,' the first Muppet on the autism spectrum\nFebruary 3, 2023\nCanada’s Walk of Fame Awards honour three luminaries with U of T ties\nFebruary 2, 2023\nWith latest work, U of T grad challenges children's book genre\n\nMORE U OF T NEWS\n\nUpcoming events\n\nUpcoming Events - Home\nFebruary 2, 2023\nRe/Viewing, Re/Visioning, and Re/Imagining Black Canada Symposium\nFebruary 2, 2023\nBook Launch and Concert with Amir Issaa\nFebruary 3, 2023\nResearch Program Planning Workshop III\n\nMORE Events\n\nU of T Celebrates\n\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. U of T Celebrates recognizes their award-winning accomplishments.\n\nExplore U of T Celebrates\nRESEARCH & INNOVATION\nAngela Schoellig recognized with Arthur B. 
McDonald Fellowship\n<a href = \"/uoft-world\"> <img src = \"https://www.utoronto.ca/sites/default/files/UOT-ThreeLine-Blue_0_0_4_0_1.png\" alt = \"U of T World logo\"> </a>\n\nIn our latest issue: Meet the U of T alum pushing the boundaries of what computers can do; addressing Canada’s racial health gap; investigating crime, one fake body at a time. Plus: Ukrainian students find a haven at U of T; the future of urban farming and more\n\n \n\nExplore the issue\n\nProfile People The Costs of Extraction Kristen Bos investigates how pollution has affected – and continues to affect – Indigenous communities\n\nFuture Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\nNews & Media\nAbout U of T\nGive to U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\n\nContacts\nCareers\nAccessibility\nPrivacy\nSite Feedback\nSite Map\nSt. George Campus\nMississauga Campus\nScarborough Campus\nCampus Maps\nCampus Safety\nok\n\nStatement of Land Acknowledgement\n\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land. Read about U of T’s Statement of Land Acknowledgement.\n\nSOCIAL MEDIA DIRECTORY\n\nUNIVERSITY OF TORONTO - SINCE 1827"
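
Because each element is one long string with embedded newlines, it can be easier to inspect line by line. The sketch below uses a short made-up string in place of a real element of list_rawtext; the variable names are illustrative only.

```r
# A short stand-in for one element of list_rawtext (illustrative data only)
raw_text <- "Skip to main content\nEmail\n\nLatest news\n \nMORE U OF T NEWS"

# Split on newlines, trim surrounding whitespace, and drop empty lines
text_lines <- trimws(strsplit(raw_text, "\n", fixed = TRUE)[[1]])
text_lines <- text_lines[nzchar(text_lines)]

length(text_lines)
#> [1] 4
```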

duster()

Combines the outputs of create_id() and brok_scrape() into a single table with one row per URL and the columns id, raw_url and raw_text.

duster_list <- duster(urls)
#> Test passed 🥳
#> Test passed 🥳
#> Test passed 🎊
#> Test passed 🥇
duster_list
#> # A tibble: 4 × 3
#>   id       raw_url                  raw_text                                    
#>   <chr>    <chr>                    <chr>                                       
#> 1 utoronto https://www.utoronto.ca/ "Skip to main content Email Quercus Acorn J…
#> 2 ubc      https://www.ubc.ca/      "Skip to main content Search UBC Search Glo…
#> 3 mcgill   https://www.mcgill.ca/   "Skip to main content McGill University Adm…
#> 4 queensu  https://www.queensu.ca/  "(function(w,d,s,l,i){w[l]=w[l]||[];w[l].pu…
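
In the output above, raw_text shows spaces where the brok_scrape() output had newlines, which suggests some whitespace clean-up happens along the way. The following base R sketch shows one way such a table could be assembled by hand. It is an assumption about the general approach, not BrokkR's implementation, and all object names are illustrative.

```r
# Illustrative inputs standing in for the outputs of create_id() and brok_scrape()
example_ids  <- c("utoronto", "ubc")
example_urls <- c("https://www.utoronto.ca/", "https://www.ubc.ca/")
example_text <- list("Skip to main content\nEmail\nQuercus",
                     "Skip to main content\nSearch UBC")

sketch_table <- data.frame(
  id       = example_ids,
  raw_url  = example_urls,
  # Collapse newlines and repeated whitespace into single spaces
  raw_text = gsub("\\s+", " ", unlist(example_text)),
  stringsAsFactors = FALSE
)

sketch_table$raw_text[1]
#> [1] "Skip to main content Email Quercus"
```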

bow()

Appends a bag of words to the input data frame: one column per word, holding the number of times that word occurs in each page.

bag_of_words <- bow(duster_list)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
select(bag_of_words, starts_with("ne"))
#>   need net netid network neuroscience new news next
#> 1    0   0     0       0            0   0    7    0
#> 2    1   0     0       0            0   0    1    0
#> 3    0   1     0       1            1   5    4    3
#> 4    1   0     1       0            0   2    1    0
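
To make the word-count columns concrete, here is a minimal base R sketch of a bag-of-words table built from two made-up documents. The tokenisation rule (lowercase, split on non-letters) is an assumption for illustration; bow() may tokenise, filter or stem words differently.

```r
# Illustrative documents (not scraped data)
docs <- c(a = "News and more news", b = "New network, new ideas")

# Lowercase, split on anything that is not a letter, drop empty tokens
tokens <- strsplit(tolower(docs), "[^a-z]+")
tokens <- lapply(tokens, function(x) x[nzchar(x)])

# One column per distinct word, one row per document
vocab  <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
sketch_bow <- as.data.frame(counts)

sketch_bow[, c("new", "news")]
#>   new news
#> a   0    2
#> b   2    0
```

Fixing the factor levels to the shared vocabulary is what keeps the zero counts, so every document gets the same set of columns.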

The resulting bag_of_words object is a well-shaped table of word counts, the kind of input we typically need as a starting point in machine learning projects.