26  Web Scraping with R

Web scraping is the process of using a program to capture data from a website. This chapter introduces web scraping with R.

26.1 Ethics of webscraping

See the first section of this page for a discussion of the ethical and legal issues related to webscraping:

https://r4ds.hadley.nz/webscraping

26.2 How to webscrape with R

##################################################################.
##################################################################.
##
## In this file:
##
## - Using the rvest package to scrape information from a
##   basic HTML website
##
## - Packaging up the code to scrape a page into a function
##
## - Using a loop to scrape several pages of information
##   whose URLs differ in a predictable way.
##
## - Fixing fragile CSS selectors
## 
## - Intro to regular expressions (regex)
#################################################################.
#################################################################.

#-------------------------------------------------------------------------------
# Some online resources
#
# video demonstrating basics of web scraping in R
# - https://www.youtube.com/watch?v=v8Yh_4oE-Fs&t=275s
#
# Tips for editing HTML in VSCode
# - https://code.visualstudio.com/docs/languages/html
#
# Tips on how to use VSCode in general:
#   https://code.visualstudio.com/docs/getstarted/tips-and-tricks
#
# timing R code
# - https://www.r-bloggers.com/2017/05/5-ways-to-measure-running-time-of-r-code/
#
# video - using RSelenium to scrape a dynamic (i.e. JavaScript) webpage
# - https://www.youtube.com/watch?v=CN989KER4pA
#
# CSS selector rules
# - https://flukeout.github.io/ 
# - answers to the "game": https://gist.github.com/humbertodias/b878772e823fd9863a1e4c2415a3f7b6
#
# Intro to regular expressions
# - https://ryanstutorials.net/regular-expressions-tutorial/
#-------------------------------------------------------------------------------

# Install (if necessary) and load the rvest package

# The following two lines will install and load the rvest package.
# I commented out these lines in favor of the one line below
#
#install.packages("rvest")    # installs it on your machine (only need to do this once on your machine)
#library(rvest)              # do this every time you start RStudio  (also require(rvest) works)

# This one line will accomplish what the two lines above do. However,
# this one line will not install the package if it's not necessary.
if(!require(rvest)) { install.packages("rvest"); library(rvest) }   # this automatically also loads the "xml2" package
Loading required package: rvest
help(package="rvest")
help(package="xml2")

26.3 Download weather data

We’ll be working with the following webpage:

https://forecast.weather.gov/MapClick.php?lat=37.7771&lon=-122.4196#.Xl0j6BNKhTY

26.4 Scraping a piece of data

###########################################################################
#
# Code to download high/low temperature data from the following website
#
###########################################################################

url = "https://forecast.weather.gov/MapClick.php?lat=37.7771&lon=-122.4196#.Xl0j6BNKhTY"


# You can get a very specific CSS selector for a particular 
# part of a webpage by using the "copy css selector" feature of
# your browser. Every browser has a slightly different way of doing this,
# but they are all similar. In Chrome do the following:
#
# (1) right click on the portion of the page you want to scrape
#     and choose "inspect"
# (2) right click on the portion of the HTML that contains what
#     you want and choose "Copy" and then "Copy Selector"
# 
# This gives you a very, very specific selector to get this info.
# 
# For example, we followed this process for the first high temperature 
# on this webpage and got the following selector. 
#
# CSS SELECTOR: (note the leading # on the line below is part of the selector)
#    #seven-day-forecast-list > li:nth-child(1) > div > p.temp.temp-high
#
# WARNING - this is a VERY specific selector! You will often need
# to modify this selector so that it is LESS specific and gets 
# more than just the one piece of info you clicked on.

weatherPage = read_html(url)

cssSelector = "#seven-day-forecast-list > li:nth-child(1) > div > p.temp.temp-high"

temperature = weatherPage %>%
  html_elements( cssSelector ) %>%
  html_text2()

temperature
[1] "High: 68 °F"
x = weatherPage   # the result of read_html(url) above; reuse it instead of downloading the page again
x
{html_document}
<html class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n        <main class="container"><header class="row clearfix" id=" ...
y = html_elements(x, cssSelector )
y
{xml_nodeset (1)}
[1] <p class="temp temp-high">High: 68 °F</p>
z = html_text(y)
z
[1] "High: 68 °F"
# What is the structure of x?
str(x)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
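
The read, select, extract steps above can be tried without touching the network. The following is a minimal sketch, assuming a small hand-written page built with rvest's minimal_html(); the markup and temperatures are made up to imitate the forecast list and are not taken from the real site. It also shows one way to make the copied selector less specific, by dropping :nth-child(1).

```r
library(rvest)

# A hand-written page that imitates the structure of the forecast list.
# The markup and the numbers are invented for illustration only.
page <- minimal_html('
  <ul id="seven-day-forecast-list">
    <li><div><p class="temp temp-high">High: 68 F</p></div></li>
    <li><div><p class="temp temp-low">Low: 51 F</p></div></li>
    <li><div><p class="temp temp-high">High: 71 F</p></div></li>
  </ul>
')

# Very specific selector (as copied from the browser):
# only the high temperature inside the FIRST list item.
specific <- "#seven-day-forecast-list > li:nth-child(1) > div > p.temp.temp-high"
first_high <- page %>%
  html_elements(specific) %>%
  html_text2()

# Same selector without :nth-child(1):
# the high temperature inside ANY list item.
less_specific <- "#seven-day-forecast-list > li > div > p.temp.temp-high"
all_highs <- page %>%
  html_elements(less_specific) %>%
  html_text2()

first_high
all_highs
```

Here first_high holds one value and all_highs holds one value per list item that contains a high temperature.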

26.5 Choosing better CSS Selectors

####################################################################.
# If we analyze the HTML code, a MUCH BETTER selector is .temp
# This selects all html elements whose class attribute includes temp,
# for example class="temp temp-high" and class="temp temp-low"
####################################################################.

#------------------------------------------------------------------.
# NOTE: 
#
# When webscraping you should "play nicely" with the website.
# We already read the page above using the following line of code:
#
#    weatherPage = read_html(url)
#
# Don't do this again. Every time you read the page you are causing
# the website to do some work to generate the page. You also use
# network "bandwidth". 
#------------------------------------------------------------------.

# Don't do this again - we already did it. Read comment above for more info.
#
# weatherPage = read_html(url)
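
The "read the page once, then reuse it" advice above can be wrapped in a small helper. This is only a sketch: make_cached_reader() is a hypothetical function written for this example (it is not part of rvest), and the demonstration uses a stand-in reader that counts how often it is called, so that no real website is contacted.

```r
# Returns a function that calls reader(url) at most once per URL
# and afterwards hands back the stored result.
# make_cached_reader is a hypothetical helper, not an rvest function.
make_cached_reader <- function(reader, delay = 1) {
  cache <- new.env(parent = emptyenv())
  function(url) {
    if (!exists(url, envir = cache, inherits = FALSE)) {
      Sys.sleep(delay)                       # pause before each real request
      assign(url, reader(url), envir = cache)
    }
    get(url, envir = cache, inherits = FALSE)
  }
}

# Stand-in for read_html() that just counts how many times it is called
calls <- 0
fake_reader <- function(url) {
  calls <<- calls + 1
  paste("contents of", url)
}

read_once <- make_cached_reader(fake_reader, delay = 0)
first  <- read_once("https://example.com/a")
second <- read_once("https://example.com/a")   # served from the cache

calls
```

With rvest loaded, the same helper could presumably be created with make_cached_reader(read_html). The cached pages only live for the current R session.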

cssSelector = ".temp"

forecasts <- weatherPage %>%
  html_elements( cssSelector ) %>%
  html_text()

forecasts
[1] "High: 68 °F" "Low: 51 °F"  "High: 71 °F" "Low: 51 °F"  "High: 69 °F"
[6] "Low: 51 °F"  "High: 65 °F" "Low: 51 °F"  "High: 68 °F"
x = weatherPage   # reuse the page we already read (see the NOTE above) instead of downloading it again
x
{html_document}
<html class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n        <main class="container"><header class="row clearfix" id=" ...
y = html_elements(x, cssSelector )
y
{xml_nodeset (9)}
[1] <p class="temp temp-high">High: 68 °F</p>
[2] <p class="temp temp-low">Low: 51 °F</p>
[3] <p class="temp temp-high">High: 71 °F</p>
[4] <p class="temp temp-low">Low: 51 °F</p>
[5] <p class="temp temp-high">High: 69 °F</p>
[6] <p class="temp temp-low">Low: 51 °F</p>
[7] <p class="temp temp-high">High: 65 °F</p>
[8] <p class="temp temp-low">Low: 51 °F</p>
[9] <p class="temp temp-high">High: 68 °F</p>
z = html_text(y)
z
[1] "High: 68 °F" "Low: 51 °F"  "High: 71 °F" "Low: 51 °F"  "High: 69 °F"
[6] "Low: 51 °F"  "High: 65 °F" "Low: 51 °F"  "High: 68 °F"
# What is the structure of x?
str(x)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
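
Once the forecast strings have been scraped, they usually need to be split into usable pieces. The following is a minimal sketch using base R regular expressions on the first few values printed above; the variable names kind, degrees and temps are chosen for this example only.

```r
# The first four values from the output above, typed in by hand
forecasts <- c("High: 68 °F", "Low: 51 °F", "High: 71 °F", "Low: 51 °F")

# Everything before the colon: "High" or "Low"
kind <- sub(":.*$", "", forecasts)

# The first run of digits (with an optional minus sign) in each string.
# NOTE: regmatches() silently drops strings that contain no match,
# so this assumes every forecast contains a number.
degrees <- as.integer(regmatches(forecasts, regexpr("-?[0-9]+", forecasts)))

temps <- data.frame(kind = kind, degrees_f = degrees, stringsAsFactors = FALSE)
temps
```

Keeping the label and the number in separate columns makes it easy to filter, for example temps[temps$kind == "High", ].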

26.6 Scraping country names

###########################################################################
#
# Get the country names from the following website
#
###########################################################################

url = "https://scrapethissite.com/pages/simple/"
cssSelector = ".country-name"

countries <- read_html(url) %>%
  html_elements( cssSelector ) %>%
  html_text()
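
On this page html_text() returns each name padded with newlines and spaces, because the HTML itself is indented. The following is a minimal sketch of two ways to clean that up, assuming a hand-written fragment that imitates the country headings (the real page contains more markup per country). No request is sent to the website.

```r
library(rvest)

# Hand-written markup imitating two entries on the countries page
# (for illustration only).
page <- minimal_html('
  <h3 class="country-name">
      <i class="flag-icon"></i>
      Andorra
  </h3>
  <h3 class="country-name">
      <i class="flag-icon"></i>
      United Arab Emirates
  </h3>
')

nodes <- html_elements(page, ".country-name")

raw_names <- html_text(nodes)     # keeps the newlines and indentation
trimmed   <- trimws(raw_names)    # base R: strip leading/trailing whitespace
tidy      <- html_text2(nodes)    # rvest: treats whitespace roughly as a browser would

trimmed
```

Either approach should work for simple headings like these. trimws() only touches the ends of each string, while html_text2() also collapses whitespace inside the text.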

x = read_html(url)
x
{html_document}
<html lang="en">
 [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF- ...
 [2] <body>\n    <nav id="site-nav"><div class="container">\n                 ...
 [3] <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery. ...
 [4] <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstra ...
 [5] <script src="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotif ...
 [6] <link href="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify ...
 [7] <script type="text/javascript">\n    \n    PNotify.prototype.options.sty ...
 [8] <script type="text/javascript">\n    $("video").hover(function() {\n     ...
 [9] <script>\n    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r] ...
[10] <script>\n  !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){ ...
[11] <noscript><img height="1" width="1" style="display:none" src="https://ww ...
[12] <script type="text/javascript">\n    /* <![CDATA[ */\n    var google_con ...
[13] <script type="text/javascript" src="//www.googleadservices.com/pagead/co ...
[14] <noscript>\n    <div style="display:inline;">\n    <img height="1" width ...
[15] <script async src="https://www.googletagmanager.com/gtag/js?id=AW-950945 ...
[16] <script>\n   window.dataLayer = window.dataLayer || [];\n   function gta ...
y = html_elements(x, cssSelector)
y
{xml_nodeset (250)}
 [1] <h3 class="country-name">\n                            <i class="flag-ic ...
 [2] <h3 class="country-name">\n                            <i class="flag-ic ...
 [3] <h3 class="country-name">\n                            <i class="flag-ic ...
 [4] <h3 class="country-name">\n                            <i class="flag-ic ...
 [5] <h3 class="country-name">\n                            <i class="flag-ic ...
 [6] <h3 class="country-name">\n                            <i class="flag-ic ...
 [7] <h3 class="country-name">\n                            <i class="flag-ic ...
 [8] <h3 class="country-name">\n                            <i class="flag-ic ...
 [9] <h3 class="country-name">\n                            <i class="flag-ic ...
[10] <h3 class="country-name">\n                            <i class="flag-ic ...
[11] <h3 class="country-name">\n                            <i class="flag-ic ...
[12] <h3 class="country-name">\n                            <i class="flag-ic ...
[13] <h3 class="country-name">\n                            <i class="flag-ic ...
[14] <h3 class="country-name">\n                            <i class="flag-ic ...
[15] <h3 class="country-name">\n                            <i class="flag-ic ...
[16] <h3 class="country-name">\n                            <i class="flag-ic ...
[17] <h3 class="country-name">\n                            <i class="flag-ic ...
[18] <h3 class="country-name">\n                            <i class="flag-ic ...
[19] <h3 class="country-name">\n                            <i class="flag-ic ...
[20] <h3 class="country-name">\n                            <i class="flag-ic ...
...
z = html_text(y)
z
  [1] "\n                            \n                            Andorra\n                        "                                     
  [2] "\n                            \n                            United Arab Emirates\n                        "                        
  [3] "\n                            \n                            Afghanistan\n                        "                                 
  [4] "\n                            \n                            Antigua and Barbuda\n                        "                         
  [5] "\n                            \n                            Anguilla\n                        "                                    
  [6] "\n                            \n                            Albania\n                        "                                     
  [7] "\n                            \n                            Armenia\n                        "                                     
  [8] "\n                            \n                            Angola\n                        "                                      
  [9] "\n                            \n                            Antarctica\n                        "                                  
 [10] "\n                            \n                            Argentina\n                        "                                   
 [11] "\n                            \n                            American Samoa\n                        "                              
 [12] "\n                            \n                            Austria\n                        "                                     
 [13] "\n                            \n                            Australia\n                        "                                   
 [14] "\n                            \n                            Aruba\n                        "                                       
 [15] "\n                            \n                            Åland\n                        "                                       
 [16] "\n                            \n                            Azerbaijan\n                        "                                  
 [17] "\n                            \n                            Bosnia and Herzegovina\n                        "                      
 [18] "\n                            \n                            Barbados\n                        "                                    
 [19] "\n                            \n                            Bangladesh\n                        "                                  
 [20] "\n                            \n                            Belgium\n                        "                                     
 [21] "\n                            \n                            Burkina Faso\n                        "                                
 [22] "\n                            \n                            Bulgaria\n                        "                                    
 [23] "\n                            \n                            Bahrain\n                        "                                     
 [24] "\n                            \n                            Burundi\n                        "                                     
 [25] "\n                            \n                            Benin\n                        "                                       
 [26] "\n                            \n                            Saint Barthélemy\n                        "                            
 [27] "\n                            \n                            Bermuda\n                        "                                     
 [28] "\n                            \n                            Brunei\n                        "                                      
 [29] "\n                            \n                            Bolivia\n                        "                                     
 [30] "\n                            \n                            Bonaire\n                        "                                     
 [31] "\n                            \n                            Brazil\n                        "                                      
 [32] "\n                            \n                            Bahamas\n                        "                                     
 [33] "\n                            \n                            Bhutan\n                        "                                      
 [34] "\n                            \n                            Bouvet Island\n                        "                               
 [35] "\n                            \n                            Botswana\n                        "                                    
 [36] "\n                            \n                            Belarus\n                        "                                     
 [37] "\n                            \n                            Belize\n                        "                                      
 [38] "\n                            \n                            Canada\n                        "                                      
 [39] "\n                            \n                            Cocos [Keeling] Islands\n                        "                     
 [40] "\n                            \n                            Democratic Republic of the Congo\n                        "            
 [41] "\n                            \n                            Central African Republic\n                        "                    
 [42] "\n                            \n                            Republic of the Congo\n                        "                       
 [43] "\n                            \n                            Switzerland\n                        "                                 
 [44] "\n                            \n                            Ivory Coast\n                        "                                 
 [45] "\n                            \n                            Cook Islands\n                        "                                
 [46] "\n                            \n                            Chile\n                        "                                       
 [47] "\n                            \n                            Cameroon\n                        "                                    
 [48] "\n                            \n                            China\n                        "                                       
 [49] "\n                            \n                            Colombia\n                        "                                    
 [50] "\n                            \n                            Costa Rica\n                        "                                  
 [51] "\n                            \n                            Cuba\n                        "                                        
 [52] "\n                            \n                            Cape Verde\n                        "                                  
 [53] "\n                            \n                            Curacao\n                        "                                     
 [54] "\n                            \n                            Christmas Island\n                        "                            
 [55] "\n                            \n                            Cyprus\n                        "                                      
 [56] "\n                            \n                            Czech Republic\n                        "                              
 [57] "\n                            \n                            Germany\n                        "                                     
 [58] "\n                            \n                            Djibouti\n                        "                                    
 [59] "\n                            \n                            Denmark\n                        "                                     
 [60] "\n                            \n                            Dominica\n                        "                                    
 [61] "\n                            \n                            Dominican Republic\n                        "                          
 [62] "\n                            \n                            Algeria\n                        "                                     
 [63] "\n                            \n                            Ecuador\n                        "                                     
 [64] "\n                            \n                            Estonia\n                        "                                     
 [65] "\n                            \n                            Egypt\n                        "                                       
 [66] "\n                            \n                            Western Sahara\n                        "                              
 [67] "\n                            \n                            Eritrea\n                        "                                     
 [68] "\n                            \n                            Spain\n                        "                                       
 [69] "\n                            \n                            Ethiopia\n                        "                                    
 [70] "\n                            \n                            Finland\n                        "                                     
 [71] "\n                            \n                            Fiji\n                        "                                        
 [72] "\n                            \n                            Falkland Islands\n                        "                            
 [73] "\n                            \n                            Micronesia\n                        "                                  
 [74] "\n                            \n                            Faroe Islands\n                        "                               
 [75] "\n                            \n                            France\n                        "                                      
 [76] "\n                            \n                            Gabon\n                        "                                       
 [77] "\n                            \n                            United Kingdom\n                        "                              
 [78] "\n                            \n                            Grenada\n                        "                                     
 [79] "\n                            \n                            Georgia\n                        "                                     
 [80] "\n                            \n                            French Guiana\n                        "                               
 [81] "\n                            \n                            Guernsey\n                        "                                    
 [82] "\n                            \n                            Ghana\n                        "                                       
 [83] "\n                            \n                            Gibraltar\n                        "                                   
 [84] "\n                            \n                            Greenland\n                        "                                   
 [85] "\n                            \n                            Gambia\n                        "                                      
 [86] "\n                            \n                            Guinea\n                        "                                      
 [87] "\n                            \n                            Guadeloupe\n                        "                                  
 [88] "\n                            \n                            Equatorial Guinea\n                        "                           
 [89] "\n                            \n                            Greece\n                        "                                      
 [90] "\n                            \n                            South Georgia and the South Sandwich Islands\n                        "
 [91] "\n                            \n                            Guatemala\n                        "                                   
 [92] "\n                            \n                            Guam\n                        "                                        
 [93] "\n                            \n                            Guinea-Bissau\n                        "                               
 [94] "\n                            \n                            Guyana\n                        "                                      
 [95] "\n                            \n                            Hong Kong\n                        "                                   
 [96] "\n                            \n                            Heard Island and McDonald Islands\n                        "           
 [97] "\n                            \n                            Honduras\n                        "                                    
 [98] "\n                            \n                            Croatia\n                        "                                     
 [99] "\n                            \n                            Haiti\n                        "                                       
[100] "\n                            \n                            Hungary\n                        "                                     
[101] "\n                            \n                            Indonesia\n                        "                                   
[102] "\n                            \n                            Ireland\n                        "                                     
[103] "\n                            \n                            Israel\n                        "                                      
[104] "\n                            \n                            Isle of Man\n                        "                                 
[105] "\n                            \n                            India\n                        "                                       
[106] "\n                            \n                            British Indian Ocean Territory\n                        "              
[107] "\n                            \n                            Iraq\n                        "                                        
[108] "\n                            \n                            Iran\n                        "                                        
[109] "\n                            \n                            Iceland\n                        "                                     
[110] "\n                            \n                            Italy\n                        "                                       
[111] "\n                            \n                            Jersey\n                        "                                      
[112] "\n                            \n                            Jamaica\n                        "                                     
[113] "\n                            \n                            Jordan\n                        "                                      
[114] "\n                            \n                            Japan\n                        "                                       
[115] "\n                            \n                            Kenya\n                        "                                       
[116] "\n                            \n                            Kyrgyzstan\n                        "                                  
[117] "\n                            \n                            Cambodia\n                        "                                    
[118] "\n                            \n                            Kiribati\n                        "                                    
[119] "\n                            \n                            Comoros\n                        "                                     
[120] "\n                            \n                            Saint Kitts and Nevis\n                        "                       
[121] "\n                            \n                            North Korea\n                        "                                 
[122] "\n                            \n                            South Korea\n                        "                                 
[123] "\n                            \n                            Kuwait\n                        "                                      
[124] "\n                            \n                            Cayman Islands\n                        "                              
[125] "\n                            \n                            Kazakhstan\n                        "                                  
[126] "\n                            \n                            Laos\n                        "                                        
[127] "\n                            \n                            Lebanon\n                        "                                     
[128] "\n                            \n                            Saint Lucia\n                        "                                 
[129] "\n                            \n                            Liechtenstein\n                        "                               
[130] "\n                            \n                            Sri Lanka\n                        "                                   
[131] "\n                            \n                            Liberia\n                        "                                     
[132] "\n                            \n                            Lesotho\n                        "                                     
[133] "\n                            \n                            Lithuania\n                        "                                   
[134] "\n                            \n                            Luxembourg\n                        "                                  
[135] "\n                            \n                            Latvia\n                        "                                      
[136] "\n                            \n                            Libya\n                        "                                       
[137] "\n                            \n                            Morocco\n                        "                                     
[138] "\n                            \n                            Monaco\n                        "                                      
[139] "\n                            \n                            Moldova\n                        "                                     
[140] "\n                            \n                            Montenegro\n                        "                                  
[141] "\n                            \n                            Saint Martin\n                        "                                
[142] "\n                            \n                            Madagascar\n                        "                                  
[143] "\n                            \n                            Marshall Islands\n                        "                            
[144] "\n                            \n                            Macedonia\n                        "                                   
[145] "\n                            \n                            Mali\n                        "                                        
[146] "\n                            \n                            Myanmar [Burma]\n                        "                             
[147] "\n                            \n                            Mongolia\n                        "                                    
[148] "\n                            \n                            Macao\n                        "                                       
[149] "\n                            \n                            Northern Mariana Islands\n                        "                    
[150] "\n                            \n                            Martinique\n                        "                                  
[151] "\n                            \n                            Mauritania\n                        "                                  
[152] "\n                            \n                            Montserrat\n                        "                                  
[153] "\n                            \n                            Malta\n                        "                                       
[154] "\n                            \n                            Mauritius\n                        "                                   
[155] "\n                            \n                            Maldives\n                        "                                    
[156] "\n                            \n                            Malawi\n                        "                                      
[157] "\n                            \n                            Mexico\n                        "                                      
[158] "\n                            \n                            Malaysia\n                        "                                    
[159] "\n                            \n                            Mozambique\n                        "                                  
[160] "\n                            \n                            Namibia\n                        "                                     
[161] "\n                            \n                            New Caledonia\n                        "                               
[162] "\n                            \n                            Niger\n                        "                                       
[163] "\n                            \n                            Norfolk Island\n                        "                              
[164] "\n                            \n                            Nigeria\n                        "                                     
[165] "\n                            \n                            Nicaragua\n                        "                                   
[166] "\n                            \n                            Netherlands\n                        "                                 
[167] "\n                            \n                            Norway\n                        "                                      
[168] "\n                            \n                            Nepal\n                        "                                       
[169] "\n                            \n                            Nauru\n                        "                                       
[170] "\n                            \n                            Niue\n                        "                                        
[171] "\n                            \n                            New Zealand\n                        "                                 
[172] "\n                            \n                            Oman\n                        "                                        
[173] "\n                            \n                            Panama\n                        "                                      
[174] "\n                            \n                            Peru\n                        "                                        
[175] "\n                            \n                            French Polynesia\n                        "                            
[176] "\n                            \n                            Papua New Guinea\n                        "                            
[177] "\n                            \n                            Philippines\n                        "                                 
[178] "\n                            \n                            Pakistan\n                        "                                    
[179] "\n                            \n                            Poland\n                        "                                      
[180] "\n                            \n                            Saint Pierre and Miquelon\n                        "                   
[181] "\n                            \n                            Pitcairn Islands\n                        "                            
[182] "\n                            \n                            Puerto Rico\n                        "                                 
[183] "\n                            \n                            Palestine\n                        "                                   
[184] "\n                            \n                            Portugal\n                        "                                    
[185] "\n                            \n                            Palau\n                        "                                       
[186] "\n                            \n                            Paraguay\n                        "                                    
[187] "\n                            \n                            Qatar\n                        "                                       
[188] "\n                            \n                            Réunion\n                        "                                     
[189] "\n                            \n                            Romania\n                        "                                     
[190] "\n                            \n                            Serbia\n                        "                                      
[191] "\n                            \n                            Russia\n                        "                                      
[192] "\n                            \n                            Rwanda\n                        "                                      
[193] "\n                            \n                            Saudi Arabia\n                        "                                
[194] "\n                            \n                            Solomon Islands\n                        "                             
[195] "\n                            \n                            Seychelles\n                        "                                  
[196] "\n                            \n                            Sudan\n                        "                                       
[197] "\n                            \n                            Sweden\n                        "                                      
[198] "\n                            \n                            Singapore\n                        "                                   
[199] "\n                            \n                            Saint Helena\n                        "                                
[200] "\n                            \n                            Slovenia\n                        "                                    
[201] "\n                            \n                            Svalbard and Jan Mayen\n                        "                      
[202] "\n                            \n                            Slovakia\n                        "                                    
[203] "\n                            \n                            Sierra Leone\n                        "                                
[204] "\n                            \n                            San Marino\n                        "                                  
[205] "\n                            \n                            Senegal\n                        "                                     
[206] "\n                            \n                            Somalia\n                        "                                     
[207] "\n                            \n                            Suriname\n                        "                                    
[208] "\n                            \n                            South Sudan\n                        "                                 
[209] "\n                            \n                            São Tomé and Príncipe\n                        "                       
[210] "\n                            \n                            El Salvador\n                        "                                 
[211] "\n                            \n                            Sint Maarten\n                        "                                
[212] "\n                            \n                            Syria\n                        "                                       
[213] "\n                            \n                            Swaziland\n                        "                                   
[214] "\n                            \n                            Turks and Caicos Islands\n                        "                    
[215] "\n                            \n                            Chad\n                        "                                        
[216] "\n                            \n                            French Southern Territories\n                        "                 
[217] "\n                            \n                            Togo\n                        "                                        
[218] "\n                            \n                            Thailand\n                        "                                    
[219] "\n                            \n                            Tajikistan\n                        "                                  
[220] "\n                            \n                            Tokelau\n                        "                                     
[221] "\n                            \n                            East Timor\n                        "                                  
[222] "\n                            \n                            Turkmenistan\n                        "                                
[223] "\n                            \n                            Tunisia\n                        "                                     
[224] "\n                            \n                            Tonga\n                        "                                       
[225] "\n                            \n                            Turkey\n                        "                                      
[226] "\n                            \n                            Trinidad and Tobago\n                        "                         
[227] "\n                            \n                            Tuvalu\n                        "                                      
[228] "\n                            \n                            Taiwan\n                        "                                      
[229] "\n                            \n                            Tanzania\n                        "                                    
[230] "\n                            \n                            Ukraine\n                        "                                     
[231] "\n                            \n                            Uganda\n                        "                                      
[232] "\n                            \n                            U.S. Minor Outlying Islands\n                        "                 
[233] "\n                            \n                            United States\n                        "                               
[234] "\n                            \n                            Uruguay\n                        "                                     
[235] "\n                            \n                            Uzbekistan\n                        "                                  
[236] "\n                            \n                            Vatican City\n                        "                                
[237] "\n                            \n                            Saint Vincent and the Grenadines\n                        "            
[238] "\n                            \n                            Venezuela\n                        "                                   
[239] "\n                            \n                            British Virgin Islands\n                        "                      
[240] "\n                            \n                            U.S. Virgin Islands\n                        "                         
[241] "\n                            \n                            Vietnam\n                        "                                     
[242] "\n                            \n                            Vanuatu\n                        "                                     
[243] "\n                            \n                            Wallis and Futuna\n                        "                           
[244] "\n                            \n                            Samoa\n                        "                                       
[245] "\n                            \n                            Kosovo\n                        "                                      
[246] "\n                            \n                            Yemen\n                        "                                       
[247] "\n                            \n                            Mayotte\n                        "                                     
[248] "\n                            \n                            South Africa\n                        "                                
[249] "\n                            \n                            Zambia\n                        "                                      
[250] "\n                            \n                            Zimbabwe\n                        "                                    
#..............................................................................
# Let's examine the results.
# 
# Notice that the results include the newlines (\n) and blanks from the .html
# file. This is because we picked up EVERYTHING that appears between the
# start tag and end tag in the HTML.
#
# Below, we will see how to fix up the results by using the gsub function.
#..............................................................................
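
#..............................................................................
# A quick preview of the cleanup idea (a minimal sketch, not necessarily the
# exact approach used later in this file).
#
# ASSUMPTIONS: sample_countries and cleaned are made-up names used only for
# this illustration. The two values are copied by hand in the same shape as
# the scraped output; they are not read from the website.
#
# The pattern has two alternatives separated by | :
#   ^\\s+   one or more whitespace characters at the START of the value
#   \\s+$   one or more whitespace characters at the END of the value
# Spaces INSIDE a name (e.g. between "United" and "Arab") are not at the
# start or the end, so they are left alone.
#..............................................................................

sample_countries <- c(
  "\n                            \n                            Andorra\n                        ",
  "\n                            \n                            United Arab Emirates\n                        "
)

# Replace the leading and trailing whitespace with "" (i.e. nothing)
cleaned <- gsub("^\\s+|\\s+$", "", sample_countries)
cleaned
# [1] "Andorra"              "United Arab Emirates"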

countries  
  [1] "\n                            \n                            Andorra\n                        "                                     
  [2] "\n                            \n                            United Arab Emirates\n                        "                        
  [3] "\n                            \n                            Afghanistan\n                        "                                 
  [4] "\n                            \n                            Antigua and Barbuda\n                        "                         
  [5] "\n                            \n                            Anguilla\n                        "                                    
  [6] "\n                            \n                            Albania\n                        "                                     
  [7] "\n                            \n                            Armenia\n                        "                                     
  [8] "\n                            \n                            Angola\n                        "                                      
  [9] "\n                            \n                            Antarctica\n                        "                                  
 [10] "\n                            \n                            Argentina\n                        "                                   
 [11] "\n                            \n                            American Samoa\n                        "                              
 [12] "\n                            \n                            Austria\n                        "                                     
 [13] "\n                            \n                            Australia\n                        "                                   
 [14] "\n                            \n                            Aruba\n                        "                                       
 [15] "\n                            \n                            Åland\n                        "                                       
 [16] "\n                            \n                            Azerbaijan\n                        "                                  
 [17] "\n                            \n                            Bosnia and Herzegovina\n                        "                      
 [18] "\n                            \n                            Barbados\n                        "                                    
 [19] "\n                            \n                            Bangladesh\n                        "                                  
 [20] "\n                            \n                            Belgium\n                        "                                     
 [21] "\n                            \n                            Burkina Faso\n                        "                                
 [22] "\n                            \n                            Bulgaria\n                        "                                    
 [23] "\n                            \n                            Bahrain\n                        "                                     
 [24] "\n                            \n                            Burundi\n                        "                                     
 [25] "\n                            \n                            Benin\n                        "                                       
 [26] "\n                            \n                            Saint Barthélemy\n                        "                            
 [27] "\n                            \n                            Bermuda\n                        "                                     
 [28] "\n                            \n                            Brunei\n                        "                                      
 [29] "\n                            \n                            Bolivia\n                        "                                     
 [30] "\n                            \n                            Bonaire\n                        "                                     
 [31] "\n                            \n                            Brazil\n                        "                                      
 [32] "\n                            \n                            Bahamas\n                        "                                     
 [33] "\n                            \n                            Bhutan\n                        "                                      
 [34] "\n                            \n                            Bouvet Island\n                        "                               
 [35] "\n                            \n                            Botswana\n                        "                                    
 [36] "\n                            \n                            Belarus\n                        "                                     
 [37] "\n                            \n                            Belize\n                        "                                      
 [38] "\n                            \n                            Canada\n                        "                                      
 [39] "\n                            \n                            Cocos [Keeling] Islands\n                        "                     
 [40] "\n                            \n                            Democratic Republic of the Congo\n                        "            
 [41] "\n                            \n                            Central African Republic\n                        "                    
 [42] "\n                            \n                            Republic of the Congo\n                        "                       
 [43] "\n                            \n                            Switzerland\n                        "                                 
 [44] "\n                            \n                            Ivory Coast\n                        "                                 
 [45] "\n                            \n                            Cook Islands\n                        "                                
 [46] "\n                            \n                            Chile\n                        "                                       
 [47] "\n                            \n                            Cameroon\n                        "                                    
 [48] "\n                            \n                            China\n                        "                                       
 [49] "\n                            \n                            Colombia\n                        "                                    
 [50] "\n                            \n                            Costa Rica\n                        "                                  
 [51] "\n                            \n                            Cuba\n                        "                                        
 [52] "\n                            \n                            Cape Verde\n                        "                                  
 [53] "\n                            \n                            Curacao\n                        "                                     
 [54] "\n                            \n                            Christmas Island\n                        "                            
 [55] "\n                            \n                            Cyprus\n                        "                                      
 [56] "\n                            \n                            Czech Republic\n                        "                              
 [57] "\n                            \n                            Germany\n                        "                                     
 [58] "\n                            \n                            Djibouti\n                        "                                    
 [59] "\n                            \n                            Denmark\n                        "                                     
 [60] "\n                            \n                            Dominica\n                        "                                    
 [61] "\n                            \n                            Dominican Republic\n                        "                          
 [62] "\n                            \n                            Algeria\n                        "                                     
 [63] "\n                            \n                            Ecuador\n                        "                                     
 [64] "\n                            \n                            Estonia\n                        "                                     
 [65] "\n                            \n                            Egypt\n                        "                                       
 [66] "\n                            \n                            Western Sahara\n                        "                              
 [67] "\n                            \n                            Eritrea\n                        "                                     
 [68] "\n                            \n                            Spain\n                        "                                       
 [69] "\n                            \n                            Ethiopia\n                        "                                    
 [70] "\n                            \n                            Finland\n                        "                                     
 [71] "\n                            \n                            Fiji\n                        "                                        
 [72] "\n                            \n                            Falkland Islands\n                        "                            
 [73] "\n                            \n                            Micronesia\n                        "                                  
 [74] "\n                            \n                            Faroe Islands\n                        "                               
 [75] "\n                            \n                            France\n                        "                                      
 [76] "\n                            \n                            Gabon\n                        "                                       
 [77] "\n                            \n                            United Kingdom\n                        "                              
 [78] "\n                            \n                            Grenada\n                        "                                     
 [79] "\n                            \n                            Georgia\n                        "                                     
 [80] "\n                            \n                            French Guiana\n                        "                               
 [81] "\n                            \n                            Guernsey\n                        "                                    
 [82] "\n                            \n                            Ghana\n                        "                                       
 [83] "\n                            \n                            Gibraltar\n                        "                                   
 [84] "\n                            \n                            Greenland\n                        "                                   
 [85] "\n                            \n                            Gambia\n                        "                                      
 [86] "\n                            \n                            Guinea\n                        "                                      
 [87] "\n                            \n                            Guadeloupe\n                        "                                  
 [88] "\n                            \n                            Equatorial Guinea\n                        "                           
 [89] "\n                            \n                            Greece\n                        "                                      
 [90] "\n                            \n                            South Georgia and the South Sandwich Islands\n                        "
 [91] "\n                            \n                            Guatemala\n                        "                                   
 [92] "\n                            \n                            Guam\n                        "                                        
 [93] "\n                            \n                            Guinea-Bissau\n                        "                               
 [94] "\n                            \n                            Guyana\n                        "                                      
 [95] "\n                            \n                            Hong Kong\n                        "                                   
 [96] "\n                            \n                            Heard Island and McDonald Islands\n                        "           
 [97] "\n                            \n                            Honduras\n                        "                                    
 [98] "\n                            \n                            Croatia\n                        "                                     
 [99] "\n                            \n                            Haiti\n                        "                                       
[100] "\n                            \n                            Hungary\n                        "                                     
[101] "\n                            \n                            Indonesia\n                        "                                   
[102] "\n                            \n                            Ireland\n                        "                                     
[103] "\n                            \n                            Israel\n                        "                                      
[104] "\n                            \n                            Isle of Man\n                        "                                 
[105] "\n                            \n                            India\n                        "                                       
[106] "\n                            \n                            British Indian Ocean Territory\n                        "              
[107] "\n                            \n                            Iraq\n                        "                                        
[108] "\n                            \n                            Iran\n                        "                                        
[109] "\n                            \n                            Iceland\n                        "                                     
[110] "\n                            \n                            Italy\n                        "                                       
[111] "\n                            \n                            Jersey\n                        "                                      
[112] "\n                            \n                            Jamaica\n                        "                                     
[113] "\n                            \n                            Jordan\n                        "                                      
[114] "\n                            \n                            Japan\n                        "                                       
[115] "\n                            \n                            Kenya\n                        "                                       
[116] "\n                            \n                            Kyrgyzstan\n                        "                                  
[117] "\n                            \n                            Cambodia\n                        "                                    
[118] "\n                            \n                            Kiribati\n                        "                                    
[119] "\n                            \n                            Comoros\n                        "                                     
[120] "\n                            \n                            Saint Kitts and Nevis\n                        "                       
[121] "\n                            \n                            North Korea\n                        "                                 
[122] "\n                            \n                            South Korea\n                        "                                 
[123] "\n                            \n                            Kuwait\n                        "                                      
[124] "\n                            \n                            Cayman Islands\n                        "                              
[125] "\n                            \n                            Kazakhstan\n                        "                                  
[126] "\n                            \n                            Laos\n                        "                                        
[127] "\n                            \n                            Lebanon\n                        "                                     
[128] "\n                            \n                            Saint Lucia\n                        "                                 
[129] "\n                            \n                            Liechtenstein\n                        "                               
[130] "\n                            \n                            Sri Lanka\n                        "                                   
[131] "\n                            \n                            Liberia\n                        "                                     
[132] "\n                            \n                            Lesotho\n                        "                                     
[133] "\n                            \n                            Lithuania\n                        "                                   
[134] "\n                            \n                            Luxembourg\n                        "                                  
[135] "\n                            \n                            Latvia\n                        "                                      
[136] "\n                            \n                            Libya\n                        "                                       
[137] "\n                            \n                            Morocco\n                        "                                     
[138] "\n                            \n                            Monaco\n                        "                                      
[139] "\n                            \n                            Moldova\n                        "                                     
[140] "\n                            \n                            Montenegro\n                        "                                  
[141] "\n                            \n                            Saint Martin\n                        "                                
[142] "\n                            \n                            Madagascar\n                        "                                  
[143] "\n                            \n                            Marshall Islands\n                        "                            
[144] "\n                            \n                            Macedonia\n                        "                                   
[145] "\n                            \n                            Mali\n                        "                                        
[146] "\n                            \n                            Myanmar [Burma]\n                        "                             
[147] "\n                            \n                            Mongolia\n                        "                                    
[148] "\n                            \n                            Macao\n                        "                                       
[149] "\n                            \n                            Northern Mariana Islands\n                        "                    
[150] "\n                            \n                            Martinique\n                        "                                  
[151] "\n                            \n                            Mauritania\n                        "                                  
[152] "\n                            \n                            Montserrat\n                        "                                  
[153] "\n                            \n                            Malta\n                        "                                       
[154] "\n                            \n                            Mauritius\n                        "                                   
[155] "\n                            \n                            Maldives\n                        "                                    
[156] "\n                            \n                            Malawi\n                        "                                      
[157] "\n                            \n                            Mexico\n                        "                                      
[158] "\n                            \n                            Malaysia\n                        "                                    
[159] "\n                            \n                            Mozambique\n                        "                                  
[160] "\n                            \n                            Namibia\n                        "                                     
[161] "\n                            \n                            New Caledonia\n                        "                               
[162] "\n                            \n                            Niger\n                        "                                       
[163] "\n                            \n                            Norfolk Island\n                        "                              
[164] "\n                            \n                            Nigeria\n                        "                                     
[165] "\n                            \n                            Nicaragua\n                        "                                   
[166] "\n                            \n                            Netherlands\n                        "                                 
[167] "\n                            \n                            Norway\n                        "                                      
[168] "\n                            \n                            Nepal\n                        "                                       
[169] "\n                            \n                            Nauru\n                        "                                       
[170] "\n                            \n                            Niue\n                        "                                        
[171] "\n                            \n                            New Zealand\n                        "                                 
[172] "\n                            \n                            Oman\n                        "                                        
[173] "\n                            \n                            Panama\n                        "                                      
[174] "\n                            \n                            Peru\n                        "                                        
[175] "\n                            \n                            French Polynesia\n                        "                            
[176] "\n                            \n                            Papua New Guinea\n                        "                            
[177] "\n                            \n                            Philippines\n                        "                                 
[178] "\n                            \n                            Pakistan\n                        "                                    
[179] "\n                            \n                            Poland\n                        "                                      
[180] "\n                            \n                            Saint Pierre and Miquelon\n                        "                   
[181] "\n                            \n                            Pitcairn Islands\n                        "                            
[182] "\n                            \n                            Puerto Rico\n                        "                                 
[183] "\n                            \n                            Palestine\n                        "                                   
[184] "\n                            \n                            Portugal\n                        "                                    
[185] "\n                            \n                            Palau\n                        "                                       
[186] "\n                            \n                            Paraguay\n                        "                                    
[187] "\n                            \n                            Qatar\n                        "                                       
[188] "\n                            \n                            Réunion\n                        "                                     
[189] "\n                            \n                            Romania\n                        "                                     
[190] "\n                            \n                            Serbia\n                        "                                      
[191] "\n                            \n                            Russia\n                        "                                      
[192] "\n                            \n                            Rwanda\n                        "                                      
[193] "\n                            \n                            Saudi Arabia\n                        "                                
[194] "\n                            \n                            Solomon Islands\n                        "                             
[195] "\n                            \n                            Seychelles\n                        "                                  
[196] "\n                            \n                            Sudan\n                        "                                       
[197] "\n                            \n                            Sweden\n                        "                                      
[198] "\n                            \n                            Singapore\n                        "                                   
[199] "\n                            \n                            Saint Helena\n                        "                                
[200] "\n                            \n                            Slovenia\n                        "                                    
[201] "\n                            \n                            Svalbard and Jan Mayen\n                        "                      
[202] "\n                            \n                            Slovakia\n                        "                                    
[203] "\n                            \n                            Sierra Leone\n                        "                                
[204] "\n                            \n                            San Marino\n                        "                                  
[205] "\n                            \n                            Senegal\n                        "                                     
[206] "\n                            \n                            Somalia\n                        "                                     
[207] "\n                            \n                            Suriname\n                        "                                    
[208] "\n                            \n                            South Sudan\n                        "                                 
[209] "\n                            \n                            São Tomé and Príncipe\n                        "                       
[210] "\n                            \n                            El Salvador\n                        "                                 
[211] "\n                            \n                            Sint Maarten\n                        "                                
[212] "\n                            \n                            Syria\n                        "                                       
[213] "\n                            \n                            Swaziland\n                        "                                   
[214] "\n                            \n                            Turks and Caicos Islands\n                        "                    
[215] "\n                            \n                            Chad\n                        "                                        
[216] "\n                            \n                            French Southern Territories\n                        "                 
[217] "\n                            \n                            Togo\n                        "                                        
[218] "\n                            \n                            Thailand\n                        "                                    
[219] "\n                            \n                            Tajikistan\n                        "                                  
[220] "\n                            \n                            Tokelau\n                        "                                     
[221] "\n                            \n                            East Timor\n                        "                                  
[222] "\n                            \n                            Turkmenistan\n                        "                                
[223] "\n                            \n                            Tunisia\n                        "                                     
[224] "\n                            \n                            Tonga\n                        "                                       
[225] "\n                            \n                            Turkey\n                        "                                      
[226] "\n                            \n                            Trinidad and Tobago\n                        "                         
[227] "\n                            \n                            Tuvalu\n                        "                                      
[228] "\n                            \n                            Taiwan\n                        "                                      
[229] "\n                            \n                            Tanzania\n                        "                                    
[230] "\n                            \n                            Ukraine\n                        "                                     
[231] "\n                            \n                            Uganda\n                        "                                      
[232] "\n                            \n                            U.S. Minor Outlying Islands\n                        "                 
[233] "\n                            \n                            United States\n                        "                               
[234] "\n                            \n                            Uruguay\n                        "                                     
[235] "\n                            \n                            Uzbekistan\n                        "                                  
[236] "\n                            \n                            Vatican City\n                        "                                
[237] "\n                            \n                            Saint Vincent and the Grenadines\n                        "            
[238] "\n                            \n                            Venezuela\n                        "                                   
[239] "\n                            \n                            British Virgin Islands\n                        "                      
[240] "\n                            \n                            U.S. Virgin Islands\n                        "                         
[241] "\n                            \n                            Vietnam\n                        "                                     
[242] "\n                            \n                            Vanuatu\n                        "                                     
[243] "\n                            \n                            Wallis and Futuna\n                        "                           
[244] "\n                            \n                            Samoa\n                        "                                       
[245] "\n                            \n                            Kosovo\n                        "                                      
[246] "\n                            \n                            Yemen\n                        "                                       
[247] "\n                            \n                            Mayotte\n                        "                                     
[248] "\n                            \n                            South Africa\n                        "                                
[249] "\n                            \n                            Zambia\n                        "                                      
[250] "\n                            \n                            Zimbabwe\n                        "                                    
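
Every string printed above still carries the newlines and indentation that surround the country name in the page source. The following is a minimal sketch of two common ways to strip that padding. It assumes rvest 1.0 or later and uses a small hand-written HTML fragment (a hypothetical stand-in for the real page) so that it runs without a network connection.

```r
library(rvest)

# Hypothetical fragment that mimics the layout of two entries on the page
page <- minimal_html('
  <h3 class="country-name">
      <i class="flag-icon flag-icon-ad"></i>
      Andorra
  </h3>
  <h3 class="country-name">
      <i class="flag-icon flag-icon-ae"></i>
      United Arab Emirates
  </h3>')

nodes <- html_elements(page, ".country-name")

# html_text() keeps the newlines and indentation from the source
raw_text <- html_text(nodes)

# Option 1: base R trimws() removes leading and trailing whitespace
trimmed_text <- trimws(raw_text)

# Option 2: ask html_text() to trim while extracting
also_trimmed <- html_text(nodes, trim = TRUE)

trimmed_text
also_trimmed
```

On the real page the same calls would be applied to the nodes selected with `.country-name`. Which option to use is largely a matter of taste.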

26.7 Same thing without pipes

###########################################################################
#
# The same scrape written without magrittr pipes: each intermediate result
# is saved in its own variable and passed to the next function.
#
###########################################################################

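url = "https://scrapethissite.com/pages/simple/"
cssSelector = ".country-name"

# step 1: download and parse the page
whole_html_page = read_html(url)
# step 2: keep only the elements that match the CSS selector
country_name_html = html_elements(whole_html_page, cssSelector)
# step 3: extract the text inside those elements
just_the_text = html_text(country_name_html)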
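
The three assignments above do the same work as a piped chain. Only the way intermediate results are passed along differs. The sketch below assumes rvest 1.0 or later and uses a hypothetical two-entry HTML fragment instead of the live site. It compares the piped, step-by-step, and nested forms.

```r
library(rvest)

# Hypothetical offline fragment standing in for the real page
page <- minimal_html('<h3 class="country-name">Andorra</h3>
                      <h3 class="country-name">Angola</h3>')

# Form 1: piped (rvest re-exports the magrittr pipe %>%)
piped <- page %>% html_elements(".country-name") %>% html_text()

# Form 2: step by step, saving each intermediate result
nodes    <- html_elements(page, ".country-name")
stepwise <- html_text(nodes)

# Form 3: nested function calls, read from the inside out
nested <- html_text(html_elements(page, ".country-name"))

# All three forms should give the same character vector
identical(piped, stepwise)
identical(piped, nested)
```

The step-by-step form is the easiest to debug, because every intermediate object can be printed and inspected on its own.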

#...........................
# Let's examine each part 
#...........................

# the contents of the entire HTML page
whole_html_page      
{html_document}
<html lang="en">
 [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF- ...
 [2] <body>\n    <nav id="site-nav"><div class="container">\n                 ...
 [3] <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery. ...
 [4] <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstra ...
 [5] <script src="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotif ...
 [6] <link href="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify ...
 [7] <script type="text/javascript">\n    \n    PNotify.prototype.options.sty ...
 [8] <script type="text/javascript">\n    $("video").hover(function() {\n     ...
 [9] <script>\n    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r] ...
[10] <script>\n  !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){ ...
[11] <noscript><img height="1" width="1" style="display:none" src="https://ww ...
[12] <script type="text/javascript">\n    /* <![CDATA[ */\n    var google_con ...
[13] <script type="text/javascript" src="//www.googleadservices.com/pagead/co ...
[14] <noscript>\n    <div style="display:inline;">\n    <img height="1" width ...
[15] <script async src="https://www.googletagmanager.com/gtag/js?id=AW-950945 ...
[16] <script>\n   window.dataLayer = window.dataLayer || [];\n   function gta ...
# only the HTML elements that matched cssSelector (tags plus their contents)
country_name_html    
{xml_nodeset (250)}
 [1] <h3 class="country-name">\n                            <i class="flag-ic ...
 [2] <h3 class="country-name">\n                            <i class="flag-ic ...
 [3] <h3 class="country-name">\n                            <i class="flag-ic ...
 [4] <h3 class="country-name">\n                            <i class="flag-ic ...
 [5] <h3 class="country-name">\n                            <i class="flag-ic ...
 [6] <h3 class="country-name">\n                            <i class="flag-ic ...
 [7] <h3 class="country-name">\n                            <i class="flag-ic ...
 [8] <h3 class="country-name">\n                            <i class="flag-ic ...
 [9] <h3 class="country-name">\n                            <i class="flag-ic ...
[10] <h3 class="country-name">\n                            <i class="flag-ic ...
[11] <h3 class="country-name">\n                            <i class="flag-ic ...
[12] <h3 class="country-name">\n                            <i class="flag-ic ...
[13] <h3 class="country-name">\n                            <i class="flag-ic ...
[14] <h3 class="country-name">\n                            <i class="flag-ic ...
[15] <h3 class="country-name">\n                            <i class="flag-ic ...
[16] <h3 class="country-name">\n                            <i class="flag-ic ...
[17] <h3 class="country-name">\n                            <i class="flag-ic ...
[18] <h3 class="country-name">\n                            <i class="flag-ic ...
[19] <h3 class="country-name">\n                            <i class="flag-ic ...
[20] <h3 class="country-name">\n                            <i class="flag-ic ...
...
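
A node set like the one printed above can be queried directly before any text is extracted. The following sketch uses a hypothetical offline fragment and assumes rvest 1.0 or later. It shows a few inspection calls that may help confirm that a selector matched what was intended.

```r
library(rvest)

# Hypothetical fragment with the same tag and class as the real entries
page <- minimal_html('
  <h3 class="country-name"><i class="flag-icon"></i> Andorra</h3>
  <h3 class="country-name"><i class="flag-icon"></i> Angola</h3>')

nodes <- html_elements(page, ".country-name")

n_matches   <- length(nodes)             # how many elements matched
tag_names   <- html_name(nodes)          # tag name of each match
class_attrs <- html_attr(nodes, "class") # value of the class attribute

# [[ ]] pulls out a single node, which can be processed on its own
first_text <- html_text(nodes[[1]], trim = TRUE)

n_matches
tag_names
class_attrs
first_text
```

If the number of matches differs from what the page shows, the selector is probably too broad or too narrow.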
# html_text() strips the start and end tags, leaving just the text.
# Notice again that the result keeps the newlines (\n) and spaces from the .html file.
just_the_text  
  [1] "\n                            \n                            Andorra\n                        "                                     
  [2] "\n                            \n                            United Arab Emirates\n                        "                        
  [3] "\n                            \n                            Afghanistan\n                        "                                 
  [4] "\n                            \n                            Antigua and Barbuda\n                        "                         
  [5] "\n                            \n                            Anguilla\n                        "                                    
  [6] "\n                            \n                            Albania\n                        "                                     
  [7] "\n                            \n                            Armenia\n                        "                                     
  [8] "\n                            \n                            Angola\n                        "                                      
  [9] "\n                            \n                            Antarctica\n                        "                                  
 [10] "\n                            \n                            Argentina\n                        "                                   
 [11] "\n                            \n                            American Samoa\n                        "                              
 [12] "\n                            \n                            Austria\n                        "                                     
 [13] "\n                            \n                            Australia\n                        "                                   
 [14] "\n                            \n                            Aruba\n                        "                                       
 [15] "\n                            \n                            Åland\n                        "                                       
 [16] "\n                            \n                            Azerbaijan\n                        "                                  
 [17] "\n                            \n                            Bosnia and Herzegovina\n                        "                      
 [18] "\n                            \n                            Barbados\n                        "                                    
 [19] "\n                            \n                            Bangladesh\n                        "                                  
 [20] "\n                            \n                            Belgium\n                        "                                     
 [21] "\n                            \n                            Burkina Faso\n                        "                                
 [22] "\n                            \n                            Bulgaria\n                        "                                    
 [23] "\n                            \n                            Bahrain\n                        "                                     
 [24] "\n                            \n                            Burundi\n                        "                                     
 [25] "\n                            \n                            Benin\n                        "                                       
 [26] "\n                            \n                            Saint Barthélemy\n                        "                            
 [27] "\n                            \n                            Bermuda\n                        "                                     
 [28] "\n                            \n                            Brunei\n                        "                                      
 [29] "\n                            \n                            Bolivia\n                        "                                     
 [30] "\n                            \n                            Bonaire\n                        "                                     
 [31] "\n                            \n                            Brazil\n                        "                                      
 [32] "\n                            \n                            Bahamas\n                        "                                     
 [33] "\n                            \n                            Bhutan\n                        "                                      
 [34] "\n                            \n                            Bouvet Island\n                        "                               
 [35] "\n                            \n                            Botswana\n                        "                                    
 [36] "\n                            \n                            Belarus\n                        "                                     
 [37] "\n                            \n                            Belize\n                        "                                      
 [38] "\n                            \n                            Canada\n                        "                                      
 [39] "\n                            \n                            Cocos [Keeling] Islands\n                        "                     
 [40] "\n                            \n                            Democratic Republic of the Congo\n                        "            
 [41] "\n                            \n                            Central African Republic\n                        "                    
 [42] "\n                            \n                            Republic of the Congo\n                        "                       
 [43] "\n                            \n                            Switzerland\n                        "                                 
 [44] "\n                            \n                            Ivory Coast\n                        "                                 
 [45] "\n                            \n                            Cook Islands\n                        "                                
 [46] "\n                            \n                            Chile\n                        "                                       
 [47] "\n                            \n                            Cameroon\n                        "                                    
 [48] "\n                            \n                            China\n                        "                                       
 [49] "\n                            \n                            Colombia\n                        "                                    
 [50] "\n                            \n                            Costa Rica\n                        "                                  
 [51] "\n                            \n                            Cuba\n                        "                                        
 [52] "\n                            \n                            Cape Verde\n                        "                                  
 [53] "\n                            \n                            Curacao\n                        "                                     
 [54] "\n                            \n                            Christmas Island\n                        "                            
 [55] "\n                            \n                            Cyprus\n                        "                                      
 [56] "\n                            \n                            Czech Republic\n                        "                              
 [57] "\n                            \n                            Germany\n                        "                                     
 [58] "\n                            \n                            Djibouti\n                        "                                    
 [59] "\n                            \n                            Denmark\n                        "                                     
 [60] "\n                            \n                            Dominica\n                        "                                    
 [61] "\n                            \n                            Dominican Republic\n                        "                          
 [62] "\n                            \n                            Algeria\n                        "                                     
 [63] "\n                            \n                            Ecuador\n                        "                                     
 [64] "\n                            \n                            Estonia\n                        "                                     
 [65] "\n                            \n                            Egypt\n                        "                                       
 [66] "\n                            \n                            Western Sahara\n                        "                              
 [67] "\n                            \n                            Eritrea\n                        "                                     
 [68] "\n                            \n                            Spain\n                        "                                       
 [69] "\n                            \n                            Ethiopia\n                        "                                    
 [70] "\n                            \n                            Finland\n                        "                                     
 [71] "\n                            \n                            Fiji\n                        "                                        
 [72] "\n                            \n                            Falkland Islands\n                        "                            
 [73] "\n                            \n                            Micronesia\n                        "                                  
 [74] "\n                            \n                            Faroe Islands\n                        "                               
 [75] "\n                            \n                            France\n                        "                                      
 [76] "\n                            \n                            Gabon\n                        "                                       
 [77] "\n                            \n                            United Kingdom\n                        "                              
 [78] "\n                            \n                            Grenada\n                        "                                     
 [79] "\n                            \n                            Georgia\n                        "                                     
 [80] "\n                            \n                            French Guiana\n                        "                               
 [81] "\n                            \n                            Guernsey\n                        "                                    
 [82] "\n                            \n                            Ghana\n                        "                                       
 [83] "\n                            \n                            Gibraltar\n                        "                                   
 [84] "\n                            \n                            Greenland\n                        "                                   
 [85] "\n                            \n                            Gambia\n                        "                                      
 [86] "\n                            \n                            Guinea\n                        "                                      
 [87] "\n                            \n                            Guadeloupe\n                        "                                  
 [88] "\n                            \n                            Equatorial Guinea\n                        "                           
 [89] "\n                            \n                            Greece\n                        "                                      
 [90] "\n                            \n                            South Georgia and the South Sandwich Islands\n                        "
 [91] "\n                            \n                            Guatemala\n                        "                                   
 [92] "\n                            \n                            Guam\n                        "                                        
 [93] "\n                            \n                            Guinea-Bissau\n                        "                               
 [94] "\n                            \n                            Guyana\n                        "                                      
 [95] "\n                            \n                            Hong Kong\n                        "                                   
 [96] "\n                            \n                            Heard Island and McDonald Islands\n                        "           
 [97] "\n                            \n                            Honduras\n                        "                                    
 [98] "\n                            \n                            Croatia\n                        "                                     
 [99] "\n                            \n                            Haiti\n                        "                                       
[100] "\n                            \n                            Hungary\n                        "                                     
[101] "\n                            \n                            Indonesia\n                        "                                   
[102] "\n                            \n                            Ireland\n                        "                                     
[103] "\n                            \n                            Israel\n                        "                                      
[104] "\n                            \n                            Isle of Man\n                        "                                 
[105] "\n                            \n                            India\n                        "                                       
[106] "\n                            \n                            British Indian Ocean Territory\n                        "              
[107] "\n                            \n                            Iraq\n                        "                                        
[108] "\n                            \n                            Iran\n                        "                                        
[109] "\n                            \n                            Iceland\n                        "                                     
[110] "\n                            \n                            Italy\n                        "                                       
[111] "\n                            \n                            Jersey\n                        "                                      
[112] "\n                            \n                            Jamaica\n                        "                                     
[113] "\n                            \n                            Jordan\n                        "                                      
[114] "\n                            \n                            Japan\n                        "                                       
[115] "\n                            \n                            Kenya\n                        "                                       
[116] "\n                            \n                            Kyrgyzstan\n                        "                                  
[117] "\n                            \n                            Cambodia\n                        "                                    
[118] "\n                            \n                            Kiribati\n                        "                                    
[119] "\n                            \n                            Comoros\n                        "                                     
[120] "\n                            \n                            Saint Kitts and Nevis\n                        "                       
[121] "\n                            \n                            North Korea\n                        "                                 
[122] "\n                            \n                            South Korea\n                        "                                 
[123] "\n                            \n                            Kuwait\n                        "                                      
[124] "\n                            \n                            Cayman Islands\n                        "                              
[125] "\n                            \n                            Kazakhstan\n                        "                                  
[126] "\n                            \n                            Laos\n                        "                                        
[127] "\n                            \n                            Lebanon\n                        "                                     
[128] "\n                            \n                            Saint Lucia\n                        "                                 
[129] "\n                            \n                            Liechtenstein\n                        "                               
[130] "\n                            \n                            Sri Lanka\n                        "                                   
[131] "\n                            \n                            Liberia\n                        "                                     
[132] "\n                            \n                            Lesotho\n                        "                                     
[133] "\n                            \n                            Lithuania\n                        "                                   
[134] "\n                            \n                            Luxembourg\n                        "                                  
[135] "\n                            \n                            Latvia\n                        "                                      
[136] "\n                            \n                            Libya\n                        "                                       
[137] "\n                            \n                            Morocco\n                        "                                     
[138] "\n                            \n                            Monaco\n                        "                                      
[139] "\n                            \n                            Moldova\n                        "                                     
[140] "\n                            \n                            Montenegro\n                        "                                  
[141] "\n                            \n                            Saint Martin\n                        "                                
[142] "\n                            \n                            Madagascar\n                        "                                  
[143] "\n                            \n                            Marshall Islands\n                        "                            
[144] "\n                            \n                            Macedonia\n                        "                                   
[145] "\n                            \n                            Mali\n                        "                                        
[146] "\n                            \n                            Myanmar [Burma]\n                        "                             
[147] "\n                            \n                            Mongolia\n                        "                                    
[148] "\n                            \n                            Macao\n                        "                                       
[149] "\n                            \n                            Northern Mariana Islands\n                        "                    
[150] "\n                            \n                            Martinique\n                        "                                  
[151] "\n                            \n                            Mauritania\n                        "                                  
[152] "\n                            \n                            Montserrat\n                        "                                  
[153] "\n                            \n                            Malta\n                        "                                       
[154] "\n                            \n                            Mauritius\n                        "                                   
[155] "\n                            \n                            Maldives\n                        "                                    
[156] "\n                            \n                            Malawi\n                        "                                      
[157] "\n                            \n                            Mexico\n                        "                                      
[158] "\n                            \n                            Malaysia\n                        "                                    
[159] "\n                            \n                            Mozambique\n                        "                                  
[160] "\n                            \n                            Namibia\n                        "                                     
[161] "\n                            \n                            New Caledonia\n                        "                               
[162] "\n                            \n                            Niger\n                        "                                       
[163] "\n                            \n                            Norfolk Island\n                        "                              
[164] "\n                            \n                            Nigeria\n                        "                                     
[165] "\n                            \n                            Nicaragua\n                        "                                   
[166] "\n                            \n                            Netherlands\n                        "                                 
[167] "\n                            \n                            Norway\n                        "                                      
[168] "\n                            \n                            Nepal\n                        "                                       
[169] "\n                            \n                            Nauru\n                        "                                       
[170] "\n                            \n                            Niue\n                        "                                        
[171] "\n                            \n                            New Zealand\n                        "                                 
[172] "\n                            \n                            Oman\n                        "                                        
[173] "\n                            \n                            Panama\n                        "                                      
[174] "\n                            \n                            Peru\n                        "                                        
[175] "\n                            \n                            French Polynesia\n                        "                            
[176] "\n                            \n                            Papua New Guinea\n                        "                            
[177] "\n                            \n                            Philippines\n                        "                                 
[178] "\n                            \n                            Pakistan\n                        "                                    
[179] "\n                            \n                            Poland\n                        "                                      
[180] "\n                            \n                            Saint Pierre and Miquelon\n                        "                   
[181] "\n                            \n                            Pitcairn Islands\n                        "                            
[182] "\n                            \n                            Puerto Rico\n                        "                                 
[183] "\n                            \n                            Palestine\n                        "                                   
[184] "\n                            \n                            Portugal\n                        "                                    
[185] "\n                            \n                            Palau\n                        "                                       
[186] "\n                            \n                            Paraguay\n                        "                                    
[187] "\n                            \n                            Qatar\n                        "                                       
[188] "\n                            \n                            Réunion\n                        "                                     
[189] "\n                            \n                            Romania\n                        "                                     
[190] "\n                            \n                            Serbia\n                        "                                      
[191] "\n                            \n                            Russia\n                        "                                      
[192] "\n                            \n                            Rwanda\n                        "                                      
[193] "\n                            \n                            Saudi Arabia\n                        "                                
[194] "\n                            \n                            Solomon Islands\n                        "                             
[195] "\n                            \n                            Seychelles\n                        "                                  
[196] "\n                            \n                            Sudan\n                        "                                       
[197] "\n                            \n                            Sweden\n                        "                                      
[198] "\n                            \n                            Singapore\n                        "                                   
[199] "\n                            \n                            Saint Helena\n                        "                                
[200] "\n                            \n                            Slovenia\n                        "                                    
[201] "\n                            \n                            Svalbard and Jan Mayen\n                        "                      
[202] "\n                            \n                            Slovakia\n                        "                                    
[203] "\n                            \n                            Sierra Leone\n                        "                                
[204] "\n                            \n                            San Marino\n                        "                                  
[205] "\n                            \n                            Senegal\n                        "                                     
[206] "\n                            \n                            Somalia\n                        "                                     
[207] "\n                            \n                            Suriname\n                        "                                    
[208] "\n                            \n                            South Sudan\n                        "                                 
[209] "\n                            \n                            São Tomé and Príncipe\n                        "                       
[210] "\n                            \n                            El Salvador\n                        "                                 
[211] "\n                            \n                            Sint Maarten\n                        "                                
[212] "\n                            \n                            Syria\n                        "                                       
[213] "\n                            \n                            Swaziland\n                        "                                   
[214] "\n                            \n                            Turks and Caicos Islands\n                        "                    
[215] "\n                            \n                            Chad\n                        "                                        
[216] "\n                            \n                            French Southern Territories\n                        "                 
[217] "\n                            \n                            Togo\n                        "                                        
[218] "\n                            \n                            Thailand\n                        "                                    
[219] "\n                            \n                            Tajikistan\n                        "                                  
[220] "\n                            \n                            Tokelau\n                        "                                     
[221] "\n                            \n                            East Timor\n                        "                                  
[222] "\n                            \n                            Turkmenistan\n                        "                                
[223] "\n                            \n                            Tunisia\n                        "                                     
[224] "\n                            \n                            Tonga\n                        "                                       
[225] "\n                            \n                            Turkey\n                        "                                      
[226] "\n                            \n                            Trinidad and Tobago\n                        "                         
[227] "\n                            \n                            Tuvalu\n                        "                                      
[228] "\n                            \n                            Taiwan\n                        "                                      
[229] "\n                            \n                            Tanzania\n                        "                                    
[230] "\n                            \n                            Ukraine\n                        "                                     
[231] "\n                            \n                            Uganda\n                        "                                      
[232] "\n                            \n                            U.S. Minor Outlying Islands\n                        "                 
[233] "\n                            \n                            United States\n                        "                               
[234] "\n                            \n                            Uruguay\n                        "                                     
[235] "\n                            \n                            Uzbekistan\n                        "                                  
[236] "\n                            \n                            Vatican City\n                        "                                
[237] "\n                            \n                            Saint Vincent and the Grenadines\n                        "            
[238] "\n                            \n                            Venezuela\n                        "                                   
[239] "\n                            \n                            British Virgin Islands\n                        "                      
[240] "\n                            \n                            U.S. Virgin Islands\n                        "                         
[241] "\n                            \n                            Vietnam\n                        "                                     
[242] "\n                            \n                            Vanuatu\n                        "                                     
[243] "\n                            \n                            Wallis and Futuna\n                        "                           
[244] "\n                            \n                            Samoa\n                        "                                       
[245] "\n                            \n                            Kosovo\n                        "                                      
[246] "\n                            \n                            Yemen\n                        "                                       
[247] "\n                            \n                            Mayotte\n                        "                                     
[248] "\n                            \n                            South Africa\n                        "                                
[249] "\n                            \n                            Zambia\n                        "                                      
[250] "\n                            \n                            Zimbabwe\n                        "                                    
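Every name in the output above is padded with newlines and spaces carried over from the page's HTML source. The following is a minimal sketch of two ways to strip that padding. It assumes a small hand-written page: `sample_page` is a hypothetical stand-in for the real site, and the `.country-name` selector is an assumption about what `cssSelector` holds.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the example runs offline.
sample_page <- minimal_html('
  <h3 class="country-name">
      <i class="flag-icon"></i>
      Fiji
  </h3>
  <h3 class="country-name">
      <i class="flag-icon"></i>
      Falkland Islands
  </h3>')

nodes <- html_elements(sample_page, ".country-name")

# Raw text keeps the surrounding newlines and spaces
raw_text <- html_text(nodes)

# Option 1: strip leading/trailing whitespace after extraction
trimmed_a <- trimws(raw_text)

# Option 2: ask html_text() to trim while extracting
trimmed_b <- html_text(nodes, trim = TRUE)

print(trimmed_a)
print(trimmed_b)
```

Both options should give the same clean character vector; which one to use is largely a matter of taste.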
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Let's look a little closer at the contents of what is returned by read_html
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

# We already ran this code above; it is repeated here for review.
# (cssSelector is the selector string used earlier in this file.)
url = "https://scrapethissite.com/pages/simple/"
whole_html_page = read_html(url)
country_name_html = html_elements(whole_html_page, cssSelector)
just_the_text = html_text(country_name_html)

# What is the structure of this data?
str(whole_html_page) # list of 2 externalptr objects (see below)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
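Because `str()` shows only two external pointers, it says little about what the page actually contains. The sketch below shows a few ways to look inside such an object, assuming a tiny hand-written page (`page` is a hypothetical stand-in for `whole_html_page`) and the helper functions from the xml2 package.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for whole_html_page, so the example runs offline.
page <- minimal_html("<p>Hello</p><p>World</p>")

# Underneath the class attribute, the object is a list of two pointers
parts <- unclass(page)
print(names(parts))        # component names of the underlying list
print(typeof(parts$node))  # storage type of one component

# xml2 helpers follow the pointers and report what is really there
root_name <- xml_name(page)                  # tag name of the root element
top_level <- xml_name(xml_children(page))    # tag names directly under the root
page_source <- as.character(page)            # the document as one HTML string

print(root_name)
print(top_level)
```

The pointers refer to data held outside of R's usual vectors and lists, which is why functions such as `xml_name()` or `html_elements()` are needed to get at the content.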
# What is the class of this data? 
class(whole_html_page)   # "xml_document" "xml_node"
[1] "xml_document" "xml_node"    
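The class vector matters because `print()` is an S3 generic: R picks a print method based on the object's classes. The full `methods(print)` listing is long, so the sketch below filters it down to the xml-related entries. It assumes a small hand-written page (`page` is a hypothetical stand-in for `whole_html_page`).

```r
library(rvest)

# Hypothetical stand-in for whole_html_page, so the example runs offline.
page <- minimal_html("<p>Hello</p>")

# Keep only the print methods whose class name starts with "xml_"
all_print_methods <- as.character(methods(print))
xml_print_methods <- grep("^print\\.xml_", all_print_methods, value = TRUE)
print(xml_print_methods)

# Look up the method registered for the first class of the object.
# optional = TRUE returns NULL instead of an error if none is found.
first_method <- getS3method("print", class(page)[1], optional = TRUE)
print(is.function(first_method))
```

Calling `methods(class = "xml_document")` is another way to see which generics have a method for this class.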
methods(print)    # the long list below includes print.xml_document* and print.xml_node*
  [1] print.acf*                                          
  [2] print.activeConcordance*                            
  [3] print.AES*                                          
  [4] print.anova*                                        
  [5] print.aov*                                          
  [6] print.aovlist*                                      
  [7] print.ar*                                           
  [8] print.Arima*                                        
  [9] print.arima0*                                       
 [10] print.AsIs                                          
 [11] print.aspell*                                       
 [12] print.aspell_inspect_context*                       
 [13] print.bibentry*                                     
 [14] print.Bibtex*                                       
 [15] print.browseVignettes*                              
 [16] print.by                                            
 [17] print.cache_info*                                   
 [18] print.changedFiles*                                 
 [19] print.check_bogus_return*                           
 [20] print.check_code_usage_in_package*                  
 [21] print.check_compiled_code*                          
 [22] print.check_demo_index*                             
 [23] print.check_depdef*                                 
 [24] print.check_details*                                
 [25] print.check_details_changes*                        
 [26] print.check_doi_db*                                 
 [27] print.check_dotInternal*                            
 [28] print.check_make_vars*                              
 [29] print.check_nonAPI_calls*                           
 [30] print.check_package_code_assign_to_globalenv*       
 [31] print.check_package_code_attach*                    
 [32] print.check_package_code_data_into_globalenv*       
 [33] print.check_package_code_startup_functions*         
 [34] print.check_package_code_syntax*                    
 [35] print.check_package_code_unload_functions*          
 [36] print.check_package_compact_datasets*               
 [37] print.check_package_CRAN_incoming*                  
 [38] print.check_package_datalist*                       
 [39] print.check_package_datasets*                       
 [40] print.check_package_depends*                        
 [41] print.check_package_description*                    
 [42] print.check_package_description_encoding*           
 [43] print.check_package_license*                        
 [44] print.check_packages_in_dir*                        
 [45] print.check_packages_used*                          
 [46] print.check_po_files*                               
 [47] print.check_pragmas*                                
 [48] print.check_Rd_line_widths*                         
 [49] print.check_Rd_metadata*                            
 [50] print.check_Rd_xrefs*                               
 [51] print.check_RegSym_calls*                           
 [52] print.check_S3_methods_needing_delayed_registration*
 [53] print.check_so_symbols*                             
 [54] print.check_T_and_F*                                
 [55] print.check_url_db*                                 
 [56] print.check_vignette_index*                         
 [57] print.checkDocFiles*                                
 [58] print.checkDocStyle*                                
 [59] print.checkFF*                                      
 [60] print.checkRd*                                      
 [61] print.checkRdContents*                              
 [62] print.checkReplaceFuns*                             
 [63] print.checkS3methods*                               
 [64] print.checkTnF*                                     
 [65] print.checkVignettes*                               
 [66] print.citation*                                     
 [67] print.cli_ansi_html_style*                          
 [68] print.cli_ansi_string*                              
 [69] print.cli_ansi_style*                               
 [70] print.cli_boxx*                                     
 [71] print.cli_diff_chr*                                 
 [72] print.cli_doc*                                      
 [73] print.cli_progress_demo*                            
 [74] print.cli_rule*                                     
 [75] print.cli_sitrep*                                   
 [76] print.cli_spark*                                    
 [77] print.cli_spinner*                                  
 [78] print.cli_tree*                                     
 [79] print.codoc*                                        
 [80] print.codocClasses*                                 
 [81] print.codocData*                                    
 [82] print.colorConverter*                               
 [83] print.compactPDF*                                   
 [84] print.condition                                     
 [85] print.connection                                    
 [86] print.CRAN_package_reverse_dependencies_and_views*  
 [87] print.curl_handle*                                  
 [88] print.curl_multi*                                   
 [89] print.data.frame                                    
 [90] print.Date                                          
 [91] print.default                                       
 [92] print.dendrogram*                                   
 [93] print.density*                                      
 [94] print.difftime                                      
 [95] print.dist*                                         
 [96] print.Dlist                                         
 [97] print.DLLInfo                                       
 [98] print.DLLInfoList                                   
 [99] print.DLLRegisteredRoutines                         
[100] print.document_context*                             
[101] print.document_position*                            
[102] print.document_range*                               
[103] print.document_selection*                           
[104] print.dummy_coef*                                   
[105] print.dummy_coef_list*                              
[106] print.ecdf*                                         
[107] print.eigen                                         
[108] print.factanal*                                     
[109] print.factor                                        
[110] print.family*                                       
[111] print.fileSnapshot*                                 
[112] print.findLineNumResult*                            
[113] print.form_data*                                    
[114] print.form_file*                                    
[115] print.formula*                                      
[116] print.fseq*                                         
[117] print.ftable*                                       
[118] print.function                                      
[119] print.getAnywhere*                                  
[120] print.glm*                                          
[121] print.glue*                                         
[122] print.handle*                                       
[123] print.hashtab*                                      
[124] print.hclust*                                       
[125] print.help_files_with_topic*                        
[126] print.hexmode                                       
[127] print.HoltWinters*                                  
[128] print.hsearch*                                      
[129] print.hsearch_db*                                   
[130] print.htest*                                        
[131] print.html*                                         
[132] print.html_dependency*                              
[133] print.htmltools.selector*                           
[134] print.htmltools.selector.list*                      
[135] print.htmlwidget*                                   
[136] print.infl*                                         
[137] print.integrate*                                    
[138] print.isoreg*                                       
[139] print.json*                                         
[140] print.key_missing*                                  
[141] print.kmeans*                                       
[142] print.knitr_kable*                                  
[143] print.Latex*                                        
[144] print.LaTeX*                                        
[145] print.libraryIQR                                    
[146] print.lifecycle_warnings*                           
[147] print.listof                                        
[148] print.lm*                                           
[149] print.loadings*                                     
[150] print.loess*                                        
[151] print.logLik*                                       
[152] print.ls_str*                                       
[153] print.medpolish*                                    
[154] print.MethodsFunction*                              
[155] print.mtable*                                       
[156] print.NativeRoutineList                             
[157] print.news_db*                                      
[158] print.nls*                                          
[159] print.noquote                                       
[160] print.numeric_version                               
[161] print.oauth_app*                                    
[162] print.oauth_endpoint*                               
[163] print.object_size*                                  
[164] print.octmode                                       
[165] print.opts_list*                                    
[166] print.packageDescription*                           
[167] print.packageInfo                                   
[168] print.packageIQR*                                   
[169] print.packageStatus*                                
[170] print.pairwise.htest*                               
[171] print.person*                                       
[172] print.POSIXct                                       
[173] print.POSIXlt                                       
[174] print.power.htest*                                  
[175] print.ppr*                                          
[176] print.prcomp*                                       
[177] print.princomp*                                     
[178] print.proc_time                                     
[179] print.quosure*                                      
[180] print.quosures*                                     
[181] print.R6*                                           
[182] print.R6ClassGenerator*                             
[183] print.raster*                                       
[184] print.Rconcordance*                                 
[185] print.Rd*                                           
[186] print.recordedplot*                                 
[187] print.request*                                      
[188] print.response*                                     
[189] print.restart                                       
[190] print.RGBcolorConverter*                            
[191] print.RGlyphFont*                                   
[192] print.rlang:::list_of_conditions*                   
[193] print.rlang_box_done*                               
[194] print.rlang_box_splice*                             
[195] print.rlang_data_pronoun*                           
[196] print.rlang_dict*                                   
[197] print.rlang_dyn_array*                              
[198] print.rlang_envs*                                   
[199] print.rlang_error*                                  
[200] print.rlang_fake_data_pronoun*                      
[201] print.rlang_lambda_function*                        
[202] print.rlang_message*                                
[203] print.rlang_trace*                                  
[204] print.rlang_warning*                                
[205] print.rlang_zap*                                    
[206] print.rle                                           
[207] print.rlib_bytes*                                   
[208] print.rlib_error_3_0*                               
[209] print.rlib_trace_3_0*                               
[210] print.roman*                                        
[211] print.rvest_field*                                  
[212] print.rvest_form*                                   
[213] print.rvest_session*                                
[214] print.SavedPlots*                                   
[215] print.scalar*                                       
[216] print.sessionInfo*                                  
[217] print.shiny.tag*                                    
[218] print.shiny.tag.env*                                
[219] print.shiny.tag.list*                               
[220] print.shiny.tag.query*                              
[221] print.simple.list                                   
[222] print.smooth.spline*                                
[223] print.socket*                                       
[224] print.srcfile                                       
[225] print.srcref                                        
[226] print.stepfun*                                      
[227] print.stl*                                          
[228] print.stringr_view*                                 
[229] print.StructTS*                                     
[230] print.subdir_tests*                                 
[231] print.summarize_CRAN_check_status*                  
[232] print.summary.aov*                                  
[233] print.summary.aovlist*                              
[234] print.summary.ecdf*                                 
[235] print.summary.glm*                                  
[236] print.summary.lm*                                   
[237] print.summary.loess*                                
[238] print.summary.manova*                               
[239] print.summary.nls*                                  
[240] print.summary.packageStatus*                        
[241] print.summary.ppr*                                  
[242] print.summary.prcomp*                               
[243] print.summary.princomp*                             
[244] print.summary.table                                 
[245] print.summary.warnings                              
[246] print.summaryDefault                                
[247] print.suppress_viewer*                              
[248] print.table                                         
[249] print.tables_aov*                                   
[250] print.terms*                                        
[251] print.ts*                                           
[252] print.tskernel*                                     
[253] print.TukeyHSD*                                     
[254] print.tukeyline*                                    
[255] print.tukeysmooth*                                  
[256] print.undoc*                                        
[257] print.vctrs_bytes*                                  
[258] print.vctrs_sclr*                                   
[259] print.vctrs_unspecified*                            
[260] print.vctrs_vctr*                                   
[261] print.vignette*                                     
[262] print.warnings                                      
[263] print.xfun_raw_string*                              
[264] print.xfun_rename_seq*                              
[265] print.xfun_strict_list*                             
[266] print.xgettext*                                     
[267] print.xml_document*                                 
[268] print.xml_missing*                                  
[269] print.xml_namespace*                                
[270] print.xml_node*                                     
[271] print.xml_nodeset*                                  
[272] print.xngettext*                                    
[273] print.xtabs*                                        
see '?methods' for accessing help and source code
attributes(whole_html_page)
$names
[1] "node" "doc" 

$class
[1] "xml_document" "xml_node"    
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# An externalptr ("external pointer") cannot be inspected directly from R
# code. You can think of what it points to as a "blob" (a "binary large
# object"), i.e. a piece of data that is only accessible via the C language
# (not R) code that some packages use to extend R's functionality. To work
# with this data, use the built-in functions of the xml2 and rvest packages.
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
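
The point made in the comment above can be tried out on a very small example. This is only a sketch: it assumes the rvest package (version 1.0.0 or later, which provides html_elements) is installed, and the inline HTML and the class name country-name are invented for illustration rather than taken from the page scraped in this chapter.

```r
library(rvest)

# A tiny, made-up HTML document (hypothetical stand-in for a real page)
tiny_page <- read_html(
  "<html><body><h3 class='country-name'>Andorra</h3><h3 class='country-name'>Albania</h3></body></html>"
)

# The parsed page is a list-like object holding two external pointers.
# Inspecting the type of one of them should report "externalptr";
# there is nothing more that plain R code can see inside it.
typeof(unclass(tiny_page)$node)

# To get at the data, go through the package functions instead:
# select the elements with a CSS selector, then extract their text.
country_elements <- html_elements(tiny_page, "h3.country-name")
country_names <- html_text(country_elements)
country_names
```

The design choice here is simply to never touch the pointers yourself: every step, from selecting elements to extracting text, goes through an rvest or xml2 function that knows how to read the underlying data.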

26.8 Working with the data you scraped

#.....................................................................
#
# We can remove the extra newlines and blanks with the gsub function.
# 
#.....................................................................
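
As a small illustration of the cleanup described above, here is a sketch that uses only base R. The vector raw_text is a made-up stand-in for scraped text (it is not the just_the_text variable used in this chapter), and the second gsub pattern and the trimws call are one possible way to remove the blanks, not necessarily the approach taken later in the chapter.

```r
# Hypothetical stand-in for scraped text: newlines plus padding spaces
raw_text <- c("\n        Andorra\n    ",
              "\n        United Arab Emirates\n    ")

# Step 1: remove the newline characters. In a regular expression,
# "\\n" matches a newline; the padding spaces are still there afterwards.
no_newlines <- gsub("\\n", "", raw_text)

# Step 2: remove runs of spaces at the start (^ +) or the end ( +$)
# of each string. Spaces between words are left alone.
trimmed <- gsub("^ +| +$", "", no_newlines)
trimmed

# Base R's trimws() strips leading and trailing whitespace
# (spaces, tabs, newlines) in a single call.
trimmed_alt <- trimws(raw_text)
trimmed_alt
```

Doing the cleanup in two gsub steps makes it easy to see what each pattern removes; trimws is the shorter option when only the leading and trailing whitespace needs to go.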

?gsub   # see the documentation
# "\\n" is the regex for a newline; this removes only the newlines (spaces remain)
without_backslash_n = gsub("\\n", "", just_the_text)
without_backslash_n
  [1] "                                                        Andorra                        "                                     
  [2] "                                                        United Arab Emirates                        "                        
  [3] "                                                        Afghanistan                        "                                 
  [4] "                                                        Antigua and Barbuda                        "                         
  [5] "                                                        Anguilla                        "                                    
  [6] "                                                        Albania                        "                                     
  [7] "                                                        Armenia                        "                                     
  [8] "                                                        Angola                        "                                      
  [9] "                                                        Antarctica                        "                                  
 [10] "                                                        Argentina                        "                                   
 [11] "                                                        American Samoa                        "                              
 [12] "                                                        Austria                        "                                     
 [13] "                                                        Australia                        "                                   
 [14] "                                                        Aruba                        "                                       
 [15] "                                                        Åland                        "                                       
 [16] "                                                        Azerbaijan                        "                                  
 [17] "                                                        Bosnia and Herzegovina                        "                      
 [18] "                                                        Barbados                        "                                    
 [19] "                                                        Bangladesh                        "                                  
 [20] "                                                        Belgium                        "                                     
 [21] "                                                        Burkina Faso                        "                                
 [22] "                                                        Bulgaria                        "                                    
 [23] "                                                        Bahrain                        "                                     
 [24] "                                                        Burundi                        "                                     
 [25] "                                                        Benin                        "                                       
 [26] "                                                        Saint Barthélemy                        "                            
 [27] "                                                        Bermuda                        "                                     
 [28] "                                                        Brunei                        "                                      
 [29] "                                                        Bolivia                        "                                     
 [30] "                                                        Bonaire                        "                                     
 [31] "                                                        Brazil                        "                                      
 [32] "                                                        Bahamas                        "                                     
 [33] "                                                        Bhutan                        "                                      
 [34] "                                                        Bouvet Island                        "                               
 [35] "                                                        Botswana                        "                                    
 [36] "                                                        Belarus                        "                                     
 [37] "                                                        Belize                        "                                      
 [38] "                                                        Canada                        "                                      
 [39] "                                                        Cocos [Keeling] Islands                        "                     
 [40] "                                                        Democratic Republic of the Congo                        "            
 [41] "                                                        Central African Republic                        "                    
 [42] "                                                        Republic of the Congo                        "                       
 [43] "                                                        Switzerland                        "                                 
 [44] "                                                        Ivory Coast                        "                                 
 [45] "                                                        Cook Islands                        "                                
 [46] "                                                        Chile                        "                                       
 [47] "                                                        Cameroon                        "                                    
 [48] "                                                        China                        "                                       
 [49] "                                                        Colombia                        "                                    
 [50] "                                                        Costa Rica                        "                                  
 [51] "                                                        Cuba                        "                                        
 [52] "                                                        Cape Verde                        "                                  
 [53] "                                                        Curacao                        "                                     
 [54] "                                                        Christmas Island                        "                            
 [55] "                                                        Cyprus                        "                                      
 [56] "                                                        Czech Republic                        "                              
 [57] "                                                        Germany                        "                                     
 [58] "                                                        Djibouti                        "                                    
 [59] "                                                        Denmark                        "                                     
 [60] "                                                        Dominica                        "                                    
 [61] "                                                        Dominican Republic                        "                          
 [62] "                                                        Algeria                        "                                     
 [63] "                                                        Ecuador                        "                                     
 [64] "                                                        Estonia                        "                                     
 [65] "                                                        Egypt                        "                                       
 [66] "                                                        Western Sahara                        "                              
 [67] "                                                        Eritrea                        "                                     
 [68] "                                                        Spain                        "                                       
 [69] "                                                        Ethiopia                        "                                    
 [70] "                                                        Finland                        "                                     
 [71] "                                                        Fiji                        "                                        
 [72] "                                                        Falkland Islands                        "                            
 [73] "                                                        Micronesia                        "                                  
 [74] "                                                        Faroe Islands                        "                               
 [75] "                                                        France                        "                                      
 [76] "                                                        Gabon                        "                                       
 [77] "                                                        United Kingdom                        "                              
 [78] "                                                        Grenada                        "                                     
 [79] "                                                        Georgia                        "                                     
 [80] "                                                        French Guiana                        "                               
 [81] "                                                        Guernsey                        "                                    
 [82] "                                                        Ghana                        "                                       
 [83] "                                                        Gibraltar                        "                                   
 [84] "                                                        Greenland                        "                                   
 [85] "                                                        Gambia                        "                                      
 [86] "                                                        Guinea                        "                                      
 [87] "                                                        Guadeloupe                        "                                  
 [88] "                                                        Equatorial Guinea                        "                           
 [89] "                                                        Greece                        "                                      
 [90] "                                                        South Georgia and the South Sandwich Islands                        "
 [91] "                                                        Guatemala                        "                                   
 [92] "                                                        Guam                        "                                        
 [93] "                                                        Guinea-Bissau                        "                               
 [94] "                                                        Guyana                        "                                      
 [95] "                                                        Hong Kong                        "                                   
 [96] "                                                        Heard Island and McDonald Islands                        "           
 [97] "                                                        Honduras                        "                                    
 [98] "                                                        Croatia                        "                                     
 [99] "                                                        Haiti                        "                                       
[100] "                                                        Hungary                        "                                     
[101] "                                                        Indonesia                        "                                   
[102] "                                                        Ireland                        "                                     
[103] "                                                        Israel                        "                                      
[104] "                                                        Isle of Man                        "                                 
[105] "                                                        India                        "                                       
[106] "                                                        British Indian Ocean Territory                        "              
[107] "                                                        Iraq                        "                                        
[108] "                                                        Iran                        "                                        
[109] "                                                        Iceland                        "                                     
[110] "                                                        Italy                        "                                       
[111] "                                                        Jersey                        "                                      
[112] "                                                        Jamaica                        "                                     
[113] "                                                        Jordan                        "                                      
[114] "                                                        Japan                        "                                       
[115] "                                                        Kenya                        "                                       
[116] "                                                        Kyrgyzstan                        "                                  
[117] "                                                        Cambodia                        "                                    
[118] "                                                        Kiribati                        "                                    
[119] "                                                        Comoros                        "                                     
[120] "                                                        Saint Kitts and Nevis                        "                       
[121] "                                                        North Korea                        "                                 
[122] "                                                        South Korea                        "                                 
[123] "                                                        Kuwait                        "                                      
[124] "                                                        Cayman Islands                        "                              
[125] "                                                        Kazakhstan                        "                                  
[126] "                                                        Laos                        "                                        
[127] "                                                        Lebanon                        "                                     
[128] "                                                        Saint Lucia                        "                                 
[129] "                                                        Liechtenstein                        "                               
[130] "                                                        Sri Lanka                        "                                   
[131] "                                                        Liberia                        "                                     
[132] "                                                        Lesotho                        "                                     
[133] "                                                        Lithuania                        "                                   
[134] "                                                        Luxembourg                        "                                  
[135] "                                                        Latvia                        "                                      
[136] "                                                        Libya                        "                                       
[137] "                                                        Morocco                        "                                     
[138] "                                                        Monaco                        "                                      
[139] "                                                        Moldova                        "                                     
[140] "                                                        Montenegro                        "                                  
[141] "                                                        Saint Martin                        "                                
[142] "                                                        Madagascar                        "                                  
[143] "                                                        Marshall Islands                        "                            
[144] "                                                        Macedonia                        "                                   
[145] "                                                        Mali                        "                                        
[146] "                                                        Myanmar [Burma]                        "                             
[147] "                                                        Mongolia                        "                                    
[148] "                                                        Macao                        "                                       
[149] "                                                        Northern Mariana Islands                        "                    
[150] "                                                        Martinique                        "                                  
[151] "                                                        Mauritania                        "                                  
[152] "                                                        Montserrat                        "                                  
[153] "                                                        Malta                        "                                       
[154] "                                                        Mauritius                        "                                   
[155] "                                                        Maldives                        "                                    
[156] "                                                        Malawi                        "                                      
[157] "                                                        Mexico                        "                                      
[158] "                                                        Malaysia                        "                                    
[159] "                                                        Mozambique                        "                                  
[160] "                                                        Namibia                        "                                     
[161] "                                                        New Caledonia                        "                               
[162] "                                                        Niger                        "                                       
[163] "                                                        Norfolk Island                        "                              
[164] "                                                        Nigeria                        "                                     
[165] "                                                        Nicaragua                        "                                   
[166] "                                                        Netherlands                        "                                 
[167] "                                                        Norway                        "                                      
[168] "                                                        Nepal                        "                                       
[169] "                                                        Nauru                        "                                       
[170] "                                                        Niue                        "                                        
[171] "                                                        New Zealand                        "                                 
[172] "                                                        Oman                        "                                        
[173] "                                                        Panama                        "                                      
[174] "                                                        Peru                        "                                        
[175] "                                                        French Polynesia                        "                            
[176] "                                                        Papua New Guinea                        "                            
[177] "                                                        Philippines                        "                                 
[178] "                                                        Pakistan                        "                                    
[179] "                                                        Poland                        "                                      
[180] "                                                        Saint Pierre and Miquelon                        "                   
[181] "                                                        Pitcairn Islands                        "                            
[182] "                                                        Puerto Rico                        "                                 
[183] "                                                        Palestine                        "                                   
[184] "                                                        Portugal                        "                                    
[185] "                                                        Palau                        "                                       
[186] "                                                        Paraguay                        "                                    
[187] "                                                        Qatar                        "                                       
[188] "                                                        Réunion                        "                                     
[189] "                                                        Romania                        "                                     
[190] "                                                        Serbia                        "                                      
[191] "                                                        Russia                        "                                      
[192] "                                                        Rwanda                        "                                      
[193] "                                                        Saudi Arabia                        "                                
[194] "                                                        Solomon Islands                        "                             
[195] "                                                        Seychelles                        "                                  
[196] "                                                        Sudan                        "                                       
[197] "                                                        Sweden                        "                                      
[198] "                                                        Singapore                        "                                   
[199] "                                                        Saint Helena                        "                                
[200] "                                                        Slovenia                        "                                    
[201] "                                                        Svalbard and Jan Mayen                        "                      
[202] "                                                        Slovakia                        "                                    
[203] "                                                        Sierra Leone                        "                                
[204] "                                                        San Marino                        "                                  
[205] "                                                        Senegal                        "                                     
[206] "                                                        Somalia                        "                                     
[207] "                                                        Suriname                        "                                    
[208] "                                                        South Sudan                        "                                 
[209] "                                                        São Tomé and Príncipe                        "                       
[210] "                                                        El Salvador                        "                                 
[211] "                                                        Sint Maarten                        "                                
[212] "                                                        Syria                        "                                       
[213] "                                                        Swaziland                        "                                   
[214] "                                                        Turks and Caicos Islands                        "                    
[215] "                                                        Chad                        "                                        
[216] "                                                        French Southern Territories                        "                 
[217] "                                                        Togo                        "                                        
[218] "                                                        Thailand                        "                                    
[219] "                                                        Tajikistan                        "                                  
[220] "                                                        Tokelau                        "                                     
[221] "                                                        East Timor                        "                                  
[222] "                                                        Turkmenistan                        "                                
[223] "                                                        Tunisia                        "                                     
[224] "                                                        Tonga                        "                                       
[225] "                                                        Turkey                        "                                      
[226] "                                                        Trinidad and Tobago                        "                         
[227] "                                                        Tuvalu                        "                                      
[228] "                                                        Taiwan                        "                                      
[229] "                                                        Tanzania                        "                                    
[230] "                                                        Ukraine                        "                                     
[231] "                                                        Uganda                        "                                      
[232] "                                                        U.S. Minor Outlying Islands                        "                 
[233] "                                                        United States                        "                               
[234] "                                                        Uruguay                        "                                     
[235] "                                                        Uzbekistan                        "                                  
[236] "                                                        Vatican City                        "                                
[237] "                                                        Saint Vincent and the Grenadines                        "            
[238] "                                                        Venezuela                        "                                   
[239] "                                                        British Virgin Islands                        "                      
[240] "                                                        U.S. Virgin Islands                        "                         
[241] "                                                        Vietnam                        "                                     
[242] "                                                        Vanuatu                        "                                     
[243] "                                                        Wallis and Futuna                        "                           
[244] "                                                        Samoa                        "                                       
[245] "                                                        Kosovo                        "                                      
[246] "                                                        Yemen                        "                                       
[247] "                                                        Mayotte                        "                                     
[248] "                                                        South Africa                        "                                
[249] "                                                        Zambia                        "                                      
[250] "                                                        Zimbabwe                        "                                    
# The following has a bug: it removes ALL spaces, including the ones
# inside a name (e.g. "New Zealand" would become "NewZealand").
# countries = gsub(" " , "" , without_backslash_n)

# Instead, remove only the spaces that appear BEFORE the country name
# (first call) and AFTER the country name (second call)
countries1 = gsub("^ *", "", without_backslash_n)
countries2 = gsub(" *$", "", countries1)

countries = countries2

# Alternatively, you can eliminate spaces and \n characters
# before and after the text all in one shot.
#
# NOTE: write the newline as "\n" here, not "\\n". Inside a bracket
# expression the default regex engine takes the backslash and the n
# literally, so "[ \\n]" also matches the letter n and strips the
# final "n" from names such as "Afghanistan".
countries3 = gsub("(^[ \n]*)|([ \n]*$)", "", without_backslash_n)
countries3
  [1] "Andorra"
  [2] "United Arab Emirates"
  [3] "Afghanistan"
  [4] "Antigua and Barbuda"
  [5] "Anguilla"
  [6] "Albania"
  [7] "Armenia"
  [8] "Angola"
  [9] "Antarctica"
 [10] "Argentina"
 [11] "American Samoa"
 [12] "Austria"
 [13] "Australia"
 [14] "Aruba"
 [15] "Åland"
 [16] "Azerbaijan"
 [17] "Bosnia and Herzegovina"
 [18] "Barbados"
 [19] "Bangladesh"
 [20] "Belgium"
 [21] "Burkina Faso"
 [22] "Bulgaria"
 [23] "Bahrain"
 [24] "Burundi"
 [25] "Benin"
 [26] "Saint Barthélemy"
 [27] "Bermuda"
 [28] "Brunei"
 [29] "Bolivia"
 [30] "Bonaire"
 [31] "Brazil"
 [32] "Bahamas"
 [33] "Bhutan"
 [34] "Bouvet Island"
 [35] "Botswana"
 [36] "Belarus"
 [37] "Belize"
 [38] "Canada"
 [39] "Cocos [Keeling] Islands"
 [40] "Democratic Republic of the Congo"
 [41] "Central African Republic"
 [42] "Republic of the Congo"
 [43] "Switzerland"
 [44] "Ivory Coast"
 [45] "Cook Islands"
 [46] "Chile"
 [47] "Cameroon"
 [48] "China"
 [49] "Colombia"
 [50] "Costa Rica"
 [51] "Cuba"
 [52] "Cape Verde"
 [53] "Curacao"
 [54] "Christmas Island"
 [55] "Cyprus"
 [56] "Czech Republic"
 [57] "Germany"
 [58] "Djibouti"
 [59] "Denmark"
 [60] "Dominica"
 [61] "Dominican Republic"
 [62] "Algeria"
 [63] "Ecuador"
 [64] "Estonia"
 [65] "Egypt"
 [66] "Western Sahara"
 [67] "Eritrea"
 [68] "Spain"
 [69] "Ethiopia"
 [70] "Finland"
 [71] "Fiji"
 [72] "Falkland Islands"
 [73] "Micronesia"
 [74] "Faroe Islands"
 [75] "France"
 [76] "Gabon"
 [77] "United Kingdom"
 [78] "Grenada"
 [79] "Georgia"
 [80] "French Guiana"
 [81] "Guernsey"
 [82] "Ghana"
 [83] "Gibraltar"
 [84] "Greenland"
 [85] "Gambia"
 [86] "Guinea"
 [87] "Guadeloupe"
 [88] "Equatorial Guinea"
 [89] "Greece"
 [90] "South Georgia and the South Sandwich Islands"
 [91] "Guatemala"
 [92] "Guam"
 [93] "Guinea-Bissau"
 [94] "Guyana"
 [95] "Hong Kong"
 [96] "Heard Island and McDonald Islands"
 [97] "Honduras"
 [98] "Croatia"
 [99] "Haiti"
[100] "Hungary"
[101] "Indonesia"
[102] "Ireland"
[103] "Israel"
[104] "Isle of Man"
[105] "India"
[106] "British Indian Ocean Territory"
[107] "Iraq"
[108] "Iran"
[109] "Iceland"
[110] "Italy"
[111] "Jersey"
[112] "Jamaica"
[113] "Jordan"
[114] "Japan"
[115] "Kenya"
[116] "Kyrgyzstan"
[117] "Cambodia"
[118] "Kiribati"
[119] "Comoros"
[120] "Saint Kitts and Nevis"
[121] "North Korea"
[122] "South Korea"
[123] "Kuwait"
[124] "Cayman Islands"
[125] "Kazakhstan"
[126] "Laos"
[127] "Lebanon"
[128] "Saint Lucia"
[129] "Liechtenstein"
[130] "Sri Lanka"
[131] "Liberia"
[132] "Lesotho"
[133] "Lithuania"
[134] "Luxembourg"
[135] "Latvia"
[136] "Libya"
[137] "Morocco"
[138] "Monaco"
[139] "Moldova"
[140] "Montenegro"
[141] "Saint Martin"
[142] "Madagascar"
[143] "Marshall Islands"
[144] "Macedonia"
[145] "Mali"
[146] "Myanmar [Burma]"
[147] "Mongolia"
[148] "Macao"
[149] "Northern Mariana Islands"
[150] "Martinique"
[151] "Mauritania"
[152] "Montserrat"
[153] "Malta"
[154] "Mauritius"
[155] "Maldives"
[156] "Malawi"
[157] "Mexico"
[158] "Malaysia"
[159] "Mozambique"
[160] "Namibia"
[161] "New Caledonia"
[162] "Niger"
[163] "Norfolk Island"
[164] "Nigeria"
[165] "Nicaragua"
[166] "Netherlands"
[167] "Norway"
[168] "Nepal"
[169] "Nauru"
[170] "Niue"
[171] "New Zealand"
[172] "Oman"
[173] "Panama"
[174] "Peru"
[175] "French Polynesia"
[176] "Papua New Guinea"
[177] "Philippines"
[178] "Pakistan"
[179] "Poland"
[180] "Saint Pierre and Miquelon"
[181] "Pitcairn Islands"
[182] "Puerto Rico"
[183] "Palestine"
[184] "Portugal"
[185] "Palau"
[186] "Paraguay"
[187] "Qatar"
[188] "Réunion"
[189] "Romania"
[190] "Serbia"
[191] "Russia"
[192] "Rwanda"
[193] "Saudi Arabia"
[194] "Solomon Islands"
[195] "Seychelles"
[196] "Sudan"
[197] "Sweden"
[198] "Singapore"
[199] "Saint Helena"
[200] "Slovenia"
[201] "Svalbard and Jan Mayen"
[202] "Slovakia"
[203] "Sierra Leone"
[204] "San Marino"
[205] "Senegal"
[206] "Somalia"
[207] "Suriname"
[208] "South Sudan"
[209] "São Tomé and Príncipe"
[210] "El Salvador"
[211] "Sint Maarten"
[212] "Syria"
[213] "Swaziland"
[214] "Turks and Caicos Islands"
[215] "Chad"
[216] "French Southern Territories"
[217] "Togo"
[218] "Thailand"
[219] "Tajikistan"
[220] "Tokelau"
[221] "East Timor"
[222] "Turkmenistan"
[223] "Tunisia"
[224] "Tonga"
[225] "Turkey"
[226] "Trinidad and Tobago"
[227] "Tuvalu"
[228] "Taiwan"
[229] "Tanzania"
[230] "Ukraine"
[231] "Uganda"
[232] "U.S. Minor Outlying Islands"
[233] "United States"
[234] "Uruguay"
[235] "Uzbekistan"
[236] "Vatican City"
[237] "Saint Vincent and the Grenadines"
[238] "Venezuela"
[239] "British Virgin Islands"
[240] "U.S. Virgin Islands"
[241] "Vietnam"
[242] "Vanuatu"
[243] "Wallis and Futuna"
[244] "Samoa"
[245] "Kosovo"
[246] "Yemen"
[247] "Mayotte"
[248] "South Africa"
[249] "Zambia"
[250] "Zimbabwe"
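
#-------------------------------------------------------------------------------
# A minimal sketch, not part of the original scrape: base R's trimws()
# can do this kind of trimming without a hand-written regex.
# The vector `padded` is made-up example data (an assumption for this
# illustration), not the scraped values.
#-------------------------------------------------------------------------------
padded = c("   Andorra   ", "\n  New Zealand \n", "Oman")

# By default trimws() removes spaces, tabs and newlines from both ends.
# Spaces inside a name, and a final letter "n", are left alone.
trimmed = trimws(padded)
trimmed   # expected: "Andorra" "New Zealand" "Oman"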

26.9 Get all data into a dataframe

########################################################
#
# Now that we understand the basics we can get
# all of the country data and put it in a dataframe
#
########################################################

url = "https://scrapethissite.com/pages/simple/"

the_full_html = read_html(url)

countries = the_full_html %>%
              html_elements( ".country-name" ) %>%
              html_text()

capitals = the_full_html %>%
              html_elements( ".country-capital" ) %>%
              html_text()

population = the_full_html %>%
              html_elements( ".country-population" ) %>%
              html_text()

area = the_full_html %>%
              html_elements( ".country-area" ) %>%
              html_text()
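
#-------------------------------------------------------------------------------
# A minimal sketch, assuming rvest 1.0.0 or later: html_text(trim = TRUE)
# strips leading and trailing whitespace while the text is extracted, which
# may save a separate clean-up step. The page below is a made-up, in-memory
# stand-in for the real site (demo_page and demo_names are hypothetical
# names), so the sketch runs without a network connection.
#-------------------------------------------------------------------------------
library(rvest)

demo_page = minimal_html('
  <h3 class="country-name">
      <i class="flag-icon"></i>
      Andorra
  </h3>
  <h3 class="country-name">
      <i class="flag-icon"></i>
      New Zealand
  </h3>')

demo_names = demo_page %>%
               html_elements( ".country-name" ) %>%
               html_text(trim = TRUE)

demo_names   # expected: "Andorra" "New Zealand"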


countries
  [1] "\n                            \n                            Andorra\n                        "                                     
  [2] "\n                            \n                            United Arab Emirates\n                        "                        
  [3] "\n                            \n                            Afghanistan\n                        "                                 
  [4] "\n                            \n                            Antigua and Barbuda\n                        "                         
  [5] "\n                            \n                            Anguilla\n                        "                                    
  [6] "\n                            \n                            Albania\n                        "                                     
  [7] "\n                            \n                            Armenia\n                        "                                     
  [8] "\n                            \n                            Angola\n                        "                                      
  [9] "\n                            \n                            Antarctica\n                        "                                  
 [10] "\n                            \n                            Argentina\n                        "                                   
 [11] "\n                            \n                            American Samoa\n                        "                              
 [12] "\n                            \n                            Austria\n                        "                                     
 [13] "\n                            \n                            Australia\n                        "                                   
 [14] "\n                            \n                            Aruba\n                        "                                       
 [15] "\n                            \n                            Åland\n                        "                                       
 [16] "\n                            \n                            Azerbaijan\n                        "                                  
 [17] "\n                            \n                            Bosnia and Herzegovina\n                        "                      
 [18] "\n                            \n                            Barbados\n                        "                                    
 [19] "\n                            \n                            Bangladesh\n                        "                                  
 [20] "\n                            \n                            Belgium\n                        "                                     
 [21] "\n                            \n                            Burkina Faso\n                        "                                
 [22] "\n                            \n                            Bulgaria\n                        "                                    
 [23] "\n                            \n                            Bahrain\n                        "                                     
 [24] "\n                            \n                            Burundi\n                        "                                     
 [25] "\n                            \n                            Benin\n                        "                                       
 [26] "\n                            \n                            Saint Barthélemy\n                        "                            
 [27] "\n                            \n                            Bermuda\n                        "                                     
 [28] "\n                            \n                            Brunei\n                        "                                      
 [29] "\n                            \n                            Bolivia\n                        "                                     
 [30] "\n                            \n                            Bonaire\n                        "                                     
 [31] "\n                            \n                            Brazil\n                        "                                      
 [32] "\n                            \n                            Bahamas\n                        "                                     
 [33] "\n                            \n                            Bhutan\n                        "                                      
 [34] "\n                            \n                            Bouvet Island\n                        "                               
 [35] "\n                            \n                            Botswana\n                        "                                    
 [36] "\n                            \n                            Belarus\n                        "                                     
 [37] "\n                            \n                            Belize\n                        "                                      
 [38] "\n                            \n                            Canada\n                        "                                      
 [39] "\n                            \n                            Cocos [Keeling] Islands\n                        "                     
 [40] "\n                            \n                            Democratic Republic of the Congo\n                        "            
 [41] "\n                            \n                            Central African Republic\n                        "                    
 [42] "\n                            \n                            Republic of the Congo\n                        "                       
 [43] "\n                            \n                            Switzerland\n                        "                                 
 [44] "\n                            \n                            Ivory Coast\n                        "                                 
 [45] "\n                            \n                            Cook Islands\n                        "                                
 [46] "\n                            \n                            Chile\n                        "                                       
 [47] "\n                            \n                            Cameroon\n                        "                                    
 [48] "\n                            \n                            China\n                        "                                       
 [49] "\n                            \n                            Colombia\n                        "                                    
 [50] "\n                            \n                            Costa Rica\n                        "                                  
 [51] "\n                            \n                            Cuba\n                        "                                        
 [52] "\n                            \n                            Cape Verde\n                        "                                  
 [53] "\n                            \n                            Curacao\n                        "                                     
 [54] "\n                            \n                            Christmas Island\n                        "                            
 [55] "\n                            \n                            Cyprus\n                        "                                      
 [56] "\n                            \n                            Czech Republic\n                        "                              
 [57] "\n                            \n                            Germany\n                        "                                     
 [58] "\n                            \n                            Djibouti\n                        "                                    
 [59] "\n                            \n                            Denmark\n                        "                                     
 [60] "\n                            \n                            Dominica\n                        "                                    
 [61] "\n                            \n                            Dominican Republic\n                        "                          
 [62] "\n                            \n                            Algeria\n                        "                                     
 [63] "\n                            \n                            Ecuador\n                        "                                     
 [64] "\n                            \n                            Estonia\n                        "                                     
 [65] "\n                            \n                            Egypt\n                        "                                       
 [66] "\n                            \n                            Western Sahara\n                        "                              
 [67] "\n                            \n                            Eritrea\n                        "                                     
 [68] "\n                            \n                            Spain\n                        "                                       
 [69] "\n                            \n                            Ethiopia\n                        "                                    
 [70] "\n                            \n                            Finland\n                        "                                     
 [71] "\n                            \n                            Fiji\n                        "                                        
 [72] "\n                            \n                            Falkland Islands\n                        "                            
 [73] "\n                            \n                            Micronesia\n                        "                                  
 [74] "\n                            \n                            Faroe Islands\n                        "                               
 [75] "\n                            \n                            France\n                        "                                      
 [76] "\n                            \n                            Gabon\n                        "                                       
 [77] "\n                            \n                            United Kingdom\n                        "                              
 [78] "\n                            \n                            Grenada\n                        "                                     
 [79] "\n                            \n                            Georgia\n                        "                                     
 [80] "\n                            \n                            French Guiana\n                        "                               
 [81] "\n                            \n                            Guernsey\n                        "                                    
 [82] "\n                            \n                            Ghana\n                        "                                       
 [83] "\n                            \n                            Gibraltar\n                        "                                   
 [84] "\n                            \n                            Greenland\n                        "                                   
 [85] "\n                            \n                            Gambia\n                        "                                      
 [86] "\n                            \n                            Guinea\n                        "                                      
 [87] "\n                            \n                            Guadeloupe\n                        "                                  
 [88] "\n                            \n                            Equatorial Guinea\n                        "                           
 [89] "\n                            \n                            Greece\n                        "                                      
 [90] "\n                            \n                            South Georgia and the South Sandwich Islands\n                        "
 [91] "\n                            \n                            Guatemala\n                        "                                   
 [92] "\n                            \n                            Guam\n                        "                                        
 [93] "\n                            \n                            Guinea-Bissau\n                        "                               
 [94] "\n                            \n                            Guyana\n                        "                                      
 [95] "\n                            \n                            Hong Kong\n                        "                                   
 [96] "\n                            \n                            Heard Island and McDonald Islands\n                        "           
 [97] "\n                            \n                            Honduras\n                        "                                    
 [98] "\n                            \n                            Croatia\n                        "                                     
 [99] "\n                            \n                            Haiti\n                        "                                       
[100] "\n                            \n                            Hungary\n                        "                                     
[101] "\n                            \n                            Indonesia\n                        "                                   
[102] "\n                            \n                            Ireland\n                        "                                     
[103] "\n                            \n                            Israel\n                        "                                      
[104] "\n                            \n                            Isle of Man\n                        "                                 
[105] "\n                            \n                            India\n                        "                                       
[106] "\n                            \n                            British Indian Ocean Territory\n                        "              
[107] "\n                            \n                            Iraq\n                        "                                        
[108] "\n                            \n                            Iran\n                        "                                        
[109] "\n                            \n                            Iceland\n                        "                                     
[110] "\n                            \n                            Italy\n                        "                                       
[111] "\n                            \n                            Jersey\n                        "                                      
[112] "\n                            \n                            Jamaica\n                        "                                     
[113] "\n                            \n                            Jordan\n                        "                                      
[114] "\n                            \n                            Japan\n                        "                                       
[115] "\n                            \n                            Kenya\n                        "                                       
[116] "\n                            \n                            Kyrgyzstan\n                        "                                  
[117] "\n                            \n                            Cambodia\n                        "                                    
[118] "\n                            \n                            Kiribati\n                        "                                    
[119] "\n                            \n                            Comoros\n                        "                                     
[120] "\n                            \n                            Saint Kitts and Nevis\n                        "                       
[121] "\n                            \n                            North Korea\n                        "                                 
[122] "\n                            \n                            South Korea\n                        "                                 
[123] "\n                            \n                            Kuwait\n                        "                                      
[124] "\n                            \n                            Cayman Islands\n                        "                              
[125] "\n                            \n                            Kazakhstan\n                        "                                  
[126] "\n                            \n                            Laos\n                        "                                        
[127] "\n                            \n                            Lebanon\n                        "                                     
[128] "\n                            \n                            Saint Lucia\n                        "                                 
[129] "\n                            \n                            Liechtenstein\n                        "                               
[130] "\n                            \n                            Sri Lanka\n                        "                                   
[131] "\n                            \n                            Liberia\n                        "                                     
[132] "\n                            \n                            Lesotho\n                        "                                     
[133] "\n                            \n                            Lithuania\n                        "                                   
[134] "\n                            \n                            Luxembourg\n                        "                                  
[135] "\n                            \n                            Latvia\n                        "                                      
[136] "\n                            \n                            Libya\n                        "                                       
[137] "\n                            \n                            Morocco\n                        "                                     
[138] "\n                            \n                            Monaco\n                        "                                      
[139] "\n                            \n                            Moldova\n                        "                                     
[140] "\n                            \n                            Montenegro\n                        "                                  
[141] "\n                            \n                            Saint Martin\n                        "                                
[142] "\n                            \n                            Madagascar\n                        "                                  
[143] "\n                            \n                            Marshall Islands\n                        "                            
[144] "\n                            \n                            Macedonia\n                        "                                   
[145] "\n                            \n                            Mali\n                        "                                        
[146] "\n                            \n                            Myanmar [Burma]\n                        "                             
[147] "\n                            \n                            Mongolia\n                        "                                    
[148] "\n                            \n                            Macao\n                        "                                       
[149] "\n                            \n                            Northern Mariana Islands\n                        "                    
[150] "\n                            \n                            Martinique\n                        "                                  
[151] "\n                            \n                            Mauritania\n                        "                                  
[152] "\n                            \n                            Montserrat\n                        "                                  
[153] "\n                            \n                            Malta\n                        "                                       
[154] "\n                            \n                            Mauritius\n                        "                                   
[155] "\n                            \n                            Maldives\n                        "                                    
[156] "\n                            \n                            Malawi\n                        "                                      
[157] "\n                            \n                            Mexico\n                        "                                      
[158] "\n                            \n                            Malaysia\n                        "                                    
[159] "\n                            \n                            Mozambique\n                        "                                  
[160] "\n                            \n                            Namibia\n                        "                                     
[161] "\n                            \n                            New Caledonia\n                        "                               
[162] "\n                            \n                            Niger\n                        "                                       
[163] "\n                            \n                            Norfolk Island\n                        "                              
[164] "\n                            \n                            Nigeria\n                        "                                     
[165] "\n                            \n                            Nicaragua\n                        "                                   
[166] "\n                            \n                            Netherlands\n                        "                                 
[167] "\n                            \n                            Norway\n                        "                                      
[168] "\n                            \n                            Nepal\n                        "                                       
[169] "\n                            \n                            Nauru\n                        "                                       
[170] "\n                            \n                            Niue\n                        "                                        
[171] "\n                            \n                            New Zealand\n                        "                                 
[172] "\n                            \n                            Oman\n                        "                                        
[173] "\n                            \n                            Panama\n                        "                                      
[174] "\n                            \n                            Peru\n                        "                                        
[175] "\n                            \n                            French Polynesia\n                        "                            
[176] "\n                            \n                            Papua New Guinea\n                        "                            
[177] "\n                            \n                            Philippines\n                        "                                 
[178] "\n                            \n                            Pakistan\n                        "                                    
[179] "\n                            \n                            Poland\n                        "                                      
[180] "\n                            \n                            Saint Pierre and Miquelon\n                        "                   
[181] "\n                            \n                            Pitcairn Islands\n                        "                            
[182] "\n                            \n                            Puerto Rico\n                        "                                 
[183] "\n                            \n                            Palestine\n                        "                                   
[184] "\n                            \n                            Portugal\n                        "                                    
[185] "\n                            \n                            Palau\n                        "                                       
[186] "\n                            \n                            Paraguay\n                        "                                    
[187] "\n                            \n                            Qatar\n                        "                                       
[188] "\n                            \n                            Réunion\n                        "                                     
[189] "\n                            \n                            Romania\n                        "                                     
[190] "\n                            \n                            Serbia\n                        "                                      
[191] "\n                            \n                            Russia\n                        "                                      
[192] "\n                            \n                            Rwanda\n                        "                                      
[193] "\n                            \n                            Saudi Arabia\n                        "                                
[194] "\n                            \n                            Solomon Islands\n                        "                             
[195] "\n                            \n                            Seychelles\n                        "                                  
[196] "\n                            \n                            Sudan\n                        "                                       
[197] "\n                            \n                            Sweden\n                        "                                      
[198] "\n                            \n                            Singapore\n                        "                                   
[199] "\n                            \n                            Saint Helena\n                        "                                
[200] "\n                            \n                            Slovenia\n                        "                                    
[201] "\n                            \n                            Svalbard and Jan Mayen\n                        "                      
[202] "\n                            \n                            Slovakia\n                        "                                    
[203] "\n                            \n                            Sierra Leone\n                        "                                
[204] "\n                            \n                            San Marino\n                        "                                  
[205] "\n                            \n                            Senegal\n                        "                                     
[206] "\n                            \n                            Somalia\n                        "                                     
[207] "\n                            \n                            Suriname\n                        "                                    
[208] "\n                            \n                            South Sudan\n                        "                                 
[209] "\n                            \n                            São Tomé and Príncipe\n                        "                       
[210] "\n                            \n                            El Salvador\n                        "                                 
[211] "\n                            \n                            Sint Maarten\n                        "                                
[212] "\n                            \n                            Syria\n                        "                                       
[213] "\n                            \n                            Swaziland\n                        "                                   
[214] "\n                            \n                            Turks and Caicos Islands\n                        "                    
[215] "\n                            \n                            Chad\n                        "                                        
[216] "\n                            \n                            French Southern Territories\n                        "                 
[217] "\n                            \n                            Togo\n                        "                                        
[218] "\n                            \n                            Thailand\n                        "                                    
[219] "\n                            \n                            Tajikistan\n                        "                                  
[220] "\n                            \n                            Tokelau\n                        "                                     
[221] "\n                            \n                            East Timor\n                        "                                  
[222] "\n                            \n                            Turkmenistan\n                        "                                
[223] "\n                            \n                            Tunisia\n                        "                                     
[224] "\n                            \n                            Tonga\n                        "                                       
[225] "\n                            \n                            Turkey\n                        "                                      
[226] "\n                            \n                            Trinidad and Tobago\n                        "                         
[227] "\n                            \n                            Tuvalu\n                        "                                      
[228] "\n                            \n                            Taiwan\n                        "                                      
[229] "\n                            \n                            Tanzania\n                        "                                    
[230] "\n                            \n                            Ukraine\n                        "                                     
[231] "\n                            \n                            Uganda\n                        "                                      
[232] "\n                            \n                            U.S. Minor Outlying Islands\n                        "                 
[233] "\n                            \n                            United States\n                        "                               
[234] "\n                            \n                            Uruguay\n                        "                                     
[235] "\n                            \n                            Uzbekistan\n                        "                                  
[236] "\n                            \n                            Vatican City\n                        "                                
[237] "\n                            \n                            Saint Vincent and the Grenadines\n                        "            
[238] "\n                            \n                            Venezuela\n                        "                                   
[239] "\n                            \n                            British Virgin Islands\n                        "                      
[240] "\n                            \n                            U.S. Virgin Islands\n                        "                         
[241] "\n                            \n                            Vietnam\n                        "                                     
[242] "\n                            \n                            Vanuatu\n                        "                                     
[243] "\n                            \n                            Wallis and Futuna\n                        "                           
[244] "\n                            \n                            Samoa\n                        "                                       
[245] "\n                            \n                            Kosovo\n                        "                                      
[246] "\n                            \n                            Yemen\n                        "                                       
[247] "\n                            \n                            Mayotte\n                        "                                     
[248] "\n                            \n                            South Africa\n                        "                                
[249] "\n                            \n                            Zambia\n                        "                                      
[250] "\n                            \n                            Zimbabwe\n                        "                                    
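Each raw string printed above is wrapped in newline characters and runs of spaces. As a minimal sketch of the cleanup, the block below applies the same two-step gsub() approach to a small made-up vector and compares it with base R's trimws(). The vector `raw` and the names `step1`, `clean_gsub`, and `clean_trimws` are hypothetical and exist only for this illustration; they are not part of the scraped data.

```r
# Hypothetical sample that mimics the shape of the raw scraped strings
# (made up for illustration, not actual scraped output)
raw <- c("\n        \n        Andorra\n    ",
         "\n        \n        United Arab Emirates\n    ",
         "\n        \n        Cocos [Keeling] Islands\n    ")

# Approach 1: remove every newline, then strip leading and trailing spaces.
# The pattern "(^ *)|( *$)" matches a run of spaces at the start of the
# string OR a run of spaces at the end, so spaces inside a name are kept.
step1      <- gsub("\\n", "", raw)
clean_gsub <- gsub("(^ *)|( *$)", "", step1)

# Approach 2: trimws() strips leading and trailing whitespace
# (spaces, tabs, carriage returns, newlines) in a single call.
clean_trimws <- trimws(raw)

# Compare the two results
print(clean_gsub)
print(identical(clean_gsub, clean_trimws))
```

For strings shaped like these, where the newlines and spaces sit only at the ends, either approach should work. The two would differ if a newline appeared in the middle of a name: gsub("\\n", ...) would remove it, while trimws() would leave it in place.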
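# Clean up the scraped country names in two steps:
# 1. remove every newline character
countries = gsub("\\n", "", countries)
# 2. strip the leading and trailing spaces (spaces inside a name are kept)
countries = gsub("(^ *)|( *$)", "", countries)
countries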
  [1] "Andorra"                                     
  [2] "United Arab Emirates"                        
  [3] "Afghanistan"                                 
  [4] "Antigua and Barbuda"                         
  [5] "Anguilla"                                    
  [6] "Albania"                                     
  [7] "Armenia"                                     
  [8] "Angola"                                      
  [9] "Antarctica"                                  
 [10] "Argentina"                                   
 [11] "American Samoa"                              
 [12] "Austria"                                     
 [13] "Australia"                                   
 [14] "Aruba"                                       
 [15] "Åland"                                       
 [16] "Azerbaijan"                                  
 [17] "Bosnia and Herzegovina"                      
 [18] "Barbados"                                    
 [19] "Bangladesh"                                  
 [20] "Belgium"                                     
 [21] "Burkina Faso"                                
 [22] "Bulgaria"                                    
 [23] "Bahrain"                                     
 [24] "Burundi"                                     
 [25] "Benin"                                       
 [26] "Saint Barthélemy"                            
 [27] "Bermuda"                                     
 [28] "Brunei"                                      
 [29] "Bolivia"                                     
 [30] "Bonaire"                                     
 [31] "Brazil"                                      
 [32] "Bahamas"                                     
 [33] "Bhutan"                                      
 [34] "Bouvet Island"                               
 [35] "Botswana"                                    
 [36] "Belarus"                                     
 [37] "Belize"                                      
 [38] "Canada"                                      
 [39] "Cocos [Keeling] Islands"                     
 [40] "Democratic Republic of the Congo"            
 [41] "Central African Republic"                    
 [42] "Republic of the Congo"                       
 [43] "Switzerland"                                 
 [44] "Ivory Coast"                                 
 [45] "Cook Islands"                                
 [46] "Chile"                                       
 [47] "Cameroon"                                    
 [48] "China"                                       
 [49] "Colombia"                                    
 [50] "Costa Rica"                                  
 [51] "Cuba"                                        
 [52] "Cape Verde"                                  
 [53] "Curacao"                                     
 [54] "Christmas Island"                            
 [55] "Cyprus"                                      
 [56] "Czech Republic"                              
 [57] "Germany"                                     
 [58] "Djibouti"                                    
 [59] "Denmark"                                     
 [60] "Dominica"                                    
 [61] "Dominican Republic"                          
 [62] "Algeria"                                     
 [63] "Ecuador"                                     
 [64] "Estonia"                                     
 [65] "Egypt"                                       
 [66] "Western Sahara"                              
 [67] "Eritrea"                                     
 [68] "Spain"                                       
 [69] "Ethiopia"                                    
 [70] "Finland"                                     
 [71] "Fiji"                                        
 [72] "Falkland Islands"                            
 [73] "Micronesia"                                  
 [74] "Faroe Islands"                               
 [75] "France"                                      
 [76] "Gabon"                                       
 [77] "United Kingdom"                              
 [78] "Grenada"                                     
 [79] "Georgia"                                     
 [80] "French Guiana"                               
 [81] "Guernsey"                                    
 [82] "Ghana"                                       
 [83] "Gibraltar"                                   
 [84] "Greenland"                                   
 [85] "Gambia"                                      
 [86] "Guinea"                                      
 [87] "Guadeloupe"                                  
 [88] "Equatorial Guinea"                           
 [89] "Greece"                                      
 [90] "South Georgia and the South Sandwich Islands"
 [91] "Guatemala"                                   
 [92] "Guam"                                        
 [93] "Guinea-Bissau"                               
 [94] "Guyana"                                      
 [95] "Hong Kong"                                   
 [96] "Heard Island and McDonald Islands"           
 [97] "Honduras"                                    
 [98] "Croatia"                                     
 [99] "Haiti"                                       
[100] "Hungary"                                     
[101] "Indonesia"                                   
[102] "Ireland"                                     
[103] "Israel"                                      
[104] "Isle of Man"                                 
[105] "India"                                       
[106] "British Indian Ocean Territory"              
[107] "Iraq"                                        
[108] "Iran"                                        
[109] "Iceland"                                     
[110] "Italy"                                       
[111] "Jersey"                                      
[112] "Jamaica"                                     
[113] "Jordan"                                      
[114] "Japan"                                       
[115] "Kenya"                                       
[116] "Kyrgyzstan"                                  
[117] "Cambodia"                                    
[118] "Kiribati"                                    
[119] "Comoros"                                     
[120] "Saint Kitts and Nevis"                       
[121] "North Korea"                                 
[122] "South Korea"                                 
[123] "Kuwait"                                      
[124] "Cayman Islands"                              
[125] "Kazakhstan"                                  
[126] "Laos"                                        
[127] "Lebanon"                                     
[128] "Saint Lucia"                                 
[129] "Liechtenstein"                               
[130] "Sri Lanka"                                   
[131] "Liberia"                                     
[132] "Lesotho"                                     
[133] "Lithuania"                                   
[134] "Luxembourg"                                  
[135] "Latvia"                                      
[136] "Libya"                                       
[137] "Morocco"                                     
[138] "Monaco"                                      
[139] "Moldova"                                     
[140] "Montenegro"                                  
[141] "Saint Martin"                                
[142] "Madagascar"                                  
[143] "Marshall Islands"                            
[144] "Macedonia"                                   
[145] "Mali"                                        
[146] "Myanmar [Burma]"                             
[147] "Mongolia"                                    
[148] "Macao"                                       
[149] "Northern Mariana Islands"                    
[150] "Martinique"                                  
[151] "Mauritania"                                  
[152] "Montserrat"                                  
[153] "Malta"                                       
[154] "Mauritius"                                   
[155] "Maldives"                                    
[156] "Malawi"                                      
[157] "Mexico"                                      
[158] "Malaysia"                                    
[159] "Mozambique"                                  
[160] "Namibia"                                     
[161] "New Caledonia"                               
[162] "Niger"                                       
[163] "Norfolk Island"                              
[164] "Nigeria"                                     
[165] "Nicaragua"                                   
[166] "Netherlands"                                 
[167] "Norway"                                      
[168] "Nepal"                                       
[169] "Nauru"                                       
[170] "Niue"                                        
[171] "New Zealand"                                 
[172] "Oman"                                        
[173] "Panama"                                      
[174] "Peru"                                        
[175] "French Polynesia"                            
[176] "Papua New Guinea"                            
[177] "Philippines"                                 
[178] "Pakistan"                                    
[179] "Poland"                                      
[180] "Saint Pierre and Miquelon"                   
[181] "Pitcairn Islands"                            
[182] "Puerto Rico"                                 
[183] "Palestine"                                   
[184] "Portugal"                                    
[185] "Palau"                                       
[186] "Paraguay"                                    
[187] "Qatar"                                       
[188] "Réunion"                                     
[189] "Romania"                                     
[190] "Serbia"                                      
[191] "Russia"                                      
[192] "Rwanda"                                      
[193] "Saudi Arabia"                                
[194] "Solomon Islands"                             
[195] "Seychelles"                                  
[196] "Sudan"                                       
[197] "Sweden"                                      
[198] "Singapore"                                   
[199] "Saint Helena"                                
[200] "Slovenia"                                    
[201] "Svalbard and Jan Mayen"                      
[202] "Slovakia"                                    
[203] "Sierra Leone"                                
[204] "San Marino"                                  
[205] "Senegal"                                     
[206] "Somalia"                                     
[207] "Suriname"                                    
[208] "South Sudan"                                 
[209] "São Tomé and Príncipe"                       
[210] "El Salvador"                                 
[211] "Sint Maarten"                                
[212] "Syria"                                       
[213] "Swaziland"                                   
[214] "Turks and Caicos Islands"                    
[215] "Chad"                                        
[216] "French Southern Territories"                 
[217] "Togo"                                        
[218] "Thailand"                                    
[219] "Tajikistan"                                  
[220] "Tokelau"                                     
[221] "East Timor"                                  
[222] "Turkmenistan"                                
[223] "Tunisia"                                     
[224] "Tonga"                                       
[225] "Turkey"                                      
[226] "Trinidad and Tobago"                         
[227] "Tuvalu"                                      
[228] "Taiwan"                                      
[229] "Tanzania"                                    
[230] "Ukraine"                                     
[231] "Uganda"                                      
[232] "U.S. Minor Outlying Islands"                 
[233] "United States"                               
[234] "Uruguay"                                     
[235] "Uzbekistan"                                  
[236] "Vatican City"                                
[237] "Saint Vincent and the Grenadines"            
[238] "Venezuela"                                   
[239] "British Virgin Islands"                      
[240] "U.S. Virgin Islands"                         
[241] "Vietnam"                                     
[242] "Vanuatu"                                     
[243] "Wallis and Futuna"                           
[244] "Samoa"                                       
[245] "Kosovo"                                      
[246] "Yemen"                                       
[247] "Mayotte"                                     
[248] "South Africa"                                
[249] "Zambia"                                      
[250] "Zimbabwe"                                    
head(countries, 5)  # if "United Arab Emirates" has lost its internal spaces, the cleanup was too aggressive. Let's fix that ...
[1] "Andorra"              "United Arab Emirates" "Afghanistan"         
[4] "Antigua and Barbuda"  "Anguilla"            
countries = the_full_html %>%
  html_elements( ".country-name" ) %>%
  html_text()

# Remove the newline characters and the leading/trailing spaces from countries.
# Do NOT remove the spaces between words (e.g. "United Arab Emirates").
countries = gsub("\\n", "", countries)
countries = gsub("^ *", "", countries)  # get rid of spaces before the text
countries = gsub(" *$", "", countries)  # get rid of spaces after the text
countries
  [1] "Andorra"                                     
  [2] "United Arab Emirates"                        
  [3] "Afghanistan"                                 
  [4] "Antigua and Barbuda"                         
  [5] "Anguilla"                                    
  [6] "Albania"                                     
  [7] "Armenia"                                     
  [8] "Angola"                                      
  [9] "Antarctica"                                  
 [10] "Argentina"                                   
 [11] "American Samoa"                              
 [12] "Austria"                                     
 [13] "Australia"                                   
 [14] "Aruba"                                       
 [15] "Åland"                                       
 [16] "Azerbaijan"                                  
 [17] "Bosnia and Herzegovina"                      
 [18] "Barbados"                                    
 [19] "Bangladesh"                                  
 [20] "Belgium"                                     
 [21] "Burkina Faso"                                
 [22] "Bulgaria"                                    
 [23] "Bahrain"                                     
 [24] "Burundi"                                     
 [25] "Benin"                                       
 [26] "Saint Barthélemy"                            
 [27] "Bermuda"                                     
 [28] "Brunei"                                      
 [29] "Bolivia"                                     
 [30] "Bonaire"                                     
 [31] "Brazil"                                      
 [32] "Bahamas"                                     
 [33] "Bhutan"                                      
 [34] "Bouvet Island"                               
 [35] "Botswana"                                    
 [36] "Belarus"                                     
 [37] "Belize"                                      
 [38] "Canada"                                      
 [39] "Cocos [Keeling] Islands"                     
 [40] "Democratic Republic of the Congo"            
 [41] "Central African Republic"                    
 [42] "Republic of the Congo"                       
 [43] "Switzerland"                                 
 [44] "Ivory Coast"                                 
 [45] "Cook Islands"                                
 [46] "Chile"                                       
 [47] "Cameroon"                                    
 [48] "China"                                       
 [49] "Colombia"                                    
 [50] "Costa Rica"                                  
 [51] "Cuba"                                        
 [52] "Cape Verde"                                  
 [53] "Curacao"                                     
 [54] "Christmas Island"                            
 [55] "Cyprus"                                      
 [56] "Czech Republic"                              
 [57] "Germany"                                     
 [58] "Djibouti"                                    
 [59] "Denmark"                                     
 [60] "Dominica"                                    
 [61] "Dominican Republic"                          
 [62] "Algeria"                                     
 [63] "Ecuador"                                     
 [64] "Estonia"                                     
 [65] "Egypt"                                       
 [66] "Western Sahara"                              
 [67] "Eritrea"                                     
 [68] "Spain"                                       
 [69] "Ethiopia"                                    
 [70] "Finland"                                     
 [71] "Fiji"                                        
 [72] "Falkland Islands"                            
 [73] "Micronesia"                                  
 [74] "Faroe Islands"                               
 [75] "France"                                      
 [76] "Gabon"                                       
 [77] "United Kingdom"                              
 [78] "Grenada"                                     
 [79] "Georgia"                                     
 [80] "French Guiana"                               
 [81] "Guernsey"                                    
 [82] "Ghana"                                       
 [83] "Gibraltar"                                   
 [84] "Greenland"                                   
 [85] "Gambia"                                      
 [86] "Guinea"                                      
 [87] "Guadeloupe"                                  
 [88] "Equatorial Guinea"                           
 [89] "Greece"                                      
 [90] "South Georgia and the South Sandwich Islands"
 [91] "Guatemala"                                   
 [92] "Guam"                                        
 [93] "Guinea-Bissau"                               
 [94] "Guyana"                                      
 [95] "Hong Kong"                                   
 [96] "Heard Island and McDonald Islands"           
 [97] "Honduras"                                    
 [98] "Croatia"                                     
 [99] "Haiti"                                       
[100] "Hungary"                                     
[101] "Indonesia"                                   
[102] "Ireland"                                     
[103] "Israel"                                      
[104] "Isle of Man"                                 
[105] "India"                                       
[106] "British Indian Ocean Territory"              
[107] "Iraq"                                        
[108] "Iran"                                        
[109] "Iceland"                                     
[110] "Italy"                                       
[111] "Jersey"                                      
[112] "Jamaica"                                     
[113] "Jordan"                                      
[114] "Japan"                                       
[115] "Kenya"                                       
[116] "Kyrgyzstan"                                  
[117] "Cambodia"                                    
[118] "Kiribati"                                    
[119] "Comoros"                                     
[120] "Saint Kitts and Nevis"                       
[121] "North Korea"                                 
[122] "South Korea"                                 
[123] "Kuwait"                                      
[124] "Cayman Islands"                              
[125] "Kazakhstan"                                  
[126] "Laos"                                        
[127] "Lebanon"                                     
[128] "Saint Lucia"                                 
[129] "Liechtenstein"                               
[130] "Sri Lanka"                                   
[131] "Liberia"                                     
[132] "Lesotho"                                     
[133] "Lithuania"                                   
[134] "Luxembourg"                                  
[135] "Latvia"                                      
[136] "Libya"                                       
[137] "Morocco"                                     
[138] "Monaco"                                      
[139] "Moldova"                                     
[140] "Montenegro"                                  
[141] "Saint Martin"                                
[142] "Madagascar"                                  
[143] "Marshall Islands"                            
[144] "Macedonia"                                   
[145] "Mali"                                        
[146] "Myanmar [Burma]"                             
[147] "Mongolia"                                    
[148] "Macao"                                       
[149] "Northern Mariana Islands"                    
[150] "Martinique"                                  
[151] "Mauritania"                                  
[152] "Montserrat"                                  
[153] "Malta"                                       
[154] "Mauritius"                                   
[155] "Maldives"                                    
[156] "Malawi"                                      
[157] "Mexico"                                      
[158] "Malaysia"                                    
[159] "Mozambique"                                  
[160] "Namibia"                                     
[161] "New Caledonia"                               
[162] "Niger"                                       
[163] "Norfolk Island"                              
[164] "Nigeria"                                     
[165] "Nicaragua"                                   
[166] "Netherlands"                                 
[167] "Norway"                                      
[168] "Nepal"                                       
[169] "Nauru"                                       
[170] "Niue"                                        
[171] "New Zealand"                                 
[172] "Oman"                                        
[173] "Panama"                                      
[174] "Peru"                                        
[175] "French Polynesia"                            
[176] "Papua New Guinea"                            
[177] "Philippines"                                 
[178] "Pakistan"                                    
[179] "Poland"                                      
[180] "Saint Pierre and Miquelon"                   
[181] "Pitcairn Islands"                            
[182] "Puerto Rico"                                 
[183] "Palestine"                                   
[184] "Portugal"                                    
[185] "Palau"                                       
[186] "Paraguay"                                    
[187] "Qatar"                                       
[188] "Réunion"                                     
[189] "Romania"                                     
[190] "Serbia"                                      
[191] "Russia"                                      
[192] "Rwanda"                                      
[193] "Saudi Arabia"                                
[194] "Solomon Islands"                             
[195] "Seychelles"                                  
[196] "Sudan"                                       
[197] "Sweden"                                      
[198] "Singapore"                                   
[199] "Saint Helena"                                
[200] "Slovenia"                                    
[201] "Svalbard and Jan Mayen"                      
[202] "Slovakia"                                    
[203] "Sierra Leone"                                
[204] "San Marino"                                  
[205] "Senegal"                                     
[206] "Somalia"                                     
[207] "Suriname"                                    
[208] "South Sudan"                                 
[209] "São Tomé and Príncipe"                       
[210] "El Salvador"                                 
[211] "Sint Maarten"                                
[212] "Syria"                                       
[213] "Swaziland"                                   
[214] "Turks and Caicos Islands"                    
[215] "Chad"                                        
[216] "French Southern Territories"                 
[217] "Togo"                                        
[218] "Thailand"                                    
[219] "Tajikistan"                                  
[220] "Tokelau"                                     
[221] "East Timor"                                  
[222] "Turkmenistan"                                
[223] "Tunisia"                                     
[224] "Tonga"                                       
[225] "Turkey"                                      
[226] "Trinidad and Tobago"                         
[227] "Tuvalu"                                      
[228] "Taiwan"                                      
[229] "Tanzania"                                    
[230] "Ukraine"                                     
[231] "Uganda"                                      
[232] "U.S. Minor Outlying Islands"                 
[233] "United States"                               
[234] "Uruguay"                                     
[235] "Uzbekistan"                                  
[236] "Vatican City"                                
[237] "Saint Vincent and the Grenadines"            
[238] "Venezuela"                                   
[239] "British Virgin Islands"                      
[240] "U.S. Virgin Islands"                         
[241] "Vietnam"                                     
[242] "Vanuatu"                                     
[243] "Wallis and Futuna"                           
[244] "Samoa"                                       
[245] "Kosovo"                                      
[246] "Yemen"                                       
[247] "Mayotte"                                     
[248] "South Africa"                                
[249] "Zambia"                                      
[250] "Zimbabwe"                                    
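
The three gsub() calls used above to clean countries can also be written as a single call. As a hedged alternative, base R's trimws() strips whitespace (spaces, tabs, newlines and carriage returns by default) from both ends of each string and leaves the spaces between words alone. The raw_names vector below is a made-up stand-in for raw html_text() output, not data scraped from the page.

```r
# Hypothetical sample that mimics raw html_text() output (not real scraped data)
raw_names = c("\n            Andorra\n        ",
              "\n            United Arab Emirates\n        ")

# trimws() removes whitespace at the start and end of each string only,
# so the spaces inside "United Arab Emirates" are kept
clean_names = trimws(raw_names)
clean_names
```

One difference from the gsub() version: gsub("\\n", "", ...) deletes every newline, including any in the middle of a string, while trimws() only touches the two ends.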
capitals
  [1] "Andorra la Vella"    "Abu Dhabi"           "Kabul"              
  [4] "St. John's"          "The Valley"          "Tirana"             
  [7] "Yerevan"             "Luanda"              "None"               
 [10] "Buenos Aires"        "Pago Pago"           "Vienna"             
 [13] "Canberra"            "Oranjestad"          "Mariehamn"          
 [16] "Baku"                "Sarajevo"            "Bridgetown"         
 [19] "Dhaka"               "Brussels"            "Ouagadougou"        
 [22] "Sofia"               "Manama"              "Bujumbura"          
 [25] "Porto-Novo"          "Gustavia"            "Hamilton"           
 [28] "Bandar Seri Begawan" "Sucre"               "Kralendijk"         
 [31] "Brasília"            "Nassau"              "Thimphu"            
 [34] "None"                "Gaborone"            "Minsk"              
 [37] "Belmopan"            "Ottawa"              "West Island"        
 [40] "Kinshasa"            "Bangui"              "Brazzaville"        
 [43] "Bern"                "Yamoussoukro"        "Avarua"             
 [46] "Santiago"            "Yaoundé"             "Beijing"            
 [49] "Bogotá"              "San José"            "Havana"             
 [52] "Praia"               "Willemstad"          "Flying Fish Cove"   
 [55] "Nicosia"             "Prague"              "Berlin"             
 [58] "Djibouti"            "Copenhagen"          "Roseau"             
 [61] "Santo Domingo"       "Algiers"             "Quito"              
 [64] "Tallinn"             "Cairo"               "Laâyoune / El Aaiún"
 [67] "Asmara"              "Madrid"              "Addis Ababa"        
 [70] "Helsinki"            "Suva"                "Stanley"            
 [73] "Palikir"             "Tórshavn"            "Paris"              
 [76] "Libreville"          "London"              "St. George's"       
 [79] "Tbilisi"             "Cayenne"             "St Peter Port"      
 [82] "Accra"               "Gibraltar"           "Nuuk"               
 [85] "Bathurst"            "Conakry"             "Basse-Terre"        
 [88] "Malabo"              "Athens"              "Grytviken"          
 [91] "Guatemala City"      "Hagåtña"             "Bissau"             
 [94] "Georgetown"          "Hong Kong"           "None"               
 [97] "Tegucigalpa"         "Zagreb"              "Port-au-Prince"     
[100] "Budapest"            "Jakarta"             "Dublin"             
[103] "None"                "Douglas"             "New Delhi"          
[106] "None"                "Baghdad"             "Tehran"             
[109] "Reykjavik"           "Rome"                "Saint Helier"       
[112] "Kingston"            "Amman"               "Tokyo"              
[115] "Nairobi"             "Bishkek"             "Phnom Penh"         
[118] "Tarawa"              "Moroni"              "Basseterre"         
[121] "Pyongyang"           "Seoul"               "Kuwait City"        
[124] "George Town"         "Astana"              "Vientiane"          
[127] "Beirut"              "Castries"            "Vaduz"              
[130] "Colombo"             "Monrovia"            "Maseru"             
[133] "Vilnius"             "Luxembourg"          "Riga"               
[136] "Tripoli"             "Rabat"               "Monaco"             
[139] "Chişinău"            "Podgorica"           "Marigot"            
[142] "Antananarivo"        "Majuro"              "Skopje"             
[145] "Bamako"              "Naypyitaw"           "Ulan Bator"         
[148] "Macao"               "Saipan"              "Fort-de-France"     
[151] "Nouakchott"          "Plymouth"            "Valletta"           
[154] "Port Louis"          "Malé"                "Lilongwe"           
[157] "Mexico City"         "Kuala Lumpur"        "Maputo"             
[160] "Windhoek"            "Noumea"              "Niamey"             
[163] "Kingston"            "Abuja"               "Managua"            
[166] "Amsterdam"           "Oslo"                "Kathmandu"          
[169] "Yaren"               "Alofi"               "Wellington"         
[172] "Muscat"              "Panama City"         "Lima"               
[175] "Papeete"             "Port Moresby"        "Manila"             
[178] "Islamabad"           "Warsaw"              "Saint-Pierre"       
[181] "Adamstown"           "San Juan"            "None"               
[184] "Lisbon"              "Melekeok"            "Asunción"           
[187] "Doha"                "Saint-Denis"         "Bucharest"          
[190] "Belgrade"            "Moscow"              "Kigali"             
[193] "Riyadh"              "Honiara"             "Victoria"           
[196] "Khartoum"            "Stockholm"           "Singapore"          
[199] "Jamestown"           "Ljubljana"           "Longyearbyen"       
[202] "Bratislava"          "Freetown"            "San Marino"         
[205] "Dakar"               "Mogadishu"           "Paramaribo"         
[208] "Juba"                "São Tomé"            "San Salvador"       
[211] "Philipsburg"         "Damascus"            "Mbabane"            
[214] "Cockburn Town"       "N'Djamena"           "Port-aux-Français"  
[217] "Lomé"                "Bangkok"             "Dushanbe"           
[220] "None"                "Dili"                "Ashgabat"           
[223] "Tunis"               "Nuku'alofa"          "Ankara"             
[226] "Port of Spain"       "Funafuti"            "Taipei"             
[229] "Dodoma"              "Kiev"                "Kampala"            
[232] "None"                "Washington"          "Montevideo"         
[235] "Tashkent"            "Vatican City"        "Kingstown"          
[238] "Caracas"             "Road Town"           "Charlotte Amalie"   
[241] "Hanoi"               "Port Vila"           "Mata-Utu"           
[244] "Apia"                "Pristina"            "Sanaa"              
[247] "Mamoudzou"           "Pretoria"            "Lusaka"             
[250] "Harare"             
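
Several entries in capitals are the placeholder text "None" (for example Antarctica, entry 9). A minimal sketch of one way to handle that, assuming you would rather treat those entries as missing values: replace "None" with NA. The capitals_sample vector is a small illustrative subset of the values shown above, not the full scraped vector.

```r
# Small illustrative subset of the capitals shown above
capitals_sample = c("Andorra la Vella", "None", "Buenos Aires")

# Replace the placeholder "None" with NA so R treats it as a missing value
capitals_sample[capitals_sample == "None"] = NA
capitals_sample

# is.na() now flags the entries that have no capital
is.na(capitals_sample)
```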
population
  [1] "84000"      "4975593"    "29121286"   "86754"      "13254"     
  [6] "2986952"    "2968000"    "13068161"   "0"          "41343201"  
 [11] "57881"      "8205000"    "21515754"   "71566"      "26711"     
 [16] "8303512"    "4590000"    "285653"     "156118464"  "10403000"  
 [21] "16241811"   "7148785"    "738004"     "9863117"    "9056010"   
 [26] "8450"       "65365"      "395027"     "9947418"    "18012"     
 [31] "201103330"  "301790"     "699847"     "0"          "2029307"   
 [36] "9685000"    "314522"     "33679000"   "628"        "70916439"  
 [41] "4844927"    "3039126"    "7581000"    "21058798"   "21388"     
 [46] "16746491"   "19294149"   "1330044000" "47790000"   "4516220"   
 [51] "11423000"   "508659"     "141766"     "1500"       "1102677"   
 [56] "10476000"   "81802257"   "740528"     "5484000"    "72813"     
 [61] "9823821"    "34586184"   "14790608"   "1291170"    "80471869"  
 [66] "273008"     "5792984"    "46505963"   "88013491"   "5244000"   
 [71] "875983"     "2638"       "107708"     "48228"      "64768389"  
 [76] "1545255"    "62348447"   "107818"     "4630000"    "195506"    
 [81] "65228"      "24339838"   "27884"      "56375"      "1593256"   
 [86] "10324025"   "443000"     "1014999"    "11000000"   "30"        
 [91] "13550440"   "159358"     "1565126"    "748486"     "6898686"   
 [96] "0"          "7989415"    "4491000"    "9648924"    "9982000"   
[101] "242968342"  "4622917"    "7353985"    "75049"      "1173108018"
[106] "4000"       "29671605"   "76923300"   "308910"     "60340328"  
[111] "90812"      "2847232"    "6407085"    "127288000"  "40046566"  
[116] "5776500"    "14453680"   "92533"      "773407"     "51134"     
[121] "22912177"   "48422644"   "2789132"    "44270"      "15340000"  
[126] "6368162"    "4125247"    "160922"     "35000"      "21513990"  
[131] "3685076"    "1919552"    "2944459"    "497538"     "2217969"   
[136] "6461454"    "31627428"   "32965"      "4324000"    "666730"    
[141] "35925"      "21281844"   "65859"      "2062294"    "13796354"  
[146] "53414374"   "3086918"    "449198"     "53883"      "432900"    
[151] "3205060"    "9341"       "403000"     "1294104"    "395650"    
[156] "15447500"   "112468855"  "28274729"   "22061451"   "2128471"   
[161] "216494"     "15878271"   "1828"       "154000000"  "5995928"   
[166] "16645000"   "5009150"    "28951852"   "10065"      "2166"      
[171] "4252277"    "2967717"    "3410676"    "29907003"   "270485"    
[176] "6064515"    "99900177"   "184404791"  "38500000"   "7012"      
[181] "46"         "3916632"    "3800000"    "10676000"   "19907"     
[186] "6375830"    "840926"     "776948"     "21959278"   "7344847"   
[191] "140702000"  "11055976"   "25731776"   "559198"     "88340"     
[196] "35000000"   "9828655"    "4701069"    "7460"       "2007000"   
[201] "2550"       "5455000"    "5245695"    "31477"      "12323252"  
[206] "10112453"   "492829"     "8260490"    "175808"     "6052064"   
[211] "37429"      "22198110"   "1354051"    "20556"      "10543464"  
[216] "140"        "6587239"    "67089500"   "7487489"    "1466"      
[221] "1154625"    "4940916"    "10589025"   "122580"     "77804122"  
[226] "1228691"    "10472"      "22894384"   "41892895"   "45415596"  
[231] "33398682"   "0"          "310232863"  "3477000"    "27865738"  
[236] "921"        "104217"     "27223228"   "21730"      "108708"    
[241] "89571130"   "221552"     "16025"      "192001"     "1800000"   
[246] "23495361"   "159042"     "49000000"   "13460305"   "11651858"  
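
Note the quotes in the output above: population is a character vector, so arithmetic such as sum() will not work on it directly. A hedged sketch of the conversion, using a few values copied from the output above; pop_sample and pop_num are names introduced here only for illustration.

```r
# A few population values copied from the output above (still character strings)
pop_sample = c("84000", "4975593", "1330044000")

# as.numeric() returns doubles; as.integer() is avoided here because a total
# over all countries would exceed .Machine$integer.max (about 2.1 billion)
pop_num = as.numeric(pop_sample)
class(pop_num)
sum(pop_num)
```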
area
  [1] "468.0"     "82880.0"   "647500.0"  "443.0"     "102.0"     "28748.0"  
  [7] "29800.0"   "1246700.0" "1.4E7"     "2766890.0" "199.0"     "83858.0"  
 [13] "7686850.0" "193.0"     "1580.0"    "86600.0"   "51129.0"   "431.0"    
 [19] "144000.0"  "30510.0"   "274200.0"  "110910.0"  "665.0"     "27830.0"  
 [25] "112620.0"  "21.0"      "53.0"      "5770.0"    "1098580.0" "328.0"    
 [31] "8511965.0" "13940.0"   "47000.0"   "49.0"      "600370.0"  "207600.0" 
 [37] "22966.0"   "9984670.0" "14.0"      "2345410.0" "622984.0"  "342000.0" 
 [43] "41290.0"   "322460.0"  "240.0"     "756950.0"  "475440.0"  "9596960.0"
 [49] "1138910.0" "51100.0"   "110860.0"  "4033.0"    "444.0"     "135.0"    
 [55] "9250.0"    "78866.0"   "357021.0"  "23000.0"   "43094.0"   "754.0"    
 [61] "48730.0"   "2381740.0" "283560.0"  "45226.0"   "1001450.0" "266000.0" 
 [67] "121320.0"  "504782.0"  "1127127.0" "337030.0"  "18270.0"   "12173.0"  
 [73] "702.0"     "1399.0"    "547030.0"  "267667.0"  "244820.0"  "344.0"    
 [79] "69700.0"   "91000.0"   "78.0"      "239460.0"  "6.5"       "2166086.0"
 [85] "11300.0"   "245857.0"  "1780.0"    "28051.0"   "131940.0"  "3903.0"   
 [91] "108890.0"  "549.0"     "36120.0"   "214970.0"  "1092.0"    "412.0"    
 [97] "112090.0"  "56542.0"   "27750.0"   "93030.0"   "1919440.0" "70280.0"  
[103] "20770.0"   "572.0"     "3287590.0" "60.0"      "437072.0"  "1648000.0"
[109] "103000.0"  "301230.0"  "116.0"     "10991.0"   "92300.0"   "377835.0" 
[115] "582650.0"  "198500.0"  "181040.0"  "811.0"     "2170.0"    "261.0"    
[121] "120540.0"  "98480.0"   "17820.0"   "262.0"     "2717300.0" "236800.0" 
[127] "10400.0"   "616.0"     "160.0"     "65610.0"   "111370.0"  "30355.0"  
[133] "65200.0"   "2586.0"    "64589.0"   "1759540.0" "446550.0"  "1.95"     
[139] "33843.0"   "14026.0"   "53.0"      "587040.0"  "181.3"     "25333.0"  
[145] "1240000.0" "678500.0"  "1565000.0" "254.0"     "477.0"     "1100.0"   
[151] "1030700.0" "102.0"     "316.0"     "2040.0"    "300.0"     "118480.0" 
[157] "1972550.0" "329750.0"  "801590.0"  "825418.0"  "19060.0"   "1267000.0"
[163] "34.6"      "923768.0"  "129494.0"  "41526.0"   "324220.0"  "140800.0" 
[169] "21.0"      "260.0"     "268680.0"  "212460.0"  "78200.0"   "1285220.0"
[175] "4167.0"    "462840.0"  "300000.0"  "803940.0"  "312685.0"  "242.0"    
[181] "47.0"      "9104.0"    "5970.0"    "92391.0"   "458.0"     "406750.0" 
[187] "11437.0"   "2517.0"    "237500.0"  "88361.0"   "1.71E7"    "26338.0"  
[193] "1960582.0" "28450.0"   "455.0"     "1861484.0" "449964.0"  "692.7"    
[199] "410.0"     "20273.0"   "62049.0"   "48845.0"   "71740.0"   "61.2"     
[205] "196190.0"  "637657.0"  "163270.0"  "644329.0"  "1001.0"    "21040.0"  
[211] "21.0"      "185180.0"  "17363.0"   "430.0"     "1284000.0" "7829.0"   
[217] "56785.0"   "514000.0"  "143100.0"  "10.0"      "15007.0"   "488100.0" 
[223] "163610.0"  "748.0"     "780580.0"  "5128.0"    "26.0"      "35980.0"  
[229] "945087.0"  "603700.0"  "236040.0"  "0.0"       "9629091.0" "176220.0" 
[235] "447400.0"  "0.44"      "389.0"     "912050.0"  "153.0"     "352.0"    
[241] "329560.0"  "12200.0"   "274.0"     "2944.0"    "10908.0"   "527970.0" 
[247] "374.0"     "1219912.0" "752614.0"  "390580.0" 
df = data.frame ( country=countries,
                  capital = capitals,
                  pop = population,
                  area = area,
                  stringsAsFactors = FALSE
                  )




head(df, 5)
               country          capital      pop     area
1              Andorra Andorra la Vella    84000    468.0
2 United Arab Emirates        Abu Dhabi  4975593  82880.0
3          Afghanistan            Kabul 29121286 647500.0
4  Antigua and Barbuda       St. John's    86754    443.0
5             Anguilla       The Valley    13254    102.0
nrow(df)
[1] 250
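
The pop and area columns above are still character vectors (note the quotes in the printed output), so they cannot be used in arithmetic yet. The following is a minimal sketch of converting them with base R's as.numeric(). The sample_df data frame is a hand-typed stand-in for the first three rows of df, not freshly scraped data, and the density column is a hypothetical addition for illustration only.

```r
# Hand-typed stand-in for the first rows of df (values copied from head(df) above)
sample_df = data.frame(country = c("Andorra", "United Arab Emirates", "Afghanistan"),
                       pop     = c("84000", "4975593", "29121286"),
                       area    = c("468.0", "82880.0", "647500.0"),
                       stringsAsFactors = FALSE)

# Convert the text columns to numbers
sample_df$pop  = as.numeric(sample_df$pop)
sample_df$area = as.numeric(sample_df$area)

# Values written in scientific notation, such as "1.4E7" in the area
# vector above, are also understood by as.numeric
sci = as.numeric("1.4E7")

# Now arithmetic works, e.g. people per unit of area (hypothetical column)
sample_df$density = sample_df$pop / sample_df$area

str(sample_df)
```

If a value cannot be interpreted as a number, as.numeric() returns NA with a warning, so it is worth checking the result with anyNA() after converting real scraped data.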

26.10 html_text2() removes leading/trailing whitespace

#####################################################
# UPDATE:
#
# Recent releases of the rvest package (1.0.0 and later) include
# a function named html_text2. This function automatically
# removes leading and trailing whitespace.
# It's still important to understand how to modify the
# data with gsub, for situations in which you need to change
# something other than leading and trailing whitespace.
#####################################################


url = "https://scrapethissite.com/pages/simple/"
cssSelector = ".country-name"

whole_html_page = read_html(url)
country_name_html =  html_elements( whole_html_page, cssSelector )

# Using html_text2 instead of html_text.
# html_text2 automatically removes leading and trailing whitespace
# from the text.
just_the_text =   html_text2(country_name_html)

just_the_text  
  [1] "Andorra"                                     
  [2] "United Arab Emirates"                        
  [3] "Afghanistan"                                 
  [4] "Antigua and Barbuda"                         
  [5] "Anguilla"                                    
  [6] "Albania"                                     
  [7] "Armenia"                                     
  [8] "Angola"                                      
  [9] "Antarctica"                                  
 [10] "Argentina"                                   
 [11] "American Samoa"                              
 [12] "Austria"                                     
 [13] "Australia"                                   
 [14] "Aruba"                                       
 [15] "Åland"                                       
 [16] "Azerbaijan"                                  
 [17] "Bosnia and Herzegovina"                      
 [18] "Barbados"                                    
 [19] "Bangladesh"                                  
 [20] "Belgium"                                     
 [21] "Burkina Faso"                                
 [22] "Bulgaria"                                    
 [23] "Bahrain"                                     
 [24] "Burundi"                                     
 [25] "Benin"                                       
 [26] "Saint Barthélemy"                            
 [27] "Bermuda"                                     
 [28] "Brunei"                                      
 [29] "Bolivia"                                     
 [30] "Bonaire"                                     
 [31] "Brazil"                                      
 [32] "Bahamas"                                     
 [33] "Bhutan"                                      
 [34] "Bouvet Island"                               
 [35] "Botswana"                                    
 [36] "Belarus"                                     
 [37] "Belize"                                      
 [38] "Canada"                                      
 [39] "Cocos [Keeling] Islands"                     
 [40] "Democratic Republic of the Congo"            
 [41] "Central African Republic"                    
 [42] "Republic of the Congo"                       
 [43] "Switzerland"                                 
 [44] "Ivory Coast"                                 
 [45] "Cook Islands"                                
 [46] "Chile"                                       
 [47] "Cameroon"                                    
 [48] "China"                                       
 [49] "Colombia"                                    
 [50] "Costa Rica"                                  
 [51] "Cuba"                                        
 [52] "Cape Verde"                                  
 [53] "Curacao"                                     
 [54] "Christmas Island"                            
 [55] "Cyprus"                                      
 [56] "Czech Republic"                              
 [57] "Germany"                                     
 [58] "Djibouti"                                    
 [59] "Denmark"                                     
 [60] "Dominica"                                    
 [61] "Dominican Republic"                          
 [62] "Algeria"                                     
 [63] "Ecuador"                                     
 [64] "Estonia"                                     
 [65] "Egypt"                                       
 [66] "Western Sahara"                              
 [67] "Eritrea"                                     
 [68] "Spain"                                       
 [69] "Ethiopia"                                    
 [70] "Finland"                                     
 [71] "Fiji"                                        
 [72] "Falkland Islands"                            
 [73] "Micronesia"                                  
 [74] "Faroe Islands"                               
 [75] "France"                                      
 [76] "Gabon"                                       
 [77] "United Kingdom"                              
 [78] "Grenada"                                     
 [79] "Georgia"                                     
 [80] "French Guiana"                               
 [81] "Guernsey"                                    
 [82] "Ghana"                                       
 [83] "Gibraltar"                                   
 [84] "Greenland"                                   
 [85] "Gambia"                                      
 [86] "Guinea"                                      
 [87] "Guadeloupe"                                  
 [88] "Equatorial Guinea"                           
 [89] "Greece"                                      
 [90] "South Georgia and the South Sandwich Islands"
 [91] "Guatemala"                                   
 [92] "Guam"                                        
 [93] "Guinea-Bissau"                               
 [94] "Guyana"                                      
 [95] "Hong Kong"                                   
 [96] "Heard Island and McDonald Islands"           
 [97] "Honduras"                                    
 [98] "Croatia"                                     
 [99] "Haiti"                                       
[100] "Hungary"                                     
[101] "Indonesia"                                   
[102] "Ireland"                                     
[103] "Israel"                                      
[104] "Isle of Man"                                 
[105] "India"                                       
[106] "British Indian Ocean Territory"              
[107] "Iraq"                                        
[108] "Iran"                                        
[109] "Iceland"                                     
[110] "Italy"                                       
[111] "Jersey"                                      
[112] "Jamaica"                                     
[113] "Jordan"                                      
[114] "Japan"                                       
[115] "Kenya"                                       
[116] "Kyrgyzstan"                                  
[117] "Cambodia"                                    
[118] "Kiribati"                                    
[119] "Comoros"                                     
[120] "Saint Kitts and Nevis"                       
[121] "North Korea"                                 
[122] "South Korea"                                 
[123] "Kuwait"                                      
[124] "Cayman Islands"                              
[125] "Kazakhstan"                                  
[126] "Laos"                                        
[127] "Lebanon"                                     
[128] "Saint Lucia"                                 
[129] "Liechtenstein"                               
[130] "Sri Lanka"                                   
[131] "Liberia"                                     
[132] "Lesotho"                                     
[133] "Lithuania"                                   
[134] "Luxembourg"                                  
[135] "Latvia"                                      
[136] "Libya"                                       
[137] "Morocco"                                     
[138] "Monaco"                                      
[139] "Moldova"                                     
[140] "Montenegro"                                  
[141] "Saint Martin"                                
[142] "Madagascar"                                  
[143] "Marshall Islands"                            
[144] "Macedonia"                                   
[145] "Mali"                                        
[146] "Myanmar [Burma]"                             
[147] "Mongolia"                                    
[148] "Macao"                                       
[149] "Northern Mariana Islands"                    
[150] "Martinique"                                  
[151] "Mauritania"                                  
[152] "Montserrat"                                  
[153] "Malta"                                       
[154] "Mauritius"                                   
[155] "Maldives"                                    
[156] "Malawi"                                      
[157] "Mexico"                                      
[158] "Malaysia"                                    
[159] "Mozambique"                                  
[160] "Namibia"                                     
[161] "New Caledonia"                               
[162] "Niger"                                       
[163] "Norfolk Island"                              
[164] "Nigeria"                                     
[165] "Nicaragua"                                   
[166] "Netherlands"                                 
[167] "Norway"                                      
[168] "Nepal"                                       
[169] "Nauru"                                       
[170] "Niue"                                        
[171] "New Zealand"                                 
[172] "Oman"                                        
[173] "Panama"                                      
[174] "Peru"                                        
[175] "French Polynesia"                            
[176] "Papua New Guinea"                            
[177] "Philippines"                                 
[178] "Pakistan"                                    
[179] "Poland"                                      
[180] "Saint Pierre and Miquelon"                   
[181] "Pitcairn Islands"                            
[182] "Puerto Rico"                                 
[183] "Palestine"                                   
[184] "Portugal"                                    
[185] "Palau"                                       
[186] "Paraguay"                                    
[187] "Qatar"                                       
[188] "Réunion"                                     
[189] "Romania"                                     
[190] "Serbia"                                      
[191] "Russia"                                      
[192] "Rwanda"                                      
[193] "Saudi Arabia"                                
[194] "Solomon Islands"                             
[195] "Seychelles"                                  
[196] "Sudan"                                       
[197] "Sweden"                                      
[198] "Singapore"                                   
[199] "Saint Helena"                                
[200] "Slovenia"                                    
[201] "Svalbard and Jan Mayen"                      
[202] "Slovakia"                                    
[203] "Sierra Leone"                                
[204] "San Marino"                                  
[205] "Senegal"                                     
[206] "Somalia"                                     
[207] "Suriname"                                    
[208] "South Sudan"                                 
[209] "São Tomé and Príncipe"                       
[210] "El Salvador"                                 
[211] "Sint Maarten"                                
[212] "Syria"                                       
[213] "Swaziland"                                   
[214] "Turks and Caicos Islands"                    
[215] "Chad"                                        
[216] "French Southern Territories"                 
[217] "Togo"                                        
[218] "Thailand"                                    
[219] "Tajikistan"                                  
[220] "Tokelau"                                     
[221] "East Timor"                                  
[222] "Turkmenistan"                                
[223] "Tunisia"                                     
[224] "Tonga"                                       
[225] "Turkey"                                      
[226] "Trinidad and Tobago"                         
[227] "Tuvalu"                                      
[228] "Taiwan"                                      
[229] "Tanzania"                                    
[230] "Ukraine"                                     
[231] "Uganda"                                      
[232] "U.S. Minor Outlying Islands"                 
[233] "United States"                               
[234] "Uruguay"                                     
[235] "Uzbekistan"                                  
[236] "Vatican City"                                
[237] "Saint Vincent and the Grenadines"            
[238] "Venezuela"                                   
[239] "British Virgin Islands"                      
[240] "U.S. Virgin Islands"                         
[241] "Vietnam"                                     
[242] "Vanuatu"                                     
[243] "Wallis and Futuna"                           
[244] "Samoa"                                       
[245] "Kosovo"                                      
[246] "Yemen"                                       
[247] "Mayotte"                                     
[248] "South Africa"                                
[249] "Zambia"                                      
[250] "Zimbabwe"                                    
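
As an illustration of a change that html_text2 does not make for you, some of the names above contain a bracketed alternative name. The sketch below uses gsub to remove the bracketed part. Whether you actually want to drop that text is an assumption made only for the sake of the example, and the some_names vector is typed in by hand from the output above.

```r
# A few names copied by hand from the output above
some_names = c("Cocos [Keeling] Islands", "Myanmar [Burma]", "Zimbabwe")

# Remove an optional space followed by anything inside square brackets.
#   " ?"      an optional space
#   "\\["     a literal [
#   "[^]]*"   any run of characters that are not ]
#   "\\]"     a literal ]
cleaned = gsub(" ?\\[[^]]*\\]", "", some_names)

cleaned
```

Names without brackets, such as "Zimbabwe", are left unchanged because the pattern does not match them.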

26.11 Create a function to do the scraping

#####################################################
#
# Package up the technique into a function
#
#####################################################

getCountryData <- function() {
 
  
  # Get all of the country data and put it in a dataframe
  
  url = "https://scrapethissite.com/pages/simple/"
  
  the_full_html = read_html(url)
  
  countries = the_full_html %>%
    html_elements( ".country-name" ) %>%
    html_text()
  
  capitals = the_full_html %>%
    html_elements( ".country-capital" ) %>%
    html_text()
  
  population = the_full_html %>%
    html_elements( ".country-population" ) %>%
    html_text()
  
  area = the_full_html %>%
    html_elements( ".country-area" ) %>%
    html_text()

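  # Remove the newline characters and spaces from countries.
  # NOTE: the second gsub removes ALL spaces, including the spaces inside
  # multi-word names ("United Arab Emirates" becomes "UnitedArabEmirates").
  # To strip only leading and trailing whitespace, use trimws(countries).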
  countries= gsub("\\n", "", countries)
  countries= gsub(" ", "" , countries)

  df = data.frame ( country=countries,
                    capital = capitals,
                    pop = population,
                    area = area,
                    stringsAsFactors = FALSE
  )
  
  return (df)
}


# Data on websites can change. We can now call this function whenever
# we want to get the latest versions of the data from the website.

mydata = getCountryData()
head(mydata)
             country          capital      pop     area
1            Andorra Andorra la Vella    84000    468.0
2 UnitedArabEmirates        Abu Dhabi  4975593  82880.0
3        Afghanistan            Kabul 29121286 647500.0
4  AntiguaandBarbuda       St. John's    86754    443.0
5           Anguilla       The Valley    13254    102.0
6            Albania           Tirana  2986952  28748.0
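
Notice that the country names in the output above have lost their internal spaces. The following sketch compares the two gsub calls used in the function with base R's trimws(), which strips whitespace only from the two ends of each string. The raw_name value is a made-up stand-in for what html_text() might return for one country; the exact amount of surrounding whitespace on the real page is an assumption.

```r
# Made-up example of a scraped name surrounded by newlines and spaces
raw_name = "\n                  United Arab Emirates\n              "

# What the function above does: every newline and every space is removed
all_removed = gsub(" ", "", gsub("\\n", "", raw_name))

# trimws removes whitespace only at the start and end of the string
ends_only = trimws(raw_name)

all_removed
ends_only
```

Replacing the two gsub lines in getCountryData with countries = trimws(countries) would keep names such as "United Arab Emirates" readable.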

26.12 Scraping multiple webpages in a loop

26.12.1 Figure out the code

######################################################
#
# Getting data from multiple web pages in a loop.
#
######################################################


# The data on the following page is only one page of multiple pages of similar
# data. 
#
#   https://scrapethissite.com/pages/forms/?page_num=1
#
# The links to the other pages appear at the bottom of the page. Clicking on
# the link for the 2nd page reveals that the 2nd page of data is at
# the following URL:
#
#   https://scrapethissite.com/pages/forms/?page_num=2
#
# These URLs differ ONLY in the page number. This allows us to scrape
# ALL of the pages by reconstructing the URL for the page we want by
# inserting the correct page number. 
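
A minimal sketch of reconstructing the URLs from the page numbers is shown here. It uses paste0, which is vectorized, so a whole vector of page numbers produces a whole vector of URLs. The variable names baseUrl, pageNumbers and urls are chosen only for this illustration.

```r
# Everything in the URL except the page number
baseUrl = "https://scrapethissite.com/pages/forms/?page_num="

# The pages we want (only the first three, to keep the example small)
pageNumbers = 1:3

# paste0 joins the pieces with no separator, once for each page number
urls = paste0(baseUrl, pageNumbers)

urls
```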


# CSS FROM INSPECT
css = "#hockey > div > table > tbody > tr:nth-child(2) > td.name"

# After looking at the full HTML code, we could see that the tbody
# in the selector should not have been there. There could be several
# reasons (beyond the scope of today's class) as to why the tbody was
# in the css selector that we got back. One common reason is that browsers
# add a tbody element to a table when they display a page, even if the
# HTML source does not contain one. By looking at the full HTML
# and understanding a little about how HTML and css selectors work, we were
# able to realize that "tbody" didn't belong and take it out.
css = "#hockey > div > table > tr:nth-child(2) > td.name"

# When we tried the above css selector we ONLY got back "Colorado Avalanche"
# but no other team name. We quickly realized that this was because
# :nth-child(2) was limiting the results. When we took off :nth-child(2),
# we got all of the team names.
css = "#hockey > div > table > tr > td.name"

# The following is what we got from SelectorGadget
css = ".name"

# According to SelectorGadget this will match all data in the table
# and it does. However, the data comes back in a single vector
# instead of a dataframe.
css = "td"

# Instead of getting all the data in a single vector it's probably better to
# get the Name column in one vector and the Year column in another vector
cssName = ".name"
cssYear = ".year"

url= "https://www.scrapethissite.com/pages/forms/?page_num=7"

fullPage = read_html(url)

fullPage %>%
  html_elements(css) %>%
  html_text2()
  [1] "Colorado Avalanche"      "1996"                   
  [3] "49"                      "24"                     
  [5] ""                        "0.598"                  
  [7] "277"                     "205"                    
  [9] "72"                      "Dallas Stars"           
 [11] "1996"                    "48"                     
 [13] "26"                      ""                       
 [15] "0.585"                   "252"                    
 [17] "198"                     "54"                     
 [19] "Detroit Red Wings"       "1996"                   
 [21] "38"                      "26"                     
 [23] ""                        "0.463"                  
 [25] "253"                     "197"                    
 [27] "56"                      "Edmonton Oilers"        
 [29] "1996"                    "36"                     
 [31] "37"                      ""                       
 [33] "0.439"                   "252"                    
 [35] "247"                     "5"                      
 [37] "Florida Panthers"        "1996"                   
 [39] "35"                      "28"                     
 [41] ""                        "0.427"                  
 [43] "221"                     "201"                    
 [45] "20"                      "Hartford Whalers"       
 [47] "1996"                    "32"                     
 [49] "39"                      ""                       
 [51] "0.39"                    "226"                    
 [53] "256"                     "-30"                    
 [55] "Los Angeles Kings"       "1996"                   
 [57] "28"                      "43"                     
 [59] ""                        "0.341"                  
 [61] "214"                     "268"                    
 [63] "-54"                     "Montreal Canadiens"     
 [65] "1996"                    "31"                     
 [67] "36"                      ""                       
 [69] "0.378"                   "249"                    
 [71] "276"                     "-27"                    
 [73] "New Jersey Devils"       "1996"                   
 [75] "45"                      "23"                     
 [77] ""                        "0.549"                  
 [79] "231"                     "182"                    
 [81] "49"                      "New York Islanders"     
 [83] "1996"                    "29"                     
 [85] "41"                      ""                       
 [87] "0.354"                   "240"                    
 [89] "250"                     "-10"                    
 [91] "New York Rangers"        "1996"                   
 [93] "38"                      "34"                     
 [95] ""                        "0.463"                  
 [97] "258"                     "231"                    
 [99] "27"                      "Ottawa Senators"        
[101] "1996"                    "31"                     
[103] "36"                      ""                       
[105] "0.378"                   "226"                    
[107] "234"                     "-8"                     
[109] "Philadelphia Flyers"     "1996"                   
[111] "45"                      "24"                     
[113] ""                        "0.549"                  
[115] "274"                     "217"                    
[117] "57"                      "Phoenix Coyotes"        
[119] "1996"                    "38"                     
[121] "37"                      ""                       
[123] "0.463"                   "240"                    
[125] "243"                     "-3"                     
[127] "Pittsburgh Penguins"     "1996"                   
[129] "38"                      "36"                     
[131] ""                        "0.463"                  
[133] "285"                     "280"                    
[135] "5"                       "San Jose Sharks"        
[137] "1996"                    "27"                     
[139] "47"                      ""                       
[141] "0.329"                   "211"                    
[143] "278"                     "-67"                    
[145] "St. Louis Blues"         "1996"                   
[147] "36"                      "35"                     
[149] ""                        "0.439"                  
[151] "236"                     "239"                    
[153] "-3"                      "Tampa Bay Lightning"    
[155] "1996"                    "32"                     
[157] "40"                      ""                       
[159] "0.39"                    "217"                    
[161] "247"                     "-30"                    
[163] "Toronto Maple Leafs"     "1996"                   
[165] "30"                      "44"                     
[167] ""                        "0.366"                  
[169] "230"                     "273"                    
[171] "-43"                     "Vancouver Canucks"      
[173] "1996"                    "35"                     
[175] "40"                      ""                       
[177] "0.427"                   "257"                    
[179] "273"                     "-16"                    
[181] "Washington Capitals"     "1996"                   
[183] "33"                      "40"                     
[185] ""                        "0.402"                  
[187] "214"                     "231"                    
[189] "-17"                     "Mighty Ducks of Anaheim"
[191] "1997"                    "26"                     
[193] "43"                      ""                       
[195] "0.317"                   "205"                    
[197] "261"                     "-56"                    
[199] "Boston Bruins"           "1997"                   
[201] "39"                      "30"                     
[203] ""                        "0.476"                  
[205] "221"                     "194"                    
[207] "27"                      "Buffalo Sabres"         
[209] "1997"                    "36"                     
[211] "29"                      ""                       
[213] "0.439"                   "211"                    
[215] "187"                     "24"                     
[217] "Calgary Flames"          "1997"                   
[219] "26"                      "41"                     
[221] ""                        "0.317"                  
[223] "217"                     "252"                    
[225] "-35"                    
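
The single vector above holds the table row by row, 9 values per team. If you do want to start from the "td" selector, one possible approach (a sketch, not the approach taken in the rest of this section) is to fold the vector into a matrix with 9 columns and then convert it to a dataframe. The cells vector below is typed in by hand from the first two teams in the output above, and the column names are assumed labels for this illustration; they are not taken from the page.

```r
# First 18 values of the output above (2 teams x 9 values each)
cells = c("Colorado Avalanche", "1996", "49", "24", "", "0.598", "277", "205", "72",
          "Dallas Stars",       "1996", "48", "26", "", "0.585", "252", "198", "54")

cellsPerRow = 9

# byrow = TRUE fills the matrix one row at a time, which matches the
# order in which the td elements came back
m = matrix(cells, ncol = cellsPerRow, byrow = TRUE)

dfCells = as.data.frame(m, stringsAsFactors = FALSE)

# Assumed column labels (not scraped from the page)
names(dfCells) = c("team", "year", "wins", "losses", "otLosses",
                   "winPct", "goalsFor", "goalsAgainst", "goalDiff")

dfCells
```

This only works when every row has exactly the same number of cells. Scraping each column with its own selector, as done below, does not depend on that.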
# Get the name data
teamNames = fullPage %>%
  html_elements(cssName) %>%
  html_text2()

# Reminder about the %>% pipe symbol:
# the following is the same as above without using the %>% pipe symbol
html_text2(html_elements(fullPage, cssName))
 [1] "Colorado Avalanche"      "Dallas Stars"           
 [3] "Detroit Red Wings"       "Edmonton Oilers"        
 [5] "Florida Panthers"        "Hartford Whalers"       
 [7] "Los Angeles Kings"       "Montreal Canadiens"     
 [9] "New Jersey Devils"       "New York Islanders"     
[11] "New York Rangers"        "Ottawa Senators"        
[13] "Philadelphia Flyers"     "Phoenix Coyotes"        
[15] "Pittsburgh Penguins"     "San Jose Sharks"        
[17] "St. Louis Blues"         "Tampa Bay Lightning"    
[19] "Toronto Maple Leafs"     "Vancouver Canucks"      
[21] "Washington Capitals"     "Mighty Ducks of Anaheim"
[23] "Boston Bruins"           "Buffalo Sabres"         
[25] "Calgary Flames"         
# The %>% symbol takes the output of the command on the left and 
# sends it into the first argument of the command on the right.
rep(c(100,200), 3)
[1] 100 200 100 200 100 200
c(100,200) %>%
  rep(3)
[1] 100 200 100 200 100 200
teamNames
 [1] "Colorado Avalanche"      "Dallas Stars"           
 [3] "Detroit Red Wings"       "Edmonton Oilers"        
 [5] "Florida Panthers"        "Hartford Whalers"       
 [7] "Los Angeles Kings"       "Montreal Canadiens"     
 [9] "New Jersey Devils"       "New York Islanders"     
[11] "New York Rangers"        "Ottawa Senators"        
[13] "Philadelphia Flyers"     "Phoenix Coyotes"        
[15] "Pittsburgh Penguins"     "San Jose Sharks"        
[17] "St. Louis Blues"         "Tampa Bay Lightning"    
[19] "Toronto Maple Leafs"     "Vancouver Canucks"      
[21] "Washington Capitals"     "Mighty Ducks of Anaheim"
[23] "Boston Bruins"           "Buffalo Sabres"         
[25] "Calgary Flames"         
# Get the year data
teamYears = fullPage %>%
  html_elements(cssYear) %>%
  html_text2()

teamYears
 [1] "1996" "1996" "1996" "1996" "1996" "1996" "1996" "1996" "1996" "1996"
[11] "1996" "1996" "1996" "1996" "1996" "1996" "1996" "1996" "1996" "1996"
[21] "1996" "1997" "1997" "1997" "1997"
# now combine the values in a dataframe
dfTeamInfo = data.frame (name=teamNames, year=teamYears)
dfTeamInfo
                      name year
1       Colorado Avalanche 1996
2             Dallas Stars 1996
3        Detroit Red Wings 1996
4          Edmonton Oilers 1996
5         Florida Panthers 1996
6         Hartford Whalers 1996
7        Los Angeles Kings 1996
8       Montreal Canadiens 1996
9        New Jersey Devils 1996
10      New York Islanders 1996
11        New York Rangers 1996
12         Ottawa Senators 1996
13     Philadelphia Flyers 1996
14         Phoenix Coyotes 1996
15     Pittsburgh Penguins 1996
16         San Jose Sharks 1996
17         St. Louis Blues 1996
18     Tampa Bay Lightning 1996
19     Toronto Maple Leafs 1996
20       Vancouver Canucks 1996
21     Washington Capitals 1996
22 Mighty Ducks of Anaheim 1997
23           Boston Bruins 1997
24          Buffalo Sabres 1997
25          Calgary Flames 1997

26.13 Put it into a function

#-----------------------------------------------
# Let's see if we can get the first page of data
#-----------------------------------------------

# The following function will be useful to get rid of extra "whitespace"
# if necessary. Note that it removes ALL newlines, tabs and spaces,
# including the spaces between words.

removeWhitespace = function(text){
  text = gsub("\\n","", text)
  text = gsub("\\t","", text )
  
  return ( gsub(" ", "", text))
}
  
x = "    this is some info \n\nanother line\nanother line\tafter a tab"
cat(x)
    this is some info 

another line
another line    after a tab
removeWhitespace(x)
[1] "thisissomeinfoanotherlineanotherlineafteratab"
# Create a function to scrape the hockey data from one of the 
# pages as specified in the url argument.
getHockeyData <- function(url) {

  the_full_html = read_html(url)
  
  teamNames = the_full_html %>%
    # Analyze the HTML from one of the pages to figure out which CSS selector
    # is best to use. We did so and figured out that the hockey team names
    # were surrounded with an HTML element that had class="name". Therefore the
    # best css selector to use was ".name"
    html_elements( ".name" ) %>%     
    html_text()
  
  teamNames = removeWhitespace(teamNames)   # the team names seem to be surrounded by whitespace

  wins = the_full_html %>%
    # Analyze the HTML from one of the pages to figure out which CSS selector
    # is best to use. We did so and figured out that the number of wins
    # was surrounded with an HTML element that had class="wins". Therefore the
    # best css selector to use was ".wins"
    html_elements( ".wins" ) %>%
    html_text()
  
  wins = removeWhitespace(wins)

  # ... we can keep doing this to get all the other data on the page 
  # for each team ... We didn't do that here but feel free to fill in 
  # the missing code to scrape the rest of the data from the webpages.
  
  return(data.frame(team=teamNames, wins=wins, stringsAsFactors=FALSE))  
}
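
As written, removeWhitespace() deletes every space, including the spaces between words, which is why the team names in the output come out as "LosAngelesKings". The following is a minimal sketch of one alternative that keeps single spaces between words. The function name trimAndSquish is made up for this illustration; it is not part of rvest or of the code above.

```r
# Hypothetical helper (the name is an assumption for this sketch):
# collapse every run of whitespace to one space, then trim both ends.
trimAndSquish = function(text) {
  text = gsub("[[:space:]]+", " ", text)  # newlines, tabs, repeated spaces -> one space
  trimws(text)                            # drop the leading and trailing space
}

x = "\n      Los Angeles   Kings\t\n  "
trimAndSquish(x)
```

Swapping this helper into getHockeyData() would keep the team names readable. rvest's html_text2() is another option, since it already cleans up the surrounding whitespace.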

26.14 Some URLs include info about the page number

# This URL asks for page number 20 (see page_num=20 at the end of the URL)
page20Url = "https://scrapethissite.com/pages/forms/?page_num=20"
getHockeyData(page20Url)
                  team wins
1      LosAngelesKings   34
2        MinnesotaWild   40
3    MontrealCanadiens   41
4   NashvillePredators   40
5      NewJerseyDevils   51
6     NewYorkIslanders   26
7       NewYorkRangers   43
8       OttawaSenators   36
9   PhiladelphiaFlyers   44
10      PhoenixCoyotes   36
11  PittsburghPenguins   45
12       SanJoseSharks   53
13       St.LouisBlues   41
14   TampaBayLightning   24
15   TorontoMapleLeafs   34
16    VancouverCanucks   45
17  WashingtonCapitals   50
18        AnaheimDucks   39
19    AtlantaThrashers   35
20        BostonBruins   39
21       BuffaloSabres   45
22       CalgaryFlames   40
23  CarolinaHurricanes   35
24 ColumbusBlueJackets   32
25   ChicagoBlackhawks   52
#--------------------------------------------------------------------
# Let's figure out how to write a loop to get multiple pages of data
#--------------------------------------------------------------------

# The following is the url without the page number
baseurl = "https://scrapethissite.com/pages/forms/?page_num="

# get data for the first page
pageNum = "1"
url = paste0(baseurl, pageNum)
getHockeyData (url)
                  team wins
1         BostonBruins   44
2        BuffaloSabres   31
3        CalgaryFlames   46
4    ChicagoBlackhawks   49
5      DetroitRedWings   34
6       EdmontonOilers   37
7      HartfordWhalers   31
8      LosAngelesKings   46
9  MinnesotaNorthStars   27
10   MontrealCanadiens   39
11     NewJerseyDevils   32
12    NewYorkIslanders   25
13      NewYorkRangers   36
14  PhiladelphiaFlyers   33
15  PittsburghPenguins   41
16     QuebecNordiques   16
17       St.LouisBlues   47
18   TorontoMapleLeafs   23
19    VancouverCanucks   28
20  WashingtonCapitals   37
21        WinnipegJets   26
22        BostonBruins   36
23       BuffaloSabres   31
24       CalgaryFlames   31
25   ChicagoBlackhawks   36
# get data for the 2nd page
url=paste0(baseurl, "2")
getHockeyData(url)
                  team wins
1      DetroitRedWings   43
2       EdmontonOilers   36
3      HartfordWhalers   26
4      LosAngelesKings   35
5  MinnesotaNorthStars   32
6    MontrealCanadiens   41
7      NewJerseyDevils   38
8     NewYorkIslanders   34
9       NewYorkRangers   50
10  PhiladelphiaFlyers   32
11  PittsburghPenguins   39
12     QuebecNordiques   20
13       SanJoseSharks   17
14       St.LouisBlues   36
15   TorontoMapleLeafs   30
16    VancouverCanucks   42
17  WashingtonCapitals   45
18        WinnipegJets   33
19        BostonBruins   51
20       BuffaloSabres   38
21       CalgaryFlames   43
22   ChicagoBlackhawks   47
23     DetroitRedWings   47
24      EdmontonOilers   26
25     HartfordWhalers   26
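
Because paste0() is vectorized, the URLs for several pages can be built in one call. The sketch below also shows sprintf() for the case where the changing part sits in the middle of a URL. The example.com address and the ticker symbols are placeholders made up for this illustration, not a real endpoint.

```r
baseurl = "https://scrapethissite.com/pages/forms/?page_num="

# One URL per page number, built in a single call
urls = paste0(baseurl, 1:3)
urls

# When the changing part is in the middle of the URL, sprintf() fills in
# each %s placeholder. (Hypothetical URL pattern, for illustration only.)
tickers = c("AAPL", "MSFT")
quoteUrls = sprintf("https://example.com/quote/%s?p=%s", tickers, tickers)
quoteUrls
```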

26.15 Function to get multiple pages

# We can now write a function to get the data from multiple pages of the hockey data.
# The pages argument is expected to be a vector with the numbers of 
# the pages you want to retrieve.

getMultiplePages <- function(pages = 1:10){
  # This is the URL without the page number
  baseurl = "https://scrapethissite.com/pages/forms/?page_num="

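  # Other URLs place the part that changes in the middle of the URL instead
  # of at the end. The following are patterns with placeholders, not working URLs:
  #   baseurl = "https://scrapethissite.com/pages/forms/?page_num=THE_PAGE_NUMBER&league=american"
  #   baseurl = "https://finance.yahoo.com/quote/<<<TICKER>>>?p=<<<TICKER>>>&.tsrc=fin-srch"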
  
  allData = NA
  
  # Loop through all the pages
  for ( page in pages ){
    # Create the URL for the current page
    url = paste0(baseurl, page)
    
    # Get the data for the current page
    hd = getHockeyData(url)
    
    # Combine the data for the current page with all of the data
    # from the pages we have already retrieved.
    
    if (!is.data.frame(allData)){
      # This will only happen for the first page of retrieved data.
      allData = hd
    } else {
      # This will happen for all pages other than the first page retrieved.
      # rbind will only work if allData is already a dataframe. Therefore
      # we cannot use rbind for the first page of retrieved data. 
      allData = rbind(allData, hd)
    }
    # We don't want to overwhelm the web server with too many requests
    # so we will pause (i.e. sleep) for 1 second after every time we
    # retrieve a page before getting the next page of data.
    Sys.sleep(1)
  }
  return(allData)
}

getMultiplePages(4:6)
                   team wins
1       FloridaPanthers   33
2       HartfordWhalers   27
3       LosAngelesKings   27
4     MontrealCanadiens   41
5       NewJerseyDevils   47
6      NewYorkIslanders   36
7        NewYorkRangers   52
8        OttawaSenators   14
9    PhiladelphiaFlyers   35
10   PittsburghPenguins   44
11      QuebecNordiques   34
12        SanJoseSharks   33
13        St.LouisBlues   40
14    TampaBayLightning   30
15    TorontoMapleLeafs   43
16     VancouverCanucks   41
17   WashingtonCapitals   39
18         WinnipegJets   24
19 MightyDucksofAnaheim   16
20         BostonBruins   27
21        BuffaloSabres   22
22        CalgaryFlames   24
23    ChicagoBlackhawks   24
24          DallasStars   17
25      DetroitRedWings   33
26       EdmontonOilers   17
27      FloridaPanthers   20
28      HartfordWhalers   19
29      LosAngelesKings   16
30    MontrealCanadiens   18
31      NewJerseyDevils   22
32     NewYorkIslanders   15
33       NewYorkRangers   22
34       OttawaSenators    9
35   PhiladelphiaFlyers   28
36   PittsburghPenguins   29
37      QuebecNordiques   30
38        SanJoseSharks   19
39        St.LouisBlues   28
40    TampaBayLightning   17
41    TorontoMapleLeafs   21
42     VancouverCanucks   18
43   WashingtonCapitals   22
44         WinnipegJets   16
45 MightyDucksofAnaheim   35
46         BostonBruins   40
47        BuffaloSabres   33
48        CalgaryFlames   34
49    ChicagoBlackhawks   40
50    ColoradoAvalanche   47
51          DallasStars   26
52      DetroitRedWings   62
53       EdmontonOilers   30
54      FloridaPanthers   41
55      HartfordWhalers   34
56      LosAngelesKings   24
57    MontrealCanadiens   40
58      NewJerseyDevils   37
59     NewYorkIslanders   22
60       NewYorkRangers   41
61       OttawaSenators   18
62   PhiladelphiaFlyers   45
63   PittsburghPenguins   49
64        SanJoseSharks   20
65        St.LouisBlues   32
66    TampaBayLightning   38
67    TorontoMapleLeafs   34
68     VancouverCanucks   32
69   WashingtonCapitals   39
70         WinnipegJets   36
71 MightyDucksofAnaheim   36
72         BostonBruins   26
73        BuffaloSabres   40
74        CalgaryFlames   32
75    ChicagoBlackhawks   34
# BE CAREFUL - the next line may take a little time to run
# allPages = getMultiplePages(1:24)
#
# Let's try it with just 2 pages for now ...
allPages = getMultiplePages(1:2)
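
The loop above grows allData with rbind() one page at a time. A common alternative is to collect one data frame per page in a list and stack them all at the end. The sketch below illustrates that pattern without touching the network; fakeGetPage is a made-up stand-in for getHockeyData() and returns invented data.

```r
# Hypothetical stand-in for getHockeyData(): returns a small data frame
# for a page number. The teams and wins are invented for this sketch.
fakeGetPage = function(page) {
  data.frame(team = paste0("Team", page, c("A", "B")),
             wins = c(10, 20) + page,
             stringsAsFactors = FALSE)
}

pages = 4:6

# Collect one data frame per page in a list ...
pageList = lapply(pages, fakeGetPage)

# ... then stack them all with a single call to rbind
allData = do.call(rbind, pageList)

nrow(allData)
allData
```

With real pages, the function passed to lapply() would call getHockeyData() and then Sys.sleep(1), so the pause between requests is kept.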

26.16 Scrape the number of pages to get

# Figure out automatically how many pages of data there are on the website

url= "https://www.scrapethissite.com/pages/forms/?page_num=1"
fullPage = read_html(url)

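# Three candidate css selectors for the link to the last page of data.
# Each assignment overwrites the previous one, so only the 3rd is used.
#
# 1. Targets the 24th <li>. This only works while there are exactly 24 pages.
css = "li:nth-child(24) a"
# 2. Targets the last <li> of every list on the page. In the list of page
#    links, the last <li> holds the "next page" button, not the last page number.
css = "li:last-child a"
# 3. Targets the 2nd to last <li> inside the element that has
#    class="pagination-area". That <li> holds the last page number.
css = ".pagination-area li:nth-last-child(2) a"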

fullPage %>%
  html_elements(css) %>%
  html_text2()
[1] "24"
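
To see why a selector such as "li:last-child a" is not enough, it can help to try it on a tiny page. The sketch below uses rvest's minimal_html() to build a made-up page (the HTML is invented for this illustration and is much simpler than the real site) with a navigation list and a list of page links.

```r
library(rvest)

# Invented page: a navigation list plus a list of page links
page = minimal_html("
  <ul class='nav'>
    <li><a href='/'>Home</a></li>
    <li><a href='/faq'>FAQ</a></li>
  </ul>
  <div class='pagination-area'>
    <ul>
      <li><a href='?page_num=1'>1</a></li>
      <li><a href='?page_num=2'>2</a></li>
      <li><a href='?page_num=3'>3</a></li>
      <li><a href='?page_num=2'>next page</a></li>
    </ul>
  </div>
")

getText = function(css) page %>% html_elements(css) %>% html_text2()

# Matches the last <li> of BOTH lists, and neither is the last page number
lastChildText = getText("li:last-child a")
lastChildText

# Limited to the pagination area, 2nd to last <li>: the last page number
lastPageText = getText(".pagination-area li:nth-last-child(2) a")
lastPageText
```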

26.17 robots.txt file

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# DEFINITION: root directory (or root folder) ####
# 
# The "root" folder is the top level folder
# on a computer harddrive or on a website.
# It is named "/" on Mac and Linux and on websites. On Windows each drive
# has its own root folder, e.g. "C:\".
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

# EXAMPLE
dir("/")   # show the files and folders in the root of the harddrive
 [1] "$Recycle.Bin"                "$SysReset"                  
 [3] "$WinREAgent"                 "Android_GoogleDriveWhatsApp"
 [5] "Brother"                     "Config.Msi"                 
 [7] "cygwin64"                    "Documents and Settings"     
 [9] "DumpStack.log"               "DumpStack.log.tmp"          
[11] "hiberfil.sys"                "hp"                         
[13] "hpswsetup"                   "Intel"                      
[15] "OneDriveTemp"                "pagefile.sys"               
[17] "PanoptoRecorder"             "PerfLogs"                   
[19] "Program Files"               "Program Files (x86)"        
[21] "ProgramData"                 "Python310"                  
[23] "Recovery"                    "rtools40"                   
[25] "SQL2019"                     "swapfile.sys"               
[27] "SWSetup"                     "System Volume Information"  
[29] "System.sav"                  "tmp"                        
[31] "tmp000350"                   "tmp2"                       
[33] "tmp3"                        "tmp4"                       
[35] "Users"                       "Windows"                    
# robots.txt ####
# 
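# Websites may place a file named robots.txt
# in the root folder of their website. The file tells automated programs
# ("robots", which includes web scrapers) which parts of the site they may visit.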
#
# Examples
#   https://finance.yahoo.com/robots.txt    # pretty simple
#   https://www.amazon.com/robots.txt       # very complex
#   https://www.gutenberg.org/robots.txt    # pretty typical
#
#
# Example entries in robots.txt
#     Disallow all robots from scraping the entire site,
#     since "Disallow: /" specifies the "root" or top level folder
#
#        User-agent: *
#        Disallow: /
#
#     Allow all robots to scrape the entire site,
#     since "Disallow: " specifies no path
#
#        User-agent: *
#        Disallow: 
#
#     Ask robots to sleep 10 seconds between each page request
#     (not every robot honors this entry)
#
#        Crawl-delay: 10
#
# More info about robots.txt
#
#    Quick overview and a "test tool"
#      https://en.ryte.com/free-tools/robots-txt/
#
#    Ultimate robots.txt guide
#      https://yoast.com/ultimate-guide-robots-txt/
#
#    Another test tool
#      https://technicalseo.com/tools/robots-txt/
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
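
The entries described above can be read with a few lines of base R. The following is a deliberately simplified sketch: the robots.txt content is typed in as text (invented for this illustration), and the code ignores per-robot sections, "Allow:" entries and wildcards, all of which a real robots.txt file may contain. The helper names getField and isAllowed are made up for this example.

```r
# A tiny robots.txt, supplied as text so the example runs offline
robotsTxt = c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /tmp/",
  "Crawl-delay: 10"
)

# Hypothetical helper: get the value of every line that starts with a field name
getField = function(lines, field) {
  pattern = paste0("^", field, ":\\s*")
  matches = grepl(pattern, lines, ignore.case = TRUE)
  sub(pattern, "", lines[matches], ignore.case = TRUE)
}

disallowed = getField(robotsTxt, "Disallow")
crawlDelay = as.numeric(getField(robotsTxt, "Crawl-delay"))

# Hypothetical helper: a path is blocked if it starts with any
# non-empty Disallow entry
isAllowed = function(path, disallowed) {
  disallowed = disallowed[disallowed != ""]
  !any(startsWith(path, disallowed))
}

isAllowed("/pages/forms/", disallowed)
isAllowed("/private/data.html", disallowed)
crawlDelay   # could be passed to Sys.sleep() between page requests
```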

26.18 Techniques for finding CSS selectors

############################################################################.
# Techniques for finding CSS selectors
#
# 1. Analyze the entire HTML code.
#    Right-click on blank area of page and choose "view page source"
#    (or something similar)
#
# 2. Use the built in developer tools in most browsers.
#    Right click on the data you want to scrape and choose "inspect".
#    That brings you to the browser's developers tools. 
#    Right click on the HTML Tag you want to get a CSS selector for
#    and choose "Copy selector". Then paste the selector into 
#    a text editor (e.g. VSCode or RStudio's editor).
#    
#    Warning: these selectors are often overly specific and
#    do not generalize well to scrape the same type of data from
#    the entire webpage.
#
#    For example the following css selector was retrieved in this way.
#    It gets a very specific position on the page. However, if the page
#    code changes in the future, this selector may stop working. 
#    This selector is very "fragile":
#
#        css selector to find the last page 
#        #hockey > div > div.row.pagination-area > div.col-md-10.text-center > ul > li:nth-child(24) > a
#
# 3. Use the "selector gadget" Chrome extension. 
#    This is very helpful but is also not 100% guaranteed to work 
#    correctly.
############################################################################.

26.19 “scrape” the total number of pages.

############################################################################
#
# Write code to "scrape" the total number of pages.
#
# Find out how many pages there are (since the website may change, 
# we can't assume that there will always be the same number of pages)
#
###########################################################################


#..................................................................
# Find the CSS selector that pinpoints the number of the last page
#..................................................................

# Modern browsers have "web developer tools" built into them.
# These tools can help you to find the css selector that you need.
# Different browsers have slightly different tools. 
# For the following we will be using Chrome.
#
# Navigate to the first page ( https://scrapethissite.com/pages/forms/?page_num=1 )
#
# Right click on the last page number and choose "inspect". This will
# reveal a new window that shows the HTML for the page. The part of the HTML
# that corresponds to the page number is revealed. Make sure the page number
# is there ...
#
# Right click on the highlighted HTML and choose the menu choices :
# "Copy | copy selector". This will copy into the computer's clipboard 
# a CSS selector that will pinpoint that piece of info.
# 
# Paste the CSS selector into your R code. The following is the 
# css selector that I got when I did this:
#
#   #hockey > div > div.row.pagination-area > div.col-md-10.text-center > ul > li:nth-child(24) > a
#
# Note that this css selector is "fragile" - i.e. it is likely not to work
# in the future if more data is added to the site. Therefore it is NOT a very 
# good CSS selector to use. We will examine this issue below. For now, let's 
# just use this selector and revisit this issue later.

url = "https://scrapethissite.com/pages/forms/?page_num=1"
cssSelector = "#hockey > div > div.row.pagination-area > div.col-md-10.text-center > ul > li:nth-child(24) > a"

# save the html for the page in a variable so we can use it 
# again later without needing to pull it down again from the website.
full_page = read_html(url)          

# Get last page number by searching the full_page html for the tags
# identified by the cssSelector and stripping out the tags
last_page_number <- full_page %>%
  html_elements( cssSelector ) %>%
  html_text()

last_page_number
[1] "\n                                    \n                                    24\n                                    \n                                "
# Strip out the whitespace 
last_page_number = removeWhitespace(last_page_number)
last_page_number
[1] "24"
# Convert the page number to a numeric value
last_page_number = as.numeric(last_page_number)
last_page_number
[1] 24
# Now that it works, let's create a function that packages up this functionality.
# It's always a good idea to create a function that neatly packages up your
# working functionality.
#
# The function will take the "parsed" html code as an argument.

getLastPageNumber <- function(theFullHtml) {
  cssSelector = 
    "#hockey > div > div.row.pagination-area > div.col-md-10.text-center > ul > li:nth-child(24) > a"
  
  # remember that a function that does not have a return statement
  # will return the last value that is calculated.
  theFullHtml %>%
    html_elements( cssSelector ) %>%
    html_text() %>%
    removeWhitespace() %>%
    as.numeric()
}

last_page_number = getLastPageNumber(full_page)
last_page_number
[1] 24
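
The steps above use html_text() followed by removeWhitespace(). As seen earlier with the css selectors, html_text2() already returns the page number without the surrounding whitespace. The sketch below compares the two on a made-up snippet of HTML (invented for this illustration) that imitates the whitespace around the page number.

```r
library(rvest)

# Invented snippet: the page number is surrounded by newlines and spaces
page = minimal_html("<ul class='pagination'><li><a href='?page_num=24'>
        24
      </a></li></ul>")

link = page %>% html_elements(".pagination a")

rawText   = html_text(link)    # keeps the newlines and spaces
cleanText = html_text2(link)   # whitespace is cleaned up

rawText
cleanText
lastPage = as.numeric(cleanText)
lastPage
```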

26.20 Timing R code with Sys.time()

# Now that we have the total number of pages, we can get ALL of the data
# for all the pages as follows.
#
# NOTE that this may take a while to run.
# We can figure out how much time by running all of the following commands
# together. 

Sys.time()    # shows the time on your computer
[1] "2024-04-16 12:47:28 EDT"
start = Sys.time()
# put some r code here
end = Sys.time()
end - start
Time difference of 0.0007381439 secs
start_time = Sys.time() # get the time before we start

dataFromAllPages = getMultiplePages(1:last_page_number)   # get all the pages

end_time = Sys.time()   # get the time when it ended

end_time - start_time   # show how long it took
Time difference of 30.53477 secs
# Let's examine the results ...
nrow(dataFromAllPages)  # how many rows?
[1] 582
head(dataFromAllPages, 4)  # first few rows ...
               team wins
1      BostonBruins   44
2     BuffaloSabres   31
3     CalgaryFlames   46
4 ChicagoBlackhawks   49
tail(dataFromAllPages, 4)  # last few rows ...
                  team wins
579  TorontoMapleLeafs   35
580   VancouverCanucks   51
581 WashingtonCapitals   42
582       WinnipegJets   37
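
Subtracting two Sys.time() values picks the units (secs, mins, ...) automatically, which can be confusing when comparing runs. The sketch below shows two base R alternatives: system.time(), which times an expression directly, and difftime() with explicit units. Sys.sleep(0.2) is only a stand-in for the slow scraping call.

```r
# Time an expression directly. "elapsed" is the wall-clock time in seconds.
timing = system.time({
  Sys.sleep(0.2)   # stand-in for getMultiplePages(...)
})
timing["elapsed"]

# Same idea with Sys.time(), but with the units fixed to seconds
start = Sys.time()
Sys.sleep(0.2)     # stand-in for the code being timed
elapsedSecs = as.numeric(difftime(Sys.time(), start, units = "secs"))
elapsedSecs
```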

26.20.1 Finding a css selector that isn’t “fragile”

#######################################################################
#
# Finding css selectors that aren't "fragile"
#
#######################################################################

#--------------------------------------------
# Quick review of CSS selector rules. 
#--------------------------------------------

# To review the CSS selector rules see the website 
#   https://flukeout.github.io/ 
#
# Read the tutorial information as well as try to solve the different 
# levels of the game. 
# The answers to the "game" on the above website can be found here:
#   https://gist.github.com/humbertodias/b878772e823fd9863a1e4c2415a3f7b6
#


#-----------------------------------------------
# Analyzing the CSS Selector we got from Chrome
# (There is an issue with it. 
#  We will address the issue and fix it.) 
#------------------------------------------------

# The following was the selector that we got from Chrome 
# to find the last page number:
#
#   #hockey > div > div.row.pagination-area > div.col-md-10.text-center > ul > li:nth-child(24) > a
#
# This selector is "fragile". It will work for now but might not work in the future. 
# This website has data for several years up until the present time. We 
# anticipate that in the future there will be additional pages of data.
# The CSS selector that we found will work to find the 24th page number
# (which is currently the last page). However, in the future if there are 
# more pages, the selector will continue to return the number 24. 
# To understand exactly why, let's analyze this selector:
# For a review of the rules for how selectors are built see the following 
# page: https://flukeout.github.io/
#

# <div>                # div is the parent of h1 and ul
#   <h1>My stuff</h1>  # h1 is the child of div and the sibling of ul
#   <ul>               # ul is the parent of 3 <li>s, the 2nd child of div and a sibling of h1
#     <li>             # 
#       <strong>       
#         Table
#       </strong></li>
#     <li>Chairs</li>
#     <li>Fork</li>
#   </ul>
# </div>


# The following breaks down the selector that Chrome figured out into its 
# different parts:
#
#   #hockey                   # find the HTML element whose id attribute is "hockey", i.e. id="hockey"
#   >                         # directly inside of that element
#   div                       # there is a <div> tag
#   >                         # directly inside of that div tag
#   div.row.pagination-area   # there is a <div> tag that has row and pagination-area classes, i.e. <div class="row pagination-area">
#   >                         # directly inside of that
#   div.col-md-10.text-center # there is div tag with classes "col-md-10" and "text-center", i.e. <div class="col-md-10 text-center">
#   >                         # directly inside of that 
#   ul                        # there is a ul tag
#   >                         # directly inside that there is
#   li:nth-child(24)          # an <li> tag that is the 24th child of the enclosing <ul> tag
#   >                         # directly inside that there is 
#   a                         # an "a" tag
#
# You should take the time to convince yourself that the css selector is 
# accurate today by looking carefully at the HTML code and noting exactly where the 
# last page number is in the context of the other tags and attributes on the page. 
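
The pieces of a selector can be tried out on a small page before using them on the real one. The sketch below rebuilds the small "My stuff" HTML example from above with rvest's minimal_html() and tests the difference between ">" (direct child) and a space (any descendant), along with two positional selectors.

```r
library(rvest)

page = minimal_html("
  <div>
    <h1>My stuff</h1>
    <ul>
      <li><strong>Table</strong></li>
      <li>Chairs</li>
      <li>Fork</li>
    </ul>
  </div>
")

# <strong> is inside the div, but not DIRECTLY inside it
directChild = page %>% html_elements("div > strong") %>% html_text2()
descendant  = page %>% html_elements("div strong") %>% html_text2()

# Positions are counted among the children of the enclosing tag
secondItem = page %>% html_elements("ul > li:nth-child(2)") %>% html_text2()
lastItem   = page %>% html_elements("ul > li:last-child") %>% html_text2()

length(directChild)   # no matches
descendant
secondItem
lastItem
```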


#................................................
# Using VSCode (or another text editor) to analyze the HTML
#................................................

# It's much easier to navigate around the HTML code by analyzing the HTML in an HTML editor. 
# VSCode works well for this. To get the full HTML, right click on the page
# and choose "View page source" (in Chrome or a similar link in other browsers).
# Then copy all the code (ctrl-a on windows or cmd-a on Mac) and paste
# it into a new VSCode file. Then save the file with a .html extension
# to ensure that VSCode knows how to navigate around the file. 
# In VSCode you can now point to the left margin and press the arrows that appear
# to "collapse" or "expand" the HTML tags. Doing so helps a lot in trying to 
# understand how the HTML file is organized.
#
# Other useful features in VSCode to help with editing HTML:
#
# - alt-shift-F   
#    
#     Remember that HTML does not require proper indentation to work in the webpage.
#     However, without proper indentation, it is hard to read the HTML. Pressing
#     alt-shift-F will automatically "format" the entire HTML file so that it 
#     is indented properly and is easier to read.
#
# - As noted above - point to left margin to see arrows to 
#   collapse or expand HTML elements.
#
# - ctrl-shift-P
#
#     VSCode has many features that are not directly available through the
#     menus. ctrl-shift-P (Windows) or cmd-shift-P (mac) reveals the
#     "command palette". This is a quick way to search for commands
#     in VSCode which you might not otherwise know about. Try it. 
#     For example,
#
#     * press ctrl-shift-P and type "comment" (without the quotes). Then choose
#       "Add line comment" to automatically insert <!--   --> into your HTML file.
#       You can add comments to the HTML file as you are examining it. This may
#       help you when you are trying to figure out the structure of a 
#       complex HTML file.
#
#     * Highlight some text, press ctrl-shift-P. 
#       Then type "case" (without the quotes). You will see options to transform 
#       the selected text to upper or lowercase. 
#
# - For more tips on how to use VSCode to edit HTML files: 
#   https://code.visualstudio.com/docs/languages/html
#
# - For more tips on how to use VSCode in general:
#   https://code.visualstudio.com/docs/getstarted/tips-and-tricks


#................................................
# ... back to analyzing the CSS selector
#................................................

# From analyzing this HTML, we determine that each page number is in an <li>
# element. Since there are 24 pages of data there are 24 <li> tags (each contains
# a link to one of the page numbers). There is also one more <li> tag that
# contains the "next page" button.     
#
# In the CSS Selector that we got from Chrome "li:nth-child(24)" is specifically
# targeting the 24th <li> tag. Today that is the <li> for the last page. However, 
# if in the future more pages of data are added to the website, this selector
# will still retrieve the 24th <li> tag and not the 2nd to last <li> tag
# (remember, the last <li> tag is for the "next page" button).
#
# (NOTE: the "next page" link appears in the HTML code
# as "next page" but is rendered in the browser as ">>". Replacing the
# words "next page" in the HTML with ">>" on the rendered page
# is accomplished through CSS rules - but this has no effect on  
# what we're trying to accomplish. When scraping a page, you work with what's 
# in the HTML file, not what is rendered in the browser.)
#
# Therefore it is important to analyze the HTML and try 
# to find a "less fragile" css selector that retrieves the LAST page number.
# After looking at the HTML code carefully, it seems like the <ul> tag
# referenced in the CSS selector that we found has a class="pagination" attribute, 
# ie. <ul class="pagination">. We can target that directly with ".pagination".
#
# The li that points to the last page happens to be the 2nd to last 
# <li> tag that is directly inside this ul tag, i.e. 
#
#    <ul class="pagination">
#       <li> info about page 1 </li>
#       <li> info about page 2 </li>
#         etc ...
#       <li> info about the last page </li>         # THIS IS THE <li> WE WANT
#       <li> info about the "next" button </li>
#    </ul>
#
# Our css selector therefore becomes :    ".pagination > li:nth-last-child(2)"
# Let's see if our code works with the new CSS selector

getLastPageNumber <- function(theFullHtml) {
  
  # Selector specifies the 
  # 2nd to last li directly inside the tag that has class="pagination"
  selector = ".pagination > li:nth-last-child(2)"    
  
  # remember that a function that does not have a return statement
  # will return the last value that is calculated.
  theFullHtml %>%
    html_elements( selector ) %>%
    html_text() %>%
    removeWhitespace() %>%
    as.numeric()
}

full_page = read_html("https://scrapethissite.com/pages/forms/?page_num=1")

getLastPageNumber(full_page)  # test the new version
[1] 24
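
The website has 24 pages today, so both versions of the selector return 24 and the difference between them cannot be seen on the real site. The sketch below makes the difference visible on made-up pagination lists of different lengths. makePagination and lastPage are hypothetical helpers written for this illustration, and the generated HTML is a simplified imitation of the real page.

```r
library(rvest)

# Hypothetical helper: build a toy list of n page links plus a "next" button
makePagination = function(n) {
  pageItems = paste0("<li><a href='?page_num=", 1:n, "'>", 1:n, "</a></li>",
                     collapse = "")
  minimal_html(paste0(
    "<ul class='pagination'>", pageItems,
    "<li><a href='#'>next page</a></li></ul>"
  ))
}

# Hypothetical helper: text of the element(s) matched by selector, as a number
lastPage = function(html, selector) {
  html %>% html_elements(selector) %>% html_text2() %>% as.numeric()
}

fragile = ".pagination > li:nth-child(24) > a"
robust  = ".pagination > li:nth-last-child(2) > a"

today  = makePagination(24)   # 24 pages, like the site today
future = makePagination(30)   # what the site might look like with 30 pages

lastPage(today, fragile)    # 24
lastPage(today, robust)     # 24
lastPage(future, fragile)   # still 24, which is now wrong
lastPage(future, robust)    # 30
```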