Scraping dataTable with rvest by id, doesn't find table

I'm trying to scrape data from the DataTable on the page below, selecting the table by its id with an XPath expression:

library(rvest)
library(dplyr)

url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"  

h <- url %>% read_html() 

h %>% html_nodes(xpath = "//*[@id='qs-rankings-indicators']") %>% html_table()

The last command gives me this error:

Error in matrix(NA_character_, nrow = n, ncol = maxp) : 
  invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
  NAs introduced by coercion to integer range

What am I missing here?

2 answers

  • answered 2020-11-25 05:16 stevec

    You had actually already found the table with just:

    library(rvest)
    library(dplyr)
    
    url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"  
    
    h <- url %>% read_html() 
    
    h %>% 
      html_nodes(xpath = "//*[@id='qs-rankings-indicators']")
    
    {xml_nodeset (1)}
    [1] <table id="qs-rankings-indicators" class="order-column" cellspacing="0" width="100%"></table>
    

    i.e. without the last %>% html_table()

    The table is empty because its data is loaded with JavaScript after the initial HTML page load, so read_html() only ever sees the empty <table> element.
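
    A quick way to confirm this is to count the table's rows before calling html_table(). The sketch below is only an illustration: it uses an inline HTML string as a stand-in for the downloaded page (an assumption, so that it runs offline), and the variable names are made up for the example.

    ```r
    library(rvest)

    # Stand-in for the page as rvest sees it: the <table> exists but has no rows.
    page <- read_html('<table id="qs-rankings-indicators" class="order-column"></table>')

    # Same XPath as in the question.
    tbl <- html_nodes(page, xpath = "//*[@id='qs-rankings-indicators']")

    # Count the <tr> elements inside the matched table.
    n_rows <- length(html_nodes(tbl, xpath = ".//tr"))

    length(tbl)  # 1: the table element itself is found
    n_rows       # 0: there is nothing for html_table() to parse

    if (n_rows > 0) {
      html_table(tbl)
    } else {
      message("Table found but empty; its rows are probably added by JavaScript.")
    }
    ```

    Running the same row count against the real page should tell you whether the data is in the static HTML at all.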

    To get the table including the JavaScript-loaded content, you'll need a scraping tool that can run the site's JavaScript (I would recommend RSelenium).
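
    A rough sketch of what that could look like is below. It has not been run against this site. The helper name scrape_rendered_table, the choice of Firefox, the port number and the fixed wait are all assumptions made for the example, and it needs a working browser driver on your machine.

    ```r
    # Hypothetical helper: render the page in a real browser, then parse it with rvest.
    scrape_rendered_table <- function(url, table_id, wait_seconds = 5) {
      if (!requireNamespace("RSelenium", quietly = TRUE)) {
        stop("Package 'RSelenium' is required for this sketch.")
      }

      # Assumption: Firefox is installed and port 4567 is free.
      driver <- RSelenium::rsDriver(browser = "firefox", port = 4567L, verbose = FALSE)
      client <- driver$client

      # Always shut down the browser and the Selenium server, even on error.
      on.exit({
        client$close()
        driver$server$stop()
      }, add = TRUE)

      client$navigate(url)
      Sys.sleep(wait_seconds)  # crude fixed wait for the JavaScript to fill the table

      # Hand the rendered HTML to rvest and extract the table by id.
      html <- client$getPageSource()[[1]]
      page <- rvest::read_html(html)
      node <- rvest::html_node(page, xpath = sprintf("//*[@id='%s']", table_id))
      rvest::html_table(node)
    }

    # Example call (not run here):
    # scrape_rendered_table(url, "qs-rankings-indicators")
    ```

    Since the table is paginated in the browser, this would likely return only the rows currently displayed, not the full ranking.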

  • answered 2020-11-25 05:23 ekoam

    That table is rendered by JavaScript. You could instead get the JSON data directly from the source. Try something like this:

    tstamp <- function() as.character(trunc(as.numeric(Sys.time()) * 1e3))
    url <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt"
    
    res <- 
      jsonlite::fromJSON(paste0(url, "?_=", tstamp()))$data[, c(
        "rank_display", "score", "title", "country", "region"
      )]
    

    Output

    > head(res)
      rank_display score                                        title        country        region
    1            1   100  Massachusetts Institute of Technology (MIT)  United States North America
    2            2  98.7                          Stanford University  United States North America
    3            3  98.4                           Harvard University  United States North America
    4            4  97.7 California Institute of Technology (Caltech)  United States North America
    5            5  95.6                      University of Cambridge United Kingdom        Europe
    6            6  95.3                         University of Oxford United Kingdom        Europe
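
    If you want to compute anything with the scores, it is worth checking their type first. The sketch below uses a hypothetical two-row JSON sample that mimics the shape of the response (values copied from the output above) so that it runs offline; it assumes the fields arrive as strings, which you should verify against the real data with str(res).

    ```r
    library(jsonlite)

    # Hypothetical sample mimicking the endpoint's shape: a top-level "data" array of records.
    sample_json <- '{"data":[
      {"rank_display":"1","score":"100","title":"Massachusetts Institute of Technology (MIT)","country":"United States","region":"North America"},
      {"rank_display":"2","score":"98.7","title":"Stanford University","country":"United States","region":"North America"}
    ]}'

    # fromJSON() simplifies the array of records into a data frame.
    res <- fromJSON(sample_json)$data[, c(
      "rank_display", "score", "title", "country", "region"
    )]

    # Assumption: scores are strings in the JSON; convert before doing arithmetic.
    res$score <- as.numeric(res$score)

    str(res$score)
    mean(res$score)
    ```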