Trying to web scrape a merged Html table with r

I'm trying to be able to scrape all the tables on any page of the Bureau of Labor Statistics (particularly this one: https://www.bls.gov/news.release/empsit.htm). However, I'm reaching an issue with R on a specific table on that site. The last table which is labelled as Table 7. Using the package rvest, I am using a recursive loop that will first create a matrix of tableIDs taken from the website, filtering out those that have no actual use, and then putting it into the html_table function(). Since there are merged cells, I left fill = TRUE on for the loop, and I also added extra conditions in order to account for some tables that actually have html table but still have the table id (Exhibit 4, 6, and 7). The issue lies in that for the final table, the second row actually does not have enough inputs for the amount of columns given and the rvest function fills it up in an odd way. The second line shouldn't have the 2016 between the May and Jun column and will mess up any future query that I do. Can anybody please help?

Output:

            Benchmark 2017 2016 2016 2016 2016 2016 2016 2016 2016 2016 2017 2017 2017 Total
1           Benchmark 2017  Apr  May 2016  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Jan  Feb   Mar
2   Actual Net Birth/Death  404  180   15  244  105  -38  255  -14  -35 -179   98   76 1,111
3 Forecast Net Birth/Death  255  231   99  154  113  -58  237    7  -17 -247  124   32   930
4               Difference  149  -51  -84   90   -8   20   18  -21  -18   68  -26   44   181
5    Cumulative Difference  149   98   14  104   96  116  134  113   95  163  137  181      

Code below:

webpage<- read_html("https://www.bls.gov/web/empsit/cesbmart.htm")
links <- html_nodes(webpage, "table")

titleMat <- bind_rows(lapply(xml_attrs(links), function(x) 
data.frame(as.list(x), stringsAsFactors=FALSE)))
tableExtract <- list()
tableNames <- array()
tableCap <- array()  
emptyArr <- array()

takeOut <- array()
counter <-0


for(i in 1:nrow(titleMat)){
  bool1 <- (titleMat[i,"class"] == "NA")

  if(is.na(bool1)){
    counter <- counter+1
    takeOut[counter] <- i
    }


}

tableID <- bind_rows(lapply(xml_attrs(links), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))[,"id"]
tableID
if (counter > 0){
tableID <- tableID[-c(takeOut)]
}

emptyCheck <- 0
for (cnt in 1:length(tableID)){


  capCheck <-webpage %>% html_nodes(paste("#",tableID[cnt],sep="")) %>% html_nodes("caption") %>% html_text()

  if (nchar(capCheck)>0){

    changedCap <-trimws(capCheck)

    tableCap[cnt] <- changedCap
  }



  thead <-webpage %>% html_nodes(paste("#",tableID[cnt],sep="")) %>% html_nodes("thead") %>% html_text()
  tbody <- webpage %>% html_nodes(paste("#",tableID[cnt],sep="")) %>% html_nodes("tbody") %>% html_text()
  tfoot <- webpage %>% html_nodes(paste("#",tableID[cnt],sep="")) %>% html_nodes("tfoot") %>% html_text()

  if( isTRUE(nchar(thead) > 0) || isTRUE(nchar(tbody) > 0) || isTRUE(nchar(tfoot) > 0)   ){

    tableExtract[[cnt]] <- webpage %>% html_nodes(paste("#",tableID[cnt],sep="")) %>% html_table(fill = TRUE) %>% .[[1]]
    tableExtract[[cnt]]
    tableNames[cnt] <- paste(tableID[cnt]," - v",sep="")
  }
  else{
    tableExtract[[cnt]] <- matrix("There are no recent updates for this table",1,1)
    tableNames[cnt] <- paste(tableID[cnt]," - v",sep="")

    emptyCheck <- emptyCheck + 1
    emptyArr[emptyCheck] <- cnt
  }
}

1 answer

  • answered 2018-02-13 07:51 Guillaume Ottavianoni

    You can try html_table function with FILL parameter :

    library(rvest)
    
    url <- "https://www.bls.gov/news.release/empsit.htm"
    page <- read_html(url)
    
    tables <- page %>% html_nodes("table")
    
    for (i in 1:length(tables)) {
     content <- try(tables[i] %>% html_table(fill=T))
       if( typeof(content) == 'list' ) content <- data.frame(content) else {
          content <- matrix(content)[[1]]
          content <- content[-c(1,length(content[,1])-1,length(content[,1])),]
       }
     assign(paste0("table_",i),content)
    }
    

    Hope that will help

    Gottavianoni