Trying to web scrape a merged HTML table with R
I'm trying to scrape all the tables on any page of the Bureau of Labor Statistics site (particularly this one: https://www.bls.gov/news.release/empsit.htm), but I've hit an issue in R with one specific table: the last one on the page, labelled Table 7. Using the rvest package, I loop over a matrix of table IDs taken from the page, filter out the IDs that have no actual use, and pass each remaining ID to the html_table() function. Since there are merged cells, I set fill = TRUE in the loop, and I added extra conditions to account for some IDs that exist in the page but have no actual HTML table content (Exhibit 4, 6, and 7). The problem is in the final table: its second header row has fewer cells than there are columns, and html_table() fills it in an odd way. The second line shouldn't have the 2016 between the May and Jun columns, and this will break any query I run later. Can anybody please help?
Output:
Benchmark 2017 2016 2016 2016 2016 2016 2016 2016 2016 2016 2017 2017 2017 Total
1 Benchmark 2017 Apr May 2016 Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar
2 Actual Net Birth/Death 404 180 15 244 105 -38 255 -14 -35 -179 98 76 1,111
3 Forecast Net Birth/Death 255 231 99 154 113 -58 237 7 -17 -247 124 32 930
4 Difference 149 -51 -84 90 -8 20 18 -21 -18 68 -26 44 181
5 Cumulative Difference 149 98 14 104 96 116 134 113 95 163 137 181
Code below:
library(rvest)    # read_html(), html_nodes(), html_table()
library(xml2)     # xml_attrs()
library(dplyr)    # bind_rows()

webpage <- read_html("https://www.bls.gov/web/empsit/cesbmart.htm")
links   <- html_nodes(webpage, "table")

# One row of attributes per <table> node on the page
titleMat <- bind_rows(lapply(xml_attrs(links), function(x)
  data.frame(as.list(x), stringsAsFactors = FALSE)))

tableExtract <- list()
tableNames   <- array()
tableCap     <- array()
emptyArr     <- array()
takeOut      <- array()
counter      <- 0

# Record the indices of tables with no class attribute so they can be dropped
for (i in 1:nrow(titleMat)) {
  bool1 <- (titleMat[i, "class"] == "NA")
  if (is.na(bool1)) {
    counter <- counter + 1
    takeOut[counter] <- i
  }
}

tableID <- titleMat[, "id"]
if (counter > 0) {
  tableID <- tableID[-c(takeOut)]
}

emptyCheck <- 0
for (cnt in 1:length(tableID)) {
  # Keep the caption text, if the table has one (isTRUE() also handles
  # the zero-length result when no <caption> node is found)
  capCheck <- webpage %>%
    html_nodes(paste0("#", tableID[cnt])) %>%
    html_nodes("caption") %>%
    html_text()
  if (isTRUE(nchar(capCheck) > 0)) {
    tableCap[cnt] <- trimws(capCheck)
  }

  # An id whose <thead>, <tbody>, and <tfoot> are all empty has no real table
  thead <- webpage %>% html_nodes(paste0("#", tableID[cnt])) %>% html_nodes("thead") %>% html_text()
  tbody <- webpage %>% html_nodes(paste0("#", tableID[cnt])) %>% html_nodes("tbody") %>% html_text()
  tfoot <- webpage %>% html_nodes(paste0("#", tableID[cnt])) %>% html_nodes("tfoot") %>% html_text()

  if (isTRUE(nchar(thead) > 0) || isTRUE(nchar(tbody) > 0) || isTRUE(nchar(tfoot) > 0)) {
    # fill = TRUE pads rows that are short because of merged (colspan) cells
    tableExtract[[cnt]] <- webpage %>%
      html_nodes(paste0("#", tableID[cnt])) %>%
      html_table(fill = TRUE) %>%
      .[[1]]
    tableNames[cnt] <- paste0(tableID[cnt], " - v")
  } else {
    tableExtract[[cnt]] <- matrix("There are no recent updates for this table", 1, 1)
    tableNames[cnt]      <- paste0(tableID[cnt], " - v")
    emptyCheck           <- emptyCheck + 1
    emptyArr[emptyCheck] <- cnt
  }
}
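For reference, what I'd expect to end up doing is something like this minimal sketch. It assumes the problem table is the last entry in tableExtract, and that after dropping the columns created purely by fill = TRUE the 14 columns shown in the output remain; the month labels are copied from the output above:

# Assumed fix, not verified against the live page: drop the misfilled header
# row, drop all-NA padding columns, and assign the intended names by hand
tbl7 <- tableExtract[[length(tableExtract)]]   # problem table, assumed last
tbl7 <- tbl7[-1, ]                             # row 1 repeats the broken header
tbl7 <- tbl7[, colSums(!is.na(tbl7)) > 0]      # remove columns fill = TRUE padded
names(tbl7) <- c("Benchmark",
                 "Apr", "May", "Jun", "Jul", "Aug", "Sep",
                 "Oct", "Nov", "Dec", "Jan", "Feb", "Mar", "Total")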
1 answer
answered 2018-02-13 07:51
Guillaume Ottavianoni
You can try the html_table() function with the fill parameter:
library(rvest)

url <- "https://www.bls.gov/news.release/empsit.htm"
page <- read_html(url)
tables <- page %>% html_nodes("table")

for (i in 1:length(tables)) {
  # try() so one malformed table doesn't stop the loop
  content <- try(tables[i] %>% html_table(fill = TRUE))
  if (typeof(content) == "list") {
    content <- data.frame(content)
  } else {
    content <- matrix(content)[[1]]
    content <- content[-c(1, length(content[, 1]) - 1, length(content[, 1])), ]
  }
  # one object per table: table_1, table_2, ...
  assign(paste0("table_", i), content)
}
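As a quick check, something like this lists the objects the loop created and previews one of them (table_1 here is just an example name from the loop above; how many objects you get depends on how many table nodes the page has):

# Every scraped table is now a data frame in the workspace
ls(pattern = "^table_")
head(table_1)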
Hope that will help
Gottavianoni