Scraping data from a table on a website

I need help extracting (scraping) data from a table on a webpage using Beautiful Soup. I am unable to extract the table containing the Compliance information. Any help would be appreciated.

I need all of the rows from the Compliance information table. There are several tables within the single webpage, but I only need the Compliance information data and I don't know how to target that table.

My code, including the URL, is given below:

import random
import time

import requests
import unicodecsv
from bs4 import BeautifulSoup

link = ["http://ec.europa.eu/environment/ets/ohaDetails.do?returnURL=&languageCode=en&accountID=&registryCode=&buttonAction=all&action=&account.registryCode=&accountType=&identifierInReg=&accountHolder=&primaryAuthRep=&installationIdentifier=&installationName=&accountStatus=&permitIdentifier=&complianceStatus=&mainActivityType=-1&searchType=oha&resultList.currentPageNumber=1&nextList=Next%C2%A0%3E&selectedPeriods="]

for pagenum, links in enumerate(link):

  print(links)
  r = requests.get(links)

  time.sleep(random.randint(2,5)) 

  soup = BeautifulSoup(r.content,"lxml")

  value = []

  data_block = soup.find_all("table", {"class": "bordertb"})

  print (data_block)

  output = []

  for item in data_block:

    table_data = item.find_all("td", {"class": "tabletitle"})[0].table

    value.append([table_data])

    print (value)


  with open("Exhibit_2_EXP_data.tsv", "wb") as outfile:

    writer = unicodecsv.writer(outfile, delimiter="\t")

    writer.writerow(["Data_Output"])

    for item in value:

      writer.writerow(item)
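Since the page contains several tables, a useful first step is to list each table's `id` and `class` attributes so the right one can be identified. A minimal sketch of that idea on stand-in markup (the HTML below is a simplified illustration, not the real EU ETS page, and the table ids/classes are only assumptions):

```python
from bs4 import BeautifulSoup

# Stand-in markup: several tables, only one of which we want.
html_doc = """
<table id="tblAccountGeneralInfo"><tr><td>general</td></tr></table>
<table id="tblInstallationContacts"><tr><td>contact</td></tr></table>
<table class="bordertb"><tr><td>compliance</td></tr></table>
"""

soup = BeautifulSoup(html_doc, "lxml")
tables = soup.find_all("table")
for i, table in enumerate(tables):
    # Print each table's index, id, and class to decide which to scrape.
    print(i, table.get("id"), table.get("class"))
```

Running this against the real page would show which tables carry a usable `id` or `class`, which is what the accepted answer below relies on.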

1 answer

  • answered 2018-05-16 08:06 SIM

    Try this. The script below should fetch the content of that table. To make the selection specific, start from the previous table (which has a unique ID), then use the appropriate method to reach the content of your desired table. Here is what I did to achieve that:

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://ec.europa.eu/environment/ets/ohaDetails.do?returnURL=&languageCode=en&accountID=&registryCode=&buttonAction=all&action=&account.registryCode=&accountType=&identifierInReg=&accountHolder=&primaryAuthRep=&installationIdentifier=&installationName=&accountStatus=&permitIdentifier=&complianceStatus=&mainActivityType=-1&searchType=oha&resultList.currentPageNumber=1&nextList=Next%C2%A0%3E&selectedPeriods="
    
    r = requests.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.find(id="tblInstallationContacts").find_next_sibling().find_all("tr")[:-5]:
        data = [item.get_text(strip=True) for item in items.find_all("td")]
        print(data)
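The id-then-sibling technique in the answer can be checked offline on a small HTML snippet. This is only a sketch on simplified stand-in markup (not the actual EU ETS HTML), reusing the `tblInstallationContacts` id from the answer:

```python
from bs4 import BeautifulSoup

# Stand-in markup: an anchor table with a unique id, followed by
# the unlabelled compliance table we actually want.
html_doc = """
<table id="tblInstallationContacts"><tr><td>contact</td></tr></table>
<table>
  <tr><td>Phase</td><td>Year</td></tr>
  <tr><td>Phase 3</td><td>2013</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "lxml")

# Locate the table with the known id, then hop to the next <table> sibling.
target = soup.find(id="tblInstallationContacts").find_next_sibling("table")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in target.find_all("tr")]
print(rows)
```

Passing `"table"` to `find_next_sibling` skips any whitespace text nodes between the two tables, which makes the hop a little more robust than calling it with no arguments.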