Trouble parsing HTML with BeautifulSoup or golang colly

FTR I have written quite a few scrapers successfully in both frameworks but I'm stumped. Here is a screenshot of the data I'm trying to scrape (you can also go to the actual link in the get request):

enter image description here

I attempt to target the div.section_content:

import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html)
soup.findAll("div", {"class": "section_content"})

Printing the last line shows some other divs, but not the one with the pitching data.

However, I can see it's in the text, so it's not a javascript triggered loading problem (the phrase "Pitching" only comes up in that table):

>>> "Pitching" in soup.text
True

Here is an abbreviated version of one of the golang attempts:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
) 

func main() {
    c := colly.NewCollector(
            colly.AllowedDomains("www.baseball-reference.com"),
    )   
    c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
            fmt.Println(e.ChildText("div.section_content"))
    })  
    c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")

} }

1 answer

  • answered 2018-07-12 07:32 damd

    It looks to me like the HTML is actually commented out, so that's why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it or use BeautifulSoup to extract the comments and parse the return value.

    For example:

    for element in soup(text=lambda text: isinstance(text, Comment)):
        comment = element.extract()
        comment_soup = BeautifulSoup(comment)
        # work with comment_soup