Where do we put the "html.parser" argument when web scraping?

Look at the following snippet of code:

import requests
from bs4 import BeautifulSoup
url = "https://example.com"  # Placeholder: insert your URL here

# Method 1
html = requests.get(url, "html.parser")
soup = BeautifulSoup( html.text )

# Method 2
html2 = requests.get(url)
soup2 = BeautifulSoup(html2.text, "html.parser")

Which method is correct, Method 1 or Method 2? Should we put "html.parser" in requests.get() or in BeautifulSoup()?

2 answers

  • answered 2020-08-11 00:39 bigbounty

    Parsers are not part of an HTTP request.

    A parser is what interprets a particular type of document. So when you parse the HTML document with BeautifulSoup, you have to specify the parser there.

    So, method 2 is correct.

    Docstring of the BeautifulSoup constructor:

    :param markup: A string or a file-like object representing markup to be parsed.

    :param features: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
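As a minimal sketch of Method 2, the parser name goes in as the second argument (`features`) of the BeautifulSoup constructor. The inline `markup` string below is a made-up stand-in for `html2.text`, so the example runs without a network request; the tag contents and the `intro` class are illustrative assumptions only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the text of an HTTP response.
markup = (
    "<html><head><title>Example</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# The parser is named here, in the BeautifulSoup constructor.
soup = BeautifulSoup(markup, "html.parser")

title = soup.title.string
intro = soup.find("p", class_="intro").get_text()
print(title, intro)
```

Naming the parser explicitly, as the docstring recommends, should keep the result the same regardless of which optional parsers happen to be installed.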

  • answered 2020-08-11 00:42 Jeremy Farmer

    If I understand correctly, your method 2 is correct, and you would want to put it on the BeautifulSoup constructor because:

    1. Requests is separate from Beautiful Soup, and I don't believe passing "html.parser" to requests.get() will do anything useful
    2. You want to specify the parser for Beautiful Soup because it could be parsing things other than HTML, e.g. XML with lxml's XML parser

    Beautiful Soup Docs
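To see what Method 1 actually does with the string, here is a small sketch. The second positional parameter of `requests.get()` is `params`, so "html.parser" is treated as query-string data rather than as a parser. The example builds the request without sending it; the URL is a placeholder assumption.

```python
import requests

url = "https://example.com"  # placeholder URL, not from the question

# Equivalent to what requests.get(url, "html.parser") would prepare:
# the second positional argument is bound to `params`.
prepared = requests.Request("GET", url, params="html.parser").prepare()

# The string ends up appended to the URL as a query string.
print(prepared.url)
```

So Method 1 does not fail outright, but it sends an unintended query string and still leaves BeautifulSoup to pick a parser on its own.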