Where do we put the "html.parser" argument when web scraping?
Look at the following snippet of code:
    import requests
    from bs4 import BeautifulSoup

    url = ...  # Insert URL here

    # Method 1
    html = requests.get(url, "html.parser")
    soup = BeautifulSoup(html.text)

    # Method 2
    html2 = requests.get(url)
    soup2 = BeautifulSoup(html2.text, "html.parser")
Which method is correct, Method 1 or Method 2? Should "html.parser" go in requests.get() or in BeautifulSoup()?
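Parsers are not part of an HTTP request.
A parser is what turns a downloaded document into a tree you can search, and different parsers handle different types of document. So you name the parser when you parse the HTML with BeautifulSoup, not when you fetch it. In Method 1, the second positional argument of requests.get() is params, so "html.parser" is treated as query-string data instead of selecting a parser.
So, Method 2 is correct.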
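To see why Method 1 does not select a parser, here is a minimal sketch that builds the request without sending it and inspects the resulting URL. The URL is a made-up placeholder, and using `requests.Request(...).prepare()` is just a convenient way to look at what `requests.get(url, "html.parser")` would send, not something from the question.

```python
import requests

# Hypothetical URL, used only for illustration; nothing is sent over the network.
url = "https://example.com/page"

# requests.get(url, "html.parser") passes the string as `params`,
# so we prepare an equivalent request and look at the final URL.
prepared = requests.Request("GET", url, params="html.parser").prepare()

print(prepared.url)  # the string shows up in the query part of the URL
```

The string ends up in the URL's query part, so the server sees it as request data and BeautifulSoup never learns about it.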
Docstring of the BeautifulSoup constructor:
:param markup: A string or a file-like object representing markup to be parsed.
:param features: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
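As a small illustration of the `features` parameter described above, the sketch below parses an inline HTML string with the built-in "html.parser". The markup and variable names are invented for the example; in the question's code the string would come from `html2.text` instead.

```python
from bs4 import BeautifulSoup

# Inline markup stands in for a downloaded page (assumption for the example).
html_text = (
    "<html><head><title>Demo page</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)

# The parser name is the second argument (`features`) of the constructor.
soup = BeautifulSoup(html_text, "html.parser")

title = soup.title.string                          # text inside <title>
links = [a["href"] for a in soup.find_all("a")]    # href of every <a> tag

print(title)
print(links)
```

Naming the parser explicitly, as the docstring recommends, keeps the result the same regardless of which optional parsers happen to be installed.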
If I understand correctly, your Method 2 is correct, and you want to pass the parser to the BeautifulSoup constructor because:
- Requests is separate from Beautiful Soup, and I don't believe passing "html.parser" to requests.get() will do anything to select a parser.
- You want to specify the parser for Beautiful Soup because it could be parsing things other than HTML, e.g. XML with lxml's XML parser.