How can I find out correct div, class, span when scraping a html page

I am new in Web scraping technology. I tried to implement Web scraping after reading various web tutorials like this and this. Those articles are about amazon web scraping and Netflix web scraping. There are lots of other tutorials on Imdb, Rotten Tomatoes and others. Those tutorials give me overview which attributes need to take like class attributes, div tags etc. Different websites have different methods to take those tags. However those tags are the fundamental elements of web scraping. When I follow those tutorials I can implement those codes but when I try to parse a different website other than the mentioned one I failed. Recently, I tried the code block over priceline. But I just messed up with so many html codes.

My code for price line

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url= 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=8848a774a531423bde3ed4ff3486f8bb'
r = requests.get(url, headers=headers)#, proxies=proxies)
content = r.content
soup = BeautifulSoup(content)

name=[]
hotel_div = soup.find_all('div', class_='Box-sc-8h3cds-0.Flex-sc-1ydst80-0.iNmVhl')
for container in hotel_div:
   name = d.find('span', attrs={'class':'Box-sc-8h3cds-0 Flex-sc-1ydst80-0 BadgeRow__BadgeContainer-fofgl-0 kmpPcP SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf'})
   n = name.find_all('img', alt=True)
   row={}
   if name is not None:
     #print(n[0]['alt'])
     row['Name'] = n[0]['alt']
   else:
      row['Name'] = "unknown-product"
print(name)

It returns an empty array.

Can any one suggest any tutorial or web blogs which help me to identify the correct html tags for any website?

Thank you for the help

2 answers

  • answered 2021-02-22 22:53 guipleite

    Each web developer will choose to name their classes and tags differently.

    To check how a new site is structured you can right click on what you want to scrape and then click on inspect and a tab should appear where you can find the tag, class name, etc

  • answered 2021-02-23 11:14 Haris

    import requests
    url = 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=384bcee7658bac00176bebdd4215edc0'    
    r = requests.get(url)
    print(r)
    

    Write the above piece of code in your command prompt or terminal. If you get the output <Response [403]> Then please refer to this post: Python requests. 403 Forbidden