IndexError: list index out of range when scraping an HTML table - Python

I appreciate this has been asked many times but I've been stuck here for quite a while.

I'm trying to take all the data from a table on a website and put it into a pandas dataframe.

I've written the code to do the web scraping, but for some reason I'm getting the error below while appending the cell values to my lists.

import requests
url = 'http://www.londonstockexchange.com/exchange/prices/stocks/summary/fundamentals.html?fourWayKey=GB00BCDBXK43GBGBXASX1'

page = requests.get(url).text

from bs4 import BeautifulSoup

soup = BeautifulSoup(page)

# print(soup.prettify())

all_tables = soup.find_all('table')

right_table = soup.find_all('table', {'class':'table_dati'})
tbl1 = right_table[0]

A = []
B = []
C = []
D = []
E = []
F = []

for row in tbl1.find_all('tr'):
  cells = row.find_all('td')
  A.append(cells[0].find(text = True))
  B.append(cells[1].find(text = True))
  C.append(cells[2].find(text = True))
  D.append(cells[3].find(text = True))
  E.append(cells[4].find(text = True))
  F.append(cells[5].find(text = True))

Here's the error:

A.append(cells[0].find(text = True))

IndexError: list index out of range

Appreciate the help, Thanks

1 answer

  • answered 2018-03-13 23:39 Alex

    If you look at the HTML, the first row your loop visits is the header row: it contains only th cells and no td. row.find_all('td') therefore returns an empty list, so cells[0] does not exist and raises the IndexError.

    This is the first row:

    <tr>
       <th class="name">Income Statement</th>
       <th>
          31-May-13 <br>( £
          m&nbsp;)
       </th>
       <th>
          31-May-14 <br>( £
          m&nbsp;)
       </th>
       <th>
          31-May-15 <br>( £
          m&nbsp;)
       </th>
       <th>
          31-May-16 <br>( £
          m&nbsp;)
       </th>
       <th>
          31-May-17 <br>( £
          m&nbsp;)
       </th>
    </tr>
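
    To see the failure in isolation, here is a minimal sketch that runs offline. The HTML string is a trimmed copy of the header row plus one made-up data row (the 0.00 value is a placeholder, not real data), and 'html.parser' is my choice of parser, not something from the question.

```python
from bs4 import BeautifulSoup

# Trimmed header row plus one invented data row (placeholder value).
html = """
<table class="table_dati">
  <tr><th class="name">Income Statement</th><th>31-May-13</th></tr>
  <tr><td class="name">Revenue</td><td>0.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
header_row, data_row = soup.find_all("tr")

# The header row has only <th> cells, so searching for <td> finds nothing.
print(header_row.find_all("td"))
# The data row does have <td> cells.
print(len(data_row.find_all("td")))

# Indexing the empty result reproduces the error from the question.
try:
    header_row.find_all("td")[0]
except IndexError as exc:
    print("IndexError:", exc)
```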
    

    You can wrap the loop body in try/except, or select only the rows inside the tbody.

    Based on your code, you can pass a list of tags to find_all() and skip any row with fewer than 6 cells. For the future, though, it's better to build the lists dynamically instead of hard-coding six of them.

    for row in tbl1.find_all('tr'):
        try:
            # Collect header (th) and data (td) cells alike.
            cells = row.find_all(['td', 'th'])
            # Skip rows that don't have all six columns.
            if len(cells) < 6:
                continue
            A.append(cells[0].find(text=True).strip())
            B.append(cells[1].find(text=True).strip())
            C.append(cells[2].find(text=True).strip())
            D.append(cells[3].find(text=True).strip())
            E.append(cells[4].find(text=True).strip())
            F.append(cells[5].find(text=True).strip())
        except Exception as e:
            # e.g. an empty cell: find() returns None and .strip() fails.
            print(e)
    print(A)
    

    The output is:

    [
      "Income Statement",
      "Revenue",
      "Operating Profit/(Loss)",
      "Net Interest",
      "Profit Before Tax",
      "Profit After Tax",
      "Profit After Tax",
      "PROFIT FOR THE PERIOD",
      "Minority Interests",
      "Equity Holders of Parent Company",
      "Earnings per Share - Basic",
      "Earnings per Share - Diluted",
      "Earnings per Share - Adjusted",
      "Earnings per Share - Basic",
      "Earnings per Share - Diluted",
      "Earnings per Share - Adjusted",
      "Dividend per Share"
    ]
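
    Since the goal was a pandas DataFrame, here is a rough sketch of the "dynamic" approach: collect each row as a list of cell strings, then hand the rows to pandas. table_to_rows is a helper name I made up, the sample HTML and its numbers are placeholders rather than real figures from the page, and it assumes every kept row has the same number of cells.

```python
from bs4 import BeautifulSoup
import pandas as pd


def table_to_rows(table):
    """Return one list of cell strings per non-empty <tr> (hypothetical helper)."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        if not cells:
            # Nothing to index in this row, so skip it.
            continue
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


# Placeholder table: same shape as the real one, invented values.
sample_html = """
<table class="table_dati">
  <tr><th class="name">Income Statement</th><th>31-May-16</th><th>31-May-17</th></tr>
  <tr><td class="name">Revenue</td><td>1.00</td><td>2.00</td></tr>
  <tr><td class="name">Net Interest</td><td>-0.10</td><td>-0.20</td></tr>
  <tr></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "table_dati"})
rows = table_to_rows(table)

# First row holds the column names, the rest is data.
header, *data = rows
df = pd.DataFrame(data, columns=header)

# Convert the value columns from strings to numbers; unparseable cells become NaN.
value_cols = header[1:]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
print(df)
```

    With the real page you would pass tbl1 to the helper instead of the sample table. Nothing here depends on the number of columns, so there is no need for separate A to F lists.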