Scraping .aspx page with Python yields 404

I'm a web-scraping beginner and am trying to scrape this webpage: https://profiles.doe.mass.edu/statereport/ap.aspx

I'd like to be able to put in some settings at the top (like District, 2020-2021, Computer Science A, Female) and then download the resulting data for those settings.

Here's the code I'm currently using:

import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
    soup = BeautifulSoup(r.text,"lxml")
    data = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    
    
    data["ctl00$ContentPlaceHolder1$ddReportType"]="DISTRICT",
    data["ctl00$ContentPlaceHolder1$ddYear"]="2021",
    data["ctl00$ContentPlaceHolder1$ddSubject"]="COMSCA",
    data["ctl00$ContentPlaceHolder1$ddStudentGroup"]="F",
    
    p = s.post(url,data=data)

When I print out p.text, I get a page whose title is '\t404 - Page Not Found\r\n' and whose message is:

<h2>We are unable to locate information at: <br /><br />http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>

Here's what data looks like before I modify it:

{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW+2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh+Jhjqd1LNep6GtJczTu236xw2xaJFSzyG+xo1ygDunu7BCYVmh+LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE+YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo+FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV+wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w+C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC+9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm+ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB+e+FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh+k+Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5+P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM+i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj+FznOh264qDkMm7fF78WBO2v0x+or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7+77NJV+M6uIVVnCyiDRGj+oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk+bLYy76sI/N2aK+aXZo/JPQ+23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px+m4/e2Iecd86E4eN4Rk6OIiwqGp+dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9+7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o+nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ+QP+2bSgrcw+/dw+86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo+7WCKrzO5+Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO+yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d+CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7+3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K+oQul4pL5u+zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE+9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
 '__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna+yxVhQxpRF4n2+zYKQtotwRPqzuCkRvyU=',
 '__VIEWSTATEGENERATOR': '2B6F8D71',
 'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
 'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
 'leftNavId': '11241',
 'quickSearchValue': '',
 'runQuickSearch': 'Y',
 'searchType': 'QUICK',
 'searchtext': ''}

Following suggestions from similar questions, I've tried playing around with the parameters, editing data in various ways (to emulate the POST request that I see in my browser when I navigate the site myself), and specifying an ASP.NET_SessionId, but to no avail.

How can I access the information from this website?

2 answers

  • answered 2022-05-07 15:33 Flow

    This should be what you are looking for. I used bs4 to parse the HTML and find the table, then grabbed its rows; to make the data easier to work with, I put it into a dictionary keyed by the column headers.

    import requests
    from bs4 import BeautifulSoup


    url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
    with requests.Session() as s:
        s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
        r = s.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        table = soup.find_all('table')
        rows = table[0].find_all('tr')
        data = {}
        keys = []
        for row in rows:
            if row.find_all('th'):
                # Header row: the column headers become the dictionary keys
                keys = row.find_all('th')
                for key in keys:
                    data[key.text] = []
            else:
                # Data row: pair each cell with its header by position.
                # (zip avoids keys[values.index(value)], which picks the wrong
                # column whenever two cells in a row hold the same value.)
                values = row.find_all('td')
                for key, value in zip(keys, values):
                    data[key.text].append(value.text)

    for key in data:
        print(key, data[key][:10])
        print('\n')
    
    

    The output:

    District Name ['Abington', 'Academy Of the Pacific Rim Charter Public (District)', 'Acton-Boxborough', 'Advanced Math and Science Academy Charter (District)', 'Agawam', 'Amesbury', 'Amherst-Pelham', 'Andover', 'Arlington', 'Ashburnham-Westminster']
    
    
    District Code ['00010000', '04120000', '06000000', '04300000', '00050000', '00070000', '06050000', '00090000', '00100000', '06100000']
    
    
    Tests Taken ['     100', '     109', '   1,070', '     504', '     209', '     126', '     178', '     986', '     893', '      97']
    
    
    Score=1 ['      16', '      81', '      12', '      29', '      27', '      18', '       5', '      70', '      72', '       4']
    
    
    Score=2 ['      31', '      20', '      55', '      74', '      65', '      34', '      22', '     182', '     149', '      23']
    
    
    Score=3 ['      37', '       4', '     158', '     142', '      55', '      46', '      37', '     272', '     242', '      32']
    
    
    Score=4 ['      15', '       3', '     344', '     127', '      39', '      19', '      65', '     289', '     270', '      22']
    
    
    Score=5 ['       1', '       1', '     501', '     132', '      23', '       9', '      49', '     173', '     160', '      16']
    
    
    % Score 1-2 ['  47.0', '  92.7', '   6.3', '  20.4', '  44.0', '  41.3', '  15.2', '  25.6', '  24.7', '  27.8']
    
    
    % Score 3-5 ['  53.0', '   7.3', '  93.7', '  79.6', '  56.0', '  58.7', '  84.8', '  74.4', '  75.3', '  72.2']
    
    
    
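    Stripped of the scraping, the core of this approach is building a dict keyed by the header row and zipping each data row against those headers. Here is a minimal standalone sketch of just that step, on hypothetical pre-extracted rows (the district names and numbers are only illustrative):

```python
# Hypothetical pre-extracted table rows; the first row holds the headers,
# exactly as the <th>/<td> loop above sees them.
rows = [
    ["District Name", "Tests Taken"],
    ["Abington", "100"],
    ["Acton-Boxborough", "1,070"],
]

headers, *body = rows
data = {h: [] for h in headers}          # one empty column list per header
for row in body:
    for h, cell in zip(headers, row):    # pair cells with headers by position
        data[h].append(cell)

print(data["Tests Taken"])               # ['100', '1,070']
```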
    
    

  • answered 2022-05-07 18:48 perigon

    I was able to get this working by adapting the code from here. I'm not sure why editing the payload in this way made the difference, so I'd be grateful for any insights!

    Here's my working code, using Pandas to parse out the tables:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
    with requests.Session() as s:
        s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
        
        response = s.get(url)
        soup = BeautifulSoup(response.content, 'html5lib')
    
        data = {tag['name']: tag['value']
                for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
        state = {tag['name']: tag['value']
                 for tag in soup.select('input[name^=__]')}
        
        payload = data.copy()
        payload.update(state)
        
        # No trailing commas here: `payload[...] = "DISTRICT",` would make
        # each value a one-element tuple rather than a string.
        payload["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT"
        payload["ctl00$ContentPlaceHolder1$ddYear"] = "2021"
        payload["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA"
        payload["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F"

        p = s.post(url, data=payload)
        df = pd.read_html(p.text)[0]

        # Restore the leading zeros that the numeric parse strips off
        df["District Code"] = df["District Code"].astype(str).str.zfill(8)
        print(df)  # use display(df) instead if running in a Jupyter notebook
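    One subtlety worth noting: the trailing commas in the question's assignments (`data[...] = "DISTRICT",`) silently turn each value into a one-element tuple, not a string. This happens to work anyway, because `requests` treats sequence values in `data` as repeated parameters, and a one-element tuple encodes the same as the bare string. A standalone sketch using the stdlib encoder, which handles sequences the same way via `doseq=True`:

```python
from urllib.parse import urlencode

payload = {}
payload["ddReportType"] = "DISTRICT",  # trailing comma -> one-element tuple
payload["ddYear"] = "2021"             # no comma -> plain string

print(type(payload["ddReportType"]))   # <class 'tuple'>

# Sequence values are expanded, so both keys encode the same way:
print(urlencode(payload, doseq=True))  # ddReportType=DISTRICT&ddYear=2021
```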
    
