Getting only the raw link using bs4 and requests

What I'm aiming to get is only the raw link, which I can then use to download the image, but I keep getting some extra characters along with the link.

from bs4 import BeautifulSoup
import requests

def getPages():
    x = 0
    url = 'https://readheroacademia.net/manga/boku-no-hero-academia-chapter-137/'
    req = requests.get(url)
    webpage = req.content
    soup = BeautifulSoup(webpage, 'html.parser')
    pages = soup.findAll('div', attrs={'class': 'acp_content'})
    for p in pages:
        y = p.findAll('img')
        print(y)
getPages()

What I end up getting looks like this:

[<img src="https://2.bp.blogspot.com/-p72DilhF-_s/WRSF41vu50I/AAAAAAAAlsk/6BTxzQAzPkwteMgEHch2JFH0JKKpbKrZACHM/s16000/0137-001.png"/>]

and I was hoping I could get something like this:

https://2.bp.blogspot.com/-p72DilhF-_s/WRSF41vu50I/AAAAAAAAlsk/6BTxzQAzPkwteMgEHch2JFH0JKKpbKrZACHM/s16000/0137-001.png

3 answers

  • answered 2018-07-11 02:44 Yang K

    If you want to get only the src, you can do:

    for p in pages:
        y = [tag["src"] for tag in p.findAll("img")]
        print(y)
    

    This pulls the src URL out of each img tag instead of printing the whole tag.

    Also, if you're using bs4 (BeautifulSoup 4), prefer find_all over findAll. findAll is the older BeautifulSoup 3 name; bs4 still accepts it, but only for backward compatibility.
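
As a minimal sketch of the approach above, the snippet below runs the same extraction against an inline HTML string rather than the live site, so it can be tried without a network connection. The sample markup, the example.com URLs, and the helper name extract_image_urls are assumptions made for illustration; only the acp_content class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<div class="acp_content">
  <img src="https://example.com/0137-001.png"/>
  <img src="https://example.com/0137-002.png"/>
</div>
<div class="sidebar"><img src="https://example.com/ad.png"/></div>
"""


def extract_image_urls(html):
    """Return the src of every <img> inside a div with class acp_content."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for div in soup.find_all("div", class_="acp_content"):
        # src=True keeps only <img> tags that actually carry a src attribute.
        urls.extend(img["src"] for img in div.find_all("img", src=True))
    return urls


image_urls = extract_image_urls(SAMPLE_HTML)
print(image_urls)
```

Each element of image_urls is a plain string, so it can be passed straight to requests.get to download the image.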

  • answered 2018-07-11 03:02 L--

    I think this will work:

    >>> from bs4 import BeautifulSoup
    >>> data = """<img src="https://2.bp.blogspot.com/-p72DilhF-_s/WRSF41vu50I/AAAAAAAAlsk/6BTxzQAzPkwteMgEHch2JFH0JKKpbKrZACHM/s16000/0137-001.png"/>"""
    >>> soup = BeautifulSoup(data, "lxml")
    >>> for i in soup.find_all("img"):
    ...     link = i.get("src")
    ...     print(link)
    ...
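
One practical difference between the .get("src") call used here and the tag["src"] indexing form is how a missing attribute is handled. The sketch below illustrates this on a made-up two-image snippet (the markup and URL are assumptions, not taken from the site): .get returns None when the attribute is absent, while indexing raises KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <img> has no src attribute.
data = '<img src="https://example.com/a.png"/><img alt="no source"/>'
soup = BeautifulSoup(data, "html.parser")
images = soup.find_all("img")

# .get() returns None for a missing attribute instead of raising.
links = [img.get("src") for img in images]

# Drop the missing entries before trying to download anything.
usable_links = [link for link in links if link is not None]

# Indexing the tag like a dict raises KeyError when the attribute is absent.
try:
    images[1]["src"]
    indexing_failed = False
except KeyError:
    indexing_failed = True

print(links)
print(usable_links)
```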
    

  • answered 2018-07-11 03:38 wp78de

    An alternative approach is to use XPath. I suggest using lxml here, since Beautiful Soup has no XPath support. This is actually a very simple solution:

    from lxml import html
    import requests
    
    page = requests.get('https://readheroacademia.net/manga/boku-no-hero-academia-chapter-137/')
    tree = html.fromstring(page.content)
    # This creates a list of img src attributes beneath the `<div id="acp_content" class="acp_content">` tag:
    srcs = tree.xpath('//div[@id="acp_content"]//img/@src')
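
To see what the XPath expression returns without hitting the site, here is a small offline sketch. It assumes lxml is installed and uses invented markup and example.com URLs; only the acp_content id and the XPath itself come from the answer.

```python
from lxml import html

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <div id="acp_content" class="acp_content">
    <p><img src="https://example.com/0137-001.png"/></p>
    <p><img src="https://example.com/0137-002.png"/></p>
  </div>
  <div id="sidebar"><img src="https://example.com/ad.png"/></div>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# The trailing /@src selects the attribute values themselves, so the result
# is a list of strings rather than a list of <img> elements.
srcs = tree.xpath('//div[@id="acp_content"]//img/@src')

for src in srcs:
    print(src)
```

Because the expression ends in /@src, no extra step is needed to unwrap the tags; the sidebar image is excluded because it sits outside the acp_content div.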