Getting only the raw link using bs4 and requests

What I'm aiming to get is only the raw link, which I can then use to download the image, but I keep getting some extra characters along with the link.

from bs4 import BeautifulSoup
import requests

def getPages():
    x = 0
    url = ''
    req = requests.get(url)
    webpage = req.content
    soup = BeautifulSoup(webpage, 'html.parser')
    pages = soup.findAll('div', attrs={'class': 'acp_content'})
    for p in pages:
        y = p.findAll('img')

What I end up getting looks like this:

[<img src=""/>]

and I was hoping I could get something like this:

3 answers

  • answered 2018-07-11 02:44 Yang K

    If you want to get only the src, you can do:

    for p in pages:
        y = [tag["src"] for tag in p.findAll("img")]

    This pulls the URL out of each img tag instead of returning the whole tag.

    Also, if you're using bs4 (BeautifulSoup4), prefer find_all over findAll. findAll is the BeautifulSoup 3 spelling; bs4 still accepts it, but only as a legacy alias.
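
For reference, here is a minimal self-contained sketch of the list-comprehension approach. The question does not show the real URL, so the markup in SAMPLE_HTML, the example.com addresses, and the helper name extract_image_links are all made up for illustration; only the acp_content class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (the original URL is not shown).
SAMPLE_HTML = """
<div class="acp_content">
  <img src="https://example.com/images/page-1.jpg"/>
  <img src="https://example.com/images/page-2.jpg"/>
</div>
<div class="sidebar"><img src="https://example.com/ads/banner.png"/></div>
"""


def extract_image_links(html):
    """Return the src value of every <img> inside a div with class acp_content."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", attrs={"class": "acp_content"}):
        # tag["src"] yields the attribute value as a plain string, not the tag.
        links.extend(tag["src"] for tag in div.find_all("img"))
    return links


links = extract_image_links(SAMPLE_HTML)
print(links)
```

Returning a flat list of strings, rather than reassigning y on every loop pass, keeps the links from all matching divs available for a later download step.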

  • answered 2018-07-11 03:02 M.r_L

    I think this will work:

    >>> from bs4 import BeautifulSoup
    >>> data = """<img src=""/>"""
    >>> soap = BeautifulSoup(data, "lxml")
    >>> for i in soap.find_all("img"):
    ...     link = i.get("src")  # the src value as a plain string
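
One practical difference between .get("src") and indexing with ["src"] is that .get returns None when the attribute is absent, while indexing raises KeyError. The sketch below illustrates this with made-up markup (the example.com address and the second, src-less tag are assumptions), and uses the built-in html.parser so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second <img> deliberately has no src attribute.
data = '<img src="https://example.com/a.png"/><img alt="no source here"/>'
soup = BeautifulSoup(data, "html.parser")

links = []
for img in soup.find_all("img"):
    src = img.get("src")  # None if the tag has no src, instead of raising KeyError
    if src:
        links.append(src)

print(links)
```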

  • answered 2018-07-11 03:38 wp78de

    An alternative approach is to use XPath. I suggest using lxml here, since Beautiful Soup has no XPath support. This is actually a very simple solution:

    from lxml import html
    import requests
    page = requests.get('')
    tree = html.fromstring(page.content)
    # This creates a list of img src attributes beneath the `<div id="acp_content" class="acp_content">` tag:
    srcs = tree.xpath('//div[@id="acp_content"]//img/@src')
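
To try the XPath expression without a network request, the same query can be run against an inline string. This is only a sketch: SAMPLE_HTML and its example.com addresses are invented stand-ins for the real page, and it assumes lxml is installed.

```python
from lxml import html

# Hypothetical markup standing in for the real page (the original URL is not shown).
SAMPLE_HTML = """
<html><body>
<div id="acp_content" class="acp_content">
  <p><img src="https://example.com/images/page-1.jpg"/></p>
  <img src="https://example.com/images/page-2.jpg"/>
</div>
<img src="https://example.com/ads/banner.png"/>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)
# "//img" matches images at any depth below the div; "/@src" selects the attribute value.
srcs = [str(s) for s in tree.xpath('//div[@id="acp_content"]//img/@src')]
print(srcs)
```

The image outside the div is skipped because the expression only descends from the element whose id is acp_content.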