Scrapy: Check if response is an image

I need to check whether a response is an image.

For work, I need to generate URLs for photos that may or may not exist, and record only the URLs that actually return an image.

When a generated URL doesn't point to a photo, the website responds with an HTML page whose body is:

<body>No File Found</body> 

The response.status is also 200 in that case.

The response headers don't contain anything useful for telling the two cases apart; they look the same for an image and for No File Found.

For instance:
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Transfer-Encoding: chunked
Expires: 0
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
X-Frame-Options: AllowAll
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: *
Date: Tue, 13 Aug 2019 01:44:40 GMT
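
Since these headers carry no Content-Type, one header-independent option is to look at the first bytes of the body instead. The sketch below is only an illustration of that idea, not something taken from the site or from Scrapy: the name `looks_like_image` and the signature list are assumptions, and it covers only JPEG, PNG and GIF.

```python
# Leading byte signatures ("magic numbers") of a few common image formats.
# Assumption: the site only serves JPEG, PNG or GIF; extend the tuple if not.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 variant)
    b"GIF89a",             # GIF (1989 variant)
)


def looks_like_image(body: bytes) -> bool:
    """Return True if the raw body starts with a known image signature."""
    # bytes.startswith accepts a tuple and matches any of its entries.
    return body.startswith(IMAGE_SIGNATURES)
```

In a Scrapy callback the raw bytes are available as `response.body`, so the check would be `looks_like_image(response.body)`.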

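The way I found to check whether the response is an image in this case was:

        # Module-level import: from scrapy.exceptions import NotSupported
        try:
            # Raises NotSupported when the body isn't text (i.e. it's an image)
            response.xpath("/html/body[contains(., 'No File Found')]")
        except NotSupported:
            # Not text, so the URL points at an image: record it
            photo = PhotoItem()

            photo['id'] = id
            photo['url'] = response.url

            yield photo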

This works because, when the response is an image, the line

no_file_found = response.xpath("/html/body[contains(., 'No File Found')]")

throws this exception:

raise NotSupported("Response content isn't text")

I know this isn't an elegant solution, but it works in this context.

My question is whether there is a more elegant way to solve this problem that doesn't rely on try/except.

Note that I don't need to download the image; I only need to record the valid URL.

Any suggestion is welcome.

Thanks in advance!!!

1 answer

  • answered 2019-08-13 09:41 stranac

    The simplest way would probably be to just check the type of the response:

    from scrapy.http.response.text import TextResponse

    if not isinstance(response, TextResponse):
        # it's probably an image; do image stuff
        pass