Check If Image URL Leads To Real Image in Python

I am building a Python script to download images from a list of URLs. The script works to an extent, but I don't want it to save anything for URLs that no longer lead to a real image. Checking the status code filters out a few of them, but I still get many images that I don't want, like these:

[image: bad image]

Here is my code:

import os
import requests
import shutil
import random
import urllib.request

def sendRequest(url):
    try:
        page = requests.get(url, stream = True, timeout = 1)

    except Exception:
        print('error exception')
        pass

    else:
        #HERE IS WHERE I DO THE STATUS CODE
        print(page.status_code)
        if (page.status_code == 200):
            return page

    return False

def downloadImage(imageUrl: str, filePath: str):
    img = sendRequest(imageUrl)

    if (img == False):
        return False

    with open(filePath, "wb") as f:
        img.raw.decode_content = True

        try:
            shutil.copyfileobj(img.raw, f)
        except Exception:
            return False

    return True

os.chdir('/Users/nikolasioannou/Desktop')
os.mkdir('folder')

fileURL = 'http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n04122825'
data = urllib.request.urlopen(fileURL)

output_directory = '/Users/nikolasioannou/Desktop/folder'

line_count = 0

for line in data:
    img_name = str(random.randrange(0, 10000)) + '.jpg'
    image_path = os.path.join(output_directory, img_name)
    downloadImage(line.decode('utf-8'), image_path)
    line_count = line_count + 1
#print(line_count)

Thanks for your time. Any ideas are appreciated.

Sincerely, Nikolas

1 answer

  • answered 2018-08-09 00:29 juliusmh

    You could check for the JPEG or PNG header and its magic sequence, which is usually a pretty good indicator of a valid image. Look at this SO question.

    You can take a look at file signatures (aka magic numbers) here. You then just have to check the first n bytes of response.raw.

    I modified your sendRequest and downloadImage functions a little bit. You should be able to hardcode signatures for more image formats than just the JPG magic number. I tested the code and it works (on my machine): only valid JPG images were saved. Note that I removed the stream=True flag, because the images are so small that you don't need a stream, and the saving becomes a little less cryptic. Take a look:

    def sendRequest(url):
        try:
            page = requests.get(url)
    
        except Exception as e:
            print("error:", e)
            return False
    
        # check status code
        if (page.status_code != 200):
            return False
    
        return page
    
    def downloadImage(imageUrl: str, filePath: str):
        img = sendRequest(imageUrl)
    
        if (img == False):
            return False
    
        # JPEG files start with FF D8 FF; the fourth byte varies (E0 for JFIF, E1 for Exif)
        if not img.content.startswith(b'\xff\xd8\xff'):
            return False
    
        with open(filePath, "wb") as f:
            f.write(img.content)
    
        return True
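
To extend the check beyond JPG, the signature test can be pulled into a small helper. This is only a sketch: the name detect_image_type and the table IMAGE_SIGNATURES are made up for illustration, and the table covers just three common formats rather than every file signature.

```python
# Hypothetical helper, not part of the original answer.
# Leading bytes ("magic numbers") of a few common image formats.
IMAGE_SIGNATURES = [
    ("jpeg", b"\xff\xd8\xff"),       # 4th byte varies (JFIF, Exif, ...)
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("gif", b"GIF87a"),
    ("gif", b"GIF89a"),
]


def detect_image_type(data: bytes):
    """Return a format name if data starts with a known signature, else None."""
    for name, signature in IMAGE_SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Inside downloadImage you could then write `if detect_image_type(img.content) is None: return False` instead of comparing against a single hardcoded sequence.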
    

    You could also try to open the image using Pillow and BytesIO

    >>> from PIL import Image
    >>> from io import BytesIO
    
    >>> i = Image.open(BytesIO(img.content))
    

    and see if it throws an error. But the first solution seems more lightweight, and you should rarely get false positives there. You could also check for the bytes b"<html>" in img.content and abort if they are found. This is very easy and probably very effective too.
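
The HTML check mentioned above might look roughly like this. It is a heuristic sketch, not a robust detector: the function name looks_like_html and the 512-byte probe size are assumptions chosen for illustration, and it only inspects the start of the body so that a stray "<html" deep inside binary image data is less likely to trigger it.

```python
# Hypothetical helper, not part of the original answer.
def looks_like_html(content: bytes, probe_size: int = 512) -> bool:
    """Heuristic: True if the start of the body looks like an HTML page."""
    # Only look at the first probe_size bytes, ignoring case and leading whitespace.
    head = content[:probe_size].lstrip().lower()
    return head.startswith(b"<!doctype html") or b"<html" in head
```

A call such as `if looks_like_html(img.content): return False` before writing the file would skip error pages that are served with a 200 status code.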