Check If Image URL Leads To Real Image in Python
I am building a Python script to download images from a list of URLs. The script works to an extent, but I don't want it to save anything when a URL no longer leads to a real image. Checking the status code filters out a few of them, but I still get many bad images that I don't want.
Here is my code:
    import os
    import requests
    import shutil
    import random
    import urllib.request

    def sendRequest(url):
        try:
            page = requests.get(url, stream = True, timeout = 1)
        except Exception:
            print('error exception')
            pass
        else:
            #HERE IS WHERE I DO THE STATUS CODE
            print(page.status_code)
            if (page.status_code == 200):
                return page
        return False

    def downloadImage(imageUrl: str, filePath: str):
        img = sendRequest(imageUrl)
        if (img == False):
            return False
        with open(filePath, "wb") as f:
            img.raw.decode_content = True
            try:
                shutil.copyfileobj(img.raw, f)
            except Exception:
                return False
        return True

    os.chdir('/Users/nikolasioannou/Desktop')
    os.mkdir('folder')
    fileURL = 'http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n04122825'
    data = urllib.request.urlopen(fileURL)
    output_directory = '/Users/nikolasioannou/Desktop/folder'
    line_count = 0
    for line in data:
        img_name = str(random.randrange(0, 10000)) + '.jpg'
        image_path = os.path.join(output_directory, img_name)
        downloadImage(line.decode('utf-8'), image_path)
        line_count = line_count + 1
        #print(line_count)
Thanks for your time. Any ideas are appreciated.
You could check for the JPEG or PNG header and its magic sequence, which is a pretty good indicator of a valid image. Look at this SO question.
You can take a look at file signatures (aka magic numbers) here. You then just have to check the first few bytes of the response.
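As a rough illustration of the signature check, here is a minimal sketch. The names `IMAGE_SIGNATURES` and `sniff_image_type` are hypothetical, and the table only covers JPEG, PNG and GIF; it matches JPEG on its first three bytes so that both JFIF and Exif files pass.

```python
from typing import Optional

# Hypothetical lookup table: leading bytes -> format name.
# JPEG is matched on 3 bytes so JFIF (ff d8 ff e0) and Exif (ff d8 ff e1) both pass.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a format name if data starts with a known signature, else None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None
```

With a `requests` response, this could be called as `sniff_image_type(response.content)`, skipping the download whenever it returns `None`.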
I modified your sendRequest/downloadImage functions a little bit. You should be able to hardcode the magic numbers of more valid image types than just JPG. I finally tested the code and it is working (on my machine): only valid JPG images were saved. Note that I removed the stream=True flag, because the images are so small that you don't need a stream, and the saving becomes a little less cryptic. Take a look:
    import requests

    def sendRequest(url):
        try:
            page = requests.get(url)
        except Exception as e:
            print("error:", e)
            return False
        # check status code
        if (page.status_code != 200):
            return False
        return page

    def downloadImage(imageUrl: str, filePath: str):
        img = sendRequest(imageUrl)
        if (img == False):
            return False
        # JPEG/JFIF signature; JPEGs with another marker (e.g. Exif, ff d8 ff e1) are rejected
        if not img.content[:4] == b'\xff\xd8\xff\xe0':
            return False
        with open(filePath, "wb") as f:
            f.write(img.content)
        return True
You could also try to open the image using Pillow and BytesIO
    >>> from PIL import Image
    >>> from io import BytesIO
    >>> i = Image.open(BytesIO(img.content))
and see if it throws an error. But the first solution seems more lightweight, and you should not get any false positives there. You could also check for the string
in img.content and abort if it is found. This is very easy and probably very effective too.
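The string to look for is not preserved in the text above. Purely as an assumption, the sketch below treats the goal as catching HTML placeholder pages that come back with status 200, so it looks for common HTML markers near the start of the body. The names `HTML_MARKERS` and `looks_like_html` and the 512-byte probe size are illustrative choices, not part of the original answer.

```python
# Assumed markers for an HTML page; the original answer's exact string is unknown.
HTML_MARKERS = (b"<!doctype html", b"<html")


def looks_like_html(content: bytes, probe_size: int = 512) -> bool:
    """Return True if the first probe_size bytes contain an HTML marker."""
    # bytes.lower() only changes ASCII letters, so binary data is left intact.
    head = content[:probe_size].lower()
    return any(marker in head for marker in HTML_MARKERS)
```

In the download function this would sit next to the signature check, for example `if looks_like_html(img.content): return False`.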