Python script to grab all images in a webpage

For the latest source visit GitHub.


## Python 2 script: urllib2 and urlparse are Python 2 modules ##
import urllib2
import re
import os
from urlparse import urlsplit, urlparse
from posixpath import basename

## function that processes a url: drops one trailing space, if any, and replaces the remaining spaces with '%20' ##
def process_url(raw_url):
    if raw_url.endswith(' '):
        raw_url=raw_url[:-1]
    return raw_url.replace(' ','%20')

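url='' ## give the url here
parse_object=urlparse(url)
## name the output folder after the last segment of the url path
page_dir=basename(parse_object.path)
if not os.path.exists('images'):
    os.mkdir("images")
## os.mkdir raises OSError if this folder already exists (see the notes below)
os.mkdir("images/"+page_dir)
os.chdir("images/"+page_dir)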

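urlcontent=urllib2.urlopen(url).read()
## collect the src of every img tag that uses a double-quoted src attribute
imgurls=re.findall('img .*?src="(.*?)"',urlcontent)
for imgurl in imgurls:
    try:
        imgurl=process_url(imgurl)
        imgdata=urllib2.urlopen(imgurl).read()
        filename=basename(urlsplit(imgurl)[2])
        ## save the image under its original file name
        with open(filename,'wb') as output:
            output.write(imgdata)
    except Exception:
        ## skip images that cannot be downloaded or saved
        pass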
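
The script above targets Python 2. On Python 3 the `urllib2` and `urlparse` modules no longer exist (their contents live in `urllib.request` and `urllib.parse`), so here is a rough sketch of the two steps that need no network access: finding the image URLs and picking a file name. It reuses the same regular expression, which only matches double-quoted `src` attributes. The names `find_image_urls`, `filename_for` and the sample `page` string are made up for illustration and are not part of the original script.

```python
import re
from posixpath import basename
from urllib.parse import urlsplit

# Same pattern as the script: captures double-quoted src values of img tags.
IMG_SRC = re.compile(r'img .*?src="(.*?)"')

def find_image_urls(html):
    """Return every src value matched by the script's regular expression."""
    return IMG_SRC.findall(html)

def filename_for(imgurl):
    """Return the last path segment of the URL, ignoring any query string."""
    return basename(urlsplit(imgurl).path)

# Hypothetical page content, used only to exercise the two helpers.
page = '<p><img alt="logo" src="http://example.com/a/logo.png?v=2"></p>'
for src in find_image_urls(page):
    print(src, '->', filename_for(src))
```

Taking the file name from the parsed path rather than from the raw URL keeps query strings such as `?v=2` out of the saved file name.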

Limitations and Notes

  • Doesn’t parse relative URLs.
  • The script doesn’t iterate over pages; it just grabs all images from the given URL and writes them into a folder.
  • Make sure that there is an “images” folder in the directory you run Python from if you want to run the code as it is. This is updated in the source.
  • It doesn’t check whether the directory for a page is already present, so if you give the same URL twice it will bark at you. The simple solution is to delete the folder and run the code again.
  • This script can be tweaked to do some dirty stuff.
  • In future versions I will try to make it do some more automated stuff, overcoming the limitations.
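
As a minimal sketch of how the relative-URL and existing-directory limitations might be lifted, assuming Python 3: `urljoin` from `urllib.parse` resolves a relative `src` against the page URL, and `os.makedirs` with `exist_ok=True` tolerates a folder that is already there. The helper names `resolve_image_url` and `ensure_dir`, and the example URLs, are hypothetical and not taken from the script.

```python
import os
import tempfile
from urllib.parse import urljoin

def resolve_image_url(page_url, src):
    """Turn a relative src into an absolute URL; absolute ones pass through."""
    return urljoin(page_url, src)

def ensure_dir(path):
    """Create path (and any parents) if missing; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Hypothetical page URL, used only for illustration.
page_url = 'http://example.com/gallery/index.html'
print(resolve_image_url(page_url, 'img/cat.jpg'))
print(resolve_image_url(page_url, '/static/dog.png'))
print(resolve_image_url(page_url, 'http://cdn.example.org/x.gif'))

# A temporary directory stands in for the script's working directory.
target = os.path.join(tempfile.mkdtemp(), 'images', 'index.html')
ensure_dir(target)
ensure_dir(target)  # the second call is a no-op instead of an error
```

With helpers like these, each `src` would be passed through `resolve_image_url` before downloading, and running the script twice on the same URL would reuse the existing folder instead of failing.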

Thanks Goutham
