Saturday, February 26, 2011

XKCD Fetcher

Problem:  You want every XKCD in existence on your hard drive.  Solution: Pirate this script!

Instead of scraping the XKCD website page by page, you can use the nice JSON document xkcd provides for every comic.  The script below can produce two outputs. One is a nice list of comics that includes number, title, and alt-text, and looks like this:



  • 405 - Journal 3 - Oh, and, uh, if the Russian government asks, that submarine was always there.
  • 406 - Venting - P.P.S. I can kill you with my brain.
  • 407 - Cheap GPS - In lieu of mapping software, I once wrote a Perl program which, given a USB GPS receiver and a destination, printed 'LEFT' 'RIGHT' OR 'STRAIGHT' based on my heading.
  • 408 - Overqualified - To anyone I've taken on a terrible date, this is retroactively my cover story.
  • 409 - Electric Skateboard (Double Comic) - Unsafe vehicles, hills, and philosophy go hand in hand.
  • 410 - Math Paper - That's nothing. I once lost my genetics, rocketry, and stripping licenses in a single incident.
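Each of those entries comes straight from one of those per-comic JSON documents. Here is a minimal sketch of pulling a single comic's metadata with the same urllib2 approach the full script uses; comic 614 is just an arbitrary example, and num, title, img, and alt are the fields the script below relies on:

import json
import urllib2

# Fetch the metadata for one comic (614 is only an example number).
info = json.load(urllib2.urlopen("http://xkcd.com/614/info.0.json"))
print(info['num'])    # the comic number
print(info['title'])  # the comic title
print(info['img'])    # direct URL to the comic image
print(info['alt'])    # the alt/hover text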

The other output just prints a list of image URLs to stdout, which you can write to a file or pipe to wget.  To use this mode, invoke the program with the -w option.
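If you would rather skip wget entirely, here is a rough sketch of saving each image directly from Python. The save_image helper and the xkcd/ output directory are my own assumptions for illustration, not part of the script below:

import os
import urllib2

def save_image(url, directory="xkcd"):
    # Write the image bytes to xkcd/<original filename>.
    if not os.path.isdir(directory):
        os.makedirs(directory)
    filename = os.path.join(directory, url.split("/")[-1])
    with open(filename, "wb") as out:
        out.write(urllib2.urlopen(url).read())

# e.g. a URL taken from a comic's 'img' field (this one is just an example)
save_image("http://imgs.xkcd.com/comics/woodpecker.png")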

One caveat: the script only fetches comics up to (but not including) the second number in the range() call, so bump it before running. This script is licensed under an Apache/GPL/MIT/other OSI-approved license; just be sure to attribute me, thanks :) Remember that XKCD comics are CC BY-NC, so you are free to copy them to your heart's content.

#!/usr/bin/env python
# Copyright 2011 Joseph Lewis <joehms22 gmail com>
# MIT License/GPL 2+ License/Apache License

import json
import urllib2
import time
import sys

# Print bare image URLs (for wget) if -w is given, otherwise an HTML list.
WGET = False
try:
    if sys.argv[1] == "-w":
        WGET = True
except IndexError:
    pass

for i in range(1, 850):
    if i == 404:  # Comic 404 doesn't exist; xkcd.com serves an error 404 page.
        continue

    url = "http://xkcd.com/%i/info.0.json" % i
    comic_json = urllib2.urlopen(url)
    time.sleep(.2)  # Be polite to the server.
    comic = json.loads(comic_json.read())

    number = comic['num']
    title = comic['title']
    image = comic['img']
    alt = comic['alt']

    if WGET:  # Just list image URLs for wget :)
        print(image)
    else:
        print("<li><a href='%s'>%s - %s</a> - %s</li>\n" % (image, number, title, alt))
