Quickly testing websites using the check_site spider

As of a few minutes ago, this site is running the bleeding-edge django-mingus. A fair number of things have changed since the last release, so it's handy to be able to exercise the entire site quickly and make sure everything works through the whole stack, from WebFaction's front-end proxy down to the actual Django application. This provided a good excuse to plug one of the newest utilities in my webtoolbox:

check_site is a simple spider, based on an easily extensible Spider class, which walks an entire site and reports any errors it finds. The entire process looks something like this, assuming that you have virtualenv, virtualenvwrapper and pip available:

chris@Saturn:~/Development/webtoolbox $ git clone http://github.com/acdha/webtoolbox.git

Initialized empty Git repository in /private/tmp/webtoolbox/.git/

chris@Saturn:~/Development/webtoolbox $ mkvirtualenv webtoolbox

New python executable in webtoolbox/bin/python

Installing setuptools............done.

(webtoolbox)chris@Saturn:~/Development/webtoolbox $ cd webtoolbox/

(webtoolbox)chris@Saturn:~/Development/webtoolbox [git master] $ add2virtualenv .

(webtoolbox)chris@Saturn:~/Development/webtoolbox [git master] $ pip install -r requirements.pip

… time passes …

(webtoolbox)chris@Saturn:~/Development/webtoolbox [git master] $ ./bin/check_site.py http://chris.improbable.org/ --max-connections=2

[QASpider] [WARNING]: http://chris.improbable.org/2008/07/12/iphone-os-20-the-good-bad-and-very-ugly/: stripped 1 non-printable control characters

[QASpider] [WARNING]: http://chris.improbable.org/2009/02/3/in-which-the-gop-surrenders-any-pretense-of/: stripped 3 non-printable control characters

[QASpider] [WARNING]: http://chris.improbable.org/2008/04/17/dinosaur-meet-tar-pit/: stripped 1 non-printable control characters

[QASpider] [WARNING]: http://chris.improbable.org/2007/10/19/textmate-and-php-automatic-syntax-checking-when/: stripped 4 non-printable control characters

[QASpider] [WARNING]: http://chris.improbable.org/2007/07/4/efficiency/: stripped 2 non-printable control characters

[QASpider] [WARNING]: http://chris.improbable.org/2007/07/18/in-praise-of-simple-solutions/: stripped 4 non-printable control characters

Site Report for chris.improbable.org

Retrieved 271 URLs in 28.31 seconds with 0 errors

That's pretty easy, and HTML validation is also available. If you need custom checks, the core spider is simple and can easily be extended with whatever logic you want. In the meantime, it looks like I have to clean up some control codes which I imported from the legacy PHP code which used to run this site…
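To give a rough idea of what extending a spider with a custom check can look like, here is a minimal standalone sketch. It is not webtoolbox's actual Spider API: the names SketchSpider, check_page, ControlCharSpider and the fetch callable are all hypothetical, and pages come from an in-memory dict rather than the network so the example runs anywhere. The check counts control characters like the ones reported in the warnings above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class SketchSpider:
    """Hypothetical base class: walks same-host links breadth-first."""

    def __init__(self, fetch):
        self.fetch = fetch  # assumed callable: URL -> HTML text
        self.warnings = []

    def check_page(self, url, html):
        """Hook for subclasses; the default performs no checks."""

    def crawl(self, start_url):
        host = urlparse(start_url).netloc
        seen, queue = {start_url}, [start_url]
        while queue:
            url = queue.pop(0)
            html = self.fetch(url)
            self.check_page(url, html)
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                target = urljoin(url, href)
                # Stay on the starting host and visit each URL only once.
                if urlparse(target).netloc == host and target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen


class ControlCharSpider(SketchSpider):
    """Custom check: record pages containing control characters."""

    def check_page(self, url, html):
        count = len(CONTROL_CHARS.findall(html))
        if count:
            self.warnings.append((url, count))


# Usage with a made-up two-page site held in memory.
PAGES = {
    "http://example.org/": '<a href="/a/">A</a> <a href="http://other.example/">x</a>',
    "http://example.org/a/": 'bad\x0cbyte\x01 <a href="/">home</a>',
}
spider = ControlCharSpider(PAGES.__getitem__)
visited = spider.crawl("http://example.org/")
```

Keeping the check in an overridable method means each new rule is a small subclass, and passing fetch in as a callable lets the same crawl loop be pointed at a real HTTP client or at test fixtures.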


