Dec 24

Web Toolbox

I've collected some helpful tools for web developers

I've collected various website testing tools into a webtoolbox repository on GitHub. This includes my earlier red_spider work as well as a few other utilities which have come in handy:

red_spider

A spider based on Mark Nottingham's redbot: it will produce a nice HTML report of page cacheability and, optionally, HTML validation; since it uses the same nbhttp library it's pretty fast, too. There are a number of options for filtering and it allows you to save lists of page and media URLs for use with tools like wk-bench or tornado-bench.

log_replay

If you need to replay webserver log files at something approximating realtime, log_replay is your friend. It uses Tornado's non-blocking HTTP client (based on pycurl - at some point it would be good to refactor down to just that) to fetch all of the URLs but will sleep any time it gets too far ahead of the simulated virtual time.

tornado-bench

This program also uses Tornado's non-blocking HTTP client: it takes a big list of URLs and retrieves them as quickly as possible.

wk-bench

Mac OS X-specific tool which measures user-perceived page-load performance. It uses PyObjC to load a full WebKit browser, processes a list of URLs and reports the time taken from beginning the HTTP request until the browser fires the didFinishLoadForFrame event, which includes things like image loading, Flash, JavaScript, etc. This is also useful for reporting JavaScript errors as they are logged to the console and can very easily be extracted for verifying that you don't have on-load errors site-wide.

Dec 05

The view from our window

The first snow of the season finally arrived:

Now to see how much sweeter our kale gets after the frost:

Oct 16

Deploying Django Sites using RPM

How I've been deploying Django sites using RPM and virtualenv at work

Why?

Django is a great framework for developing websites but as with most projects there isn’t a particular focus on the system administration side of running a real site. There are great instructions describing the source-level changes you’ll want to make and what you’ll need to configure your webserver to do but … what about afterwards?

The process of deploying any site has a few basic steps: update the code, apply any database changes and reload the running site. How people choose to do this varies wildly but in the Python world it tends to involve a lot of manual work setting up the Python environment and running commands by hand or using a tool such as Fabric or Buildout to run those commands for you.

This approach works but it has a few drawbacks:

  • If you’re installing anything which uses a native library (images, databases, clients for things like memcache, etc.) you’ll need to install a ton of extra dependencies on your production servers: gcc, development headers, etc.
  • It’s extremely slow compared to normal Linux software installation
  • You’re exposed to failures in outside services such as PyPI, Sourceforge, Github, Bitbucket, etc. This can be a problem if you can’t download the package and disastrous if the upstream source has updated to a newer version which you haven’t tested
  • Adding a second server requires you to duplicate everything, which is time consuming, and you then have to apply your changes in lockstep across every server to avoid requests being processed differently depending on which server handles them
  • It’s hard to tell in advance what you’ll need to install an application unless you tediously compare the installed dependencies on a known-working server
  • You can have silent failures which only show up in production - the classic example being the Python Imaging Library, which can install “successfully” without optional components like JPEG support if it fails to detect them, which will only show up the first time someone attempts to use a JPEG file with your site.

None of these are new problems - in fact, the BSD and Linux communities have been working on package managers for the last decade or two, which is why you can install a brand new Linux system with hundreds of applications in less time than it takes to bring a large Python website up on a new server.

How

UPDATE: This has turned into a GitHub project and the latest version of these instructions is on GitHub Pages.

Structuring the application

A well-behaved application is going to do a few things:

  • Install in a well-defined location not used by other applications - i.e. /opt/my_app rather than /var/www/html.
  • Provide a clean way to customize your app with server-specific settings (e.g. file locations, database info, etc.) which doesn’t involve editing the packaged files - otherwise they’ll be overwritten on the next release. This also makes it easy to cleanly install the same package on development, test, and production servers.

For Django apps, I’m using the following conventions:

  • Everything installs in /opt/my_app - this includes a virtualenv which is pre-loaded with our dependencies, avoiding the possibility of conflicts between projects. Want to have a Django 1.1 app installed on the same server as an older Django 1.0 app? This makes that easy.
  • Apache configuration is split into a separate common configuration (e.g. WSGI config, media expiration, etc) designed to be included by a server-specific file which specifies things like hostnames, SSL config, etc.:
    <VirtualHost example.org:80>
      ServerName example.org

      ErrorLog logs/my_app.errors.log
      CustomLog logs/my_app.access.log combined
      LogLevel info

      SSLEngine on
      SSLProtocol all -SSLv2
      SSLCipherSuite ALL:!ADH:!EXPORT56:+HIGH:+SSLv3:+TLSv1
      SSLCertificateFile "/etc/httpd/ssl.crt/my_app.crt"
      SSLCertificateKeyFile "/etc/httpd/ssl.key/my_app.key"

      Include /opt/my_app/deploy/apache_common.conf
    </VirtualHost>
      
  • Django customization is managed by a local_settings.py file which is imported by settings.py.
    import logging

    try:
      # Pull the server-specific overrides into this module's namespace:
      from local_settings import *
    except ImportError:
      logging.warning("No local settings - running in development mode")
      
    This is where you put things like database username and password, production email contact addresses, etc.

Building an RPM

  1. Set up your RPM build environment
  2. Create your specfile (see below) in SPECS/my_site.spec
  3. Create source archives for everything you need to install: this can be as simple as downloading a tarfile from the library provider or creating your own from your version control system:
    • Git archive:
      git archive --format=tar --prefix=my_app-1.0/ my_app-1.0 | gzip -9 > ~/rpmbuild/SOURCES/my_app-1.0.tar.gz
    • Subversion:
      svn export . /tmp/my_package
      tar -C /tmp -cjf ~/rpmbuild/SOURCES/my_app.tar.bz2 my_package
  4. Now you’re ready to compile the actual RPM:
      rpmbuild -ba --clean SPECS/my_site.spec
  5. Install the RPM on your test server

    If you want to see what files your RPM will install, use RPM’s query options: rpm -q --fileprovide -p RPMS/noarch/my_site.rpm

    For future releases the process is simple: update the specfile if you’ve changed your dependencies (add, remove, change versions, etc.) and recompile.

    Here’s an example project containing an RPM specfile and the general recommended site structure. There are a few key things you will want to customize:

    • Set dependencies for any libraries which you need, particularly if there are version requirements - that way RPM won’t allow you to install the site if the installed Postgres client is older than you want.
    • The %post and %install sections can contain arbitrary shell scripts, allowing you to do things like run `manage.py syncdb`, push updated schema to Solr, etc. If this gets too complicated I recommend writing a Python program or Django management command which does the actual work.

    Jul 25

    Site testing using RED Spider

    Mark Nottingham recently released redbot, a modern replacement for the classic cacheability tester. I've been using it at work to audit website performance before releases since proper HTTP caching makes an enormous difference in perceived site performance.

    redbot is a focused tool and provides a great deal of detail about at most one page and, optionally, its resources. I wanted to expand the scope to testing an entire site and performing content validation and with a little work came up with red_spider.py, which produces a consolidated report like this.

    I have a few ideas for the future. As it acquires more validation capabilities the code should probably be split into a separate project rather than remaining a fork of redbot. Candidates include borrowing from something like collective.validator.css to validate CSS, RSS/Atom, etc., using PIL to verify that images don't have things like wasteful embedded thumbnails, and borrowing from my wk-bench experiment to load pages using WebKit and report JavaScript errors.

    Apr 03

    Even friendlier shell prompts for version control

    Including version control status in your Bash prompt, now for SVN, CVS, Mercurial and Git

    I've extended the earlier VCS-friendly shell prompt to add support for Mercurial and you can now get my current .bash_profile from GitHub:

    __has_parent_dir() {
      # Utility function so we can test for things like .git/.hg without firing
      # up a separate process
      test -d "$1" && return 0;

      current="."
      while [ ! "$current" -ef "$current/.." ]; do
        if [ -d "$current/$1" ]; then
          return 0;
        fi
        current="$current/..";
      done

      return 1;
    }

    __vcs_name() {
      if [ -d .svn ]; then
        echo " [svn]";
      elif [ -d RCS ]; then
        echo " [RCS]";
      elif __has_parent_dir ".git"; then
        echo "$(__git_ps1 ' [git %s]')";
      elif __has_parent_dir ".hg"; then
        echo " [hg $(hg branch)]"
      fi
    }

    PS1='\[\033]0;\u@\h:\w\007\]\u@\h:\w$(__vcs_name) $ '

    Apr 03

    Cleaning up the web with jQuery and a little help from Google

    My guide to creating jQuery-based bookmarklets

    Recently the topic of enhancing web pages came up at work. It's a lot easier than it used to be thanks to two trends: the rise of modern JavaScript libraries and public CDNs hosting those libraries. This makes it much simpler to enhance content which you can't easily alter (e.g. the forms used by various big companies with marginal web competency) or in situations where you're worried about compatibility with existing code (some squirrelly vertical apps in our case).

    Updated 2009-04-03: Moved the template and example scripts to Gist for ease of copying/maintenance: bookmarklet-template.js, enable-autocomplete.js and resize-textareas.js

    Updated 2008-10-14: there's a very similar jQuery-lovefest on Sam Ruby's weblog with plenty of useful tips.

    To illustrate just how little code this can require, here's an example which uses jQuery to install a function which sanitizes input (we have a legacy app which chokes on smart quotes, and people paste text in from Word), copies the submit buttons from the bottom of the form to the top and adds a graphical datepicker for every date field on the page:

    jQuery(":text,textarea").bind("change", sanitizer);
    jQuery("form").bind("submit", function () {
      jQuery(":text,textarea").each(sanitizer);
    });

    var submit_buttons = jQuery('input[type="submit"]');
    submit_buttons.parent().clone(true).prependTo(
      submit_buttons.parents().filter('form'));

    jQuery('input[id*="DATE"]').datepicker();

    That's the complete, ready-to-go, “even works with crotchety old Internet Explorer” guts of the code (the take-home lesson is that jQuery is awesome for busy developers). The downside is that this requires a little bit of work: you need to have jQuery (and possibly dependencies like the UI plugin I used above) available and you need to jump through some hoops to load jQuery into an existing page efficiently and without conflicts.
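The sanitizer function used above isn't shown in this post. Here is a minimal sketch of what it might look like, assuming the goal is simply to swap Word's typographic characters for plain ASCII. The names sanitizeText and SMART_CHARS are hypothetical, and the dash and ellipsis mappings are an assumption that goes beyond the smart quotes mentioned above.

```javascript
// Hypothetical mapping of Word-style typographic characters to ASCII.
var SMART_CHARS = {
  "\u2018": "'", "\u2019": "'",   // curly single quotes
  "\u201C": '"', "\u201D": '"',   // curly double quotes
  "\u2013": "-", "\u2014": "-",   // en and em dashes (assumed unwanted too)
  "\u2026": "..."                 // horizontal ellipsis (assumed unwanted too)
};

// Pure helper: returns a copy of text with the characters above replaced.
function sanitizeText(text) {
  return String(text).replace(/[\u2018\u2019\u201C\u201D\u2013\u2014\u2026]/g, function (ch) {
    return SMART_CHARS[ch];
  });
}

// Handler in the shape jQuery's bind() and each() expect: "this" is the field.
function sanitizer() {
  this.value = sanitizeText(this.value);
}
```

Keeping the string logic in a separate pure function makes it easy to check outside a browser; only the thin sanitizer wrapper depends on being called with a form field as "this".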

    Didn't we used to pay for hosting?

    One drawback to all of this is that you need somewhere to host your external libraries since you can't fit the core jQuery into a URL, much less UI components or the less svelte libraries. This meant setting up a server, getting an SSL certificate if you need to work on HTTPS sites, etc. Not that much work but it's now a lot easier and quite noticeably faster because Google makes it trivial to get the popular AJAX libraries from their CDN.

    Developing with Bookmarklets

    The deployment scenario for the major projects where I've used these techniques is a situation where you have some limited access to the page source: perhaps inserting a single script tag into a template or using something like MonkeyGrease or an Apache proxy with mod_substitute to rewrite the generated HTML as it passes through. This is great for making minimal changes but a bit cumbersome to develop and test with, particularly if you need to work on a production site or your instructions begin something like “Go change your browser's proxy settings…”

    If I was only working in Firefox I could use GreaseMonkey but I need to test in Safari and Internet Explorer, too. The portable solution is a simple bookmarklet. I use a simple template (bookmarklet-template.js) which loads jQuery from the Google CDN and, after everything is ready to go, runs either a simple function or the external script of my choosing. This makes it easy to prepare an injector bookmarklet which can be used to pull my code into the current page, after which I can run and debug it using Firebug.

    Useful Examples

    This is also a useful technique for fixing other people's pages. Here are two bookmarklets and the commented source for tools which I use often:

    1. Enable autocomplete - changes autocomplete="off" to on throughout the page (Source: enable-autocomplete.js)
    2. Resizable Textareas - makes all textareas resizable (à la Safari 3) using jQuery UI Resizable (Source: resize-textareas.js)

    I keep both of these in my Firefox & IE bookmark toolbar since they come in handy throughout the day, and I've created more any time I find myself regularly needing to deal with a cranky legacy site. The process is simple: copy bookmarklet-template.js, add the code which does whatever fixups the target page needs, run the entire thing through JSLint and, finally, paste it into Ted Mielczarek's very handy Bookmarklet Crunchinator.
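The last step, turning a script into something you can paste into a bookmark, can be approximated by hand. This is a naive sketch and not how the Bookmarklet Crunchinator works: the function name toBookmarklet is hypothetical, and it does no minification. It only wraps the source in an anonymous function and URL-encodes the result.

```javascript
// Hypothetical helper: wrap JavaScript source so it can be used as a bookmark URL.
function toBookmarklet(source) {
  // The anonymous function keeps the bookmarklet's variables out of the page's scope.
  // The newline before the closing brace stops a trailing // comment from swallowing it.
  var wrapped = "(function(){" + source + "\n})();";
  // Encoding keeps newlines (as %0A), quotes and spaces intact inside the URL.
  return "javascript:" + encodeURIComponent(wrapped);
}

var url = toBookmarklet('alert("Hello from a bookmarklet");');
```

The resulting string can be pasted into the location field of a new bookmark.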

    Good Code Injection Practices

    Use Anonymous functions

    What's the difference between this bit of code and the first example above?

    (function () {
      jQuery(":text,textarea").bind("change", sanitizer);
      jQuery("form").bind("submit", function () {
        jQuery(":text,textarea").each(sanitizer);
      });

      var submit_buttons = jQuery('input[type="submit"]');
      submit_buttons.parent().clone(true).prependTo(
        submit_buttons.parents().filter('form'));

      jQuery('input[id*="DATE"]').datepicker();
    })();

    It looks almost identical but there's a key difference: this code is inside an anonymous function, so all of my variables are local to the function itself. They won't be visible to other JavaScript on the page and I don't have to worry about conflicting variable or function names. Note that this is only true for variables declared using "var" - if you leave that out or do something like window.foo you can still touch the rest of the page if you need to - for example, replacing the broken validation logic on Comcast's forms.
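A small sketch of that scoping rule, using made-up names (injectedCounter, sharedWithPage) and a globalObject variable which stands in for window so that the snippet also runs outside a browser:

```javascript
// window in a browser; globalThis elsewhere (e.g. when trying this in Node).
var globalObject = typeof window !== "undefined" ? window : globalThis;

(function () {
  var injectedCounter = 1;             // declared with var: local to this function
  globalObject.sharedWithPage = "hi";  // explicit property: visible to the page
})();

var leaked = typeof injectedCounter !== "undefined";   // false: the var stayed private
var shared = globalObject.sharedWithPage;              // "hi": deliberately exported
```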

    Reliably detecting when external code has loaded

    When jQuery has loaded, it's easy to say "Load this .js file and run this function when it's ready" - here's how the text-area resizer works:

    jQuery.getScript(document.location.protocol + "//ajax.googleapis.com/ajax/libs/jqueryui/1.5.2/jquery-ui.js", function () {
      jQuery("textarea").resizable();
    });

    Loading jQuery itself requires you to do this the hard way: generate a script tag on the fly, insert it into the document and listen for the load events to tell when it's safe to run code which depends on the library you're loading. This is easy for Safari, Firefox, etc. which support the standard W3C DOM addEventListener: simply run your code after the script tag fires a "load" event. It's not that simple for Internet Explorer: in theory attachEvent("onload") would be equivalent but load events are quite unreliable for script tags in IE, so we need to use an onreadystatechange handler as seen below and check for either of the two readyState values which signal that the script is available:

    var s = document.createElement('script');
    s.type = "text/javascript";
    s.setAttribute('src', document.location.protocol + '//ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js');

    if (s.addEventListener) {
      s.addEventListener("load", loader, false);
    } else if ("onreadystatechange" in s) {
      s.onreadystatechange = function () {
        if (this.readyState == 'complete' || this.readyState == 'loaded') {
          loader();
        }
      };
    } else {
      // Chances are if your browser is this old jQuery won't even work but just in case.
      // Pass the function itself: writing loader() here would run it immediately.
      window.setTimeout(loader, 2500);
    }

    document.getElementsByTagName('head')[0].appendChild(s);

    It's conceivable that a buggy browser could fire the same event twice in an unusual scenario. If you have any sort of user-driven or timer-based code, you'll want to prevent your payload from being run multiple times using a guard like this, which lets the function check whether it has already executed without the more common approach of relying on a global variable. Besides cleanliness, this helps if you might inject multiple things onto a page and don't want to rely on a global variable naming convention to prevent chaos:

    // Avoid executing this function twice:
    if (arguments.callee._executed) return;
    arguments.callee._executed = true;
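arguments.callee is unavailable in strict-mode JavaScript, so the same guard can instead be kept in a closure. This is a sketch under that assumption; once and payload are hypothetical names and not part of the bookmarklet template:

```javascript
// Wrap fn so that only the first call runs it; later calls return the first result.
function once(fn) {
  var executed = false;
  var result;
  return function () {
    if (executed) { return result; }
    executed = true;
    result = fn.apply(this, arguments);
    return result;
  };
}

var runs = 0;
var payload = once(function () { runs += 1; return "done"; });

payload();   // runs the wrapped function
payload();   // skipped: the guard has already been set
```

The wrapped function can be handed to addEventListener, onreadystatechange and setTimeout alike, and the flag stays private to the wrapper rather than living on the function object or in a global.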

    Avoid HTTP/HTTPS conflicts

    If you're injecting code into pages which may or may not use SSL, you have a problem: if you hard-code a URL in your code and the protocol doesn't match, you'll either incur the extra overhead of starting an SSL session by using https even when you don't need to (which isn't a major problem) or trigger Internet Explorer's popular mixed-content security warning. This is easy to avoid by using the current page's protocol for your scripts as long as you're using a server which can handle either protocol (Google's CDN does; Yahoo's does not):

    document.location.protocol+'//path.to.example.com/something.js'
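One caveat: on a page opened from disk the protocol is file:, which won't reach a remote server. Here is a hedged sketch of a helper which falls back to http in that case; urlForPage is a hypothetical name and the fallback choice is an assumption, not something the template does:

```javascript
// Build a script URL matching the page's protocol; loc is a location-like object.
function urlForPage(loc, hostAndPath) {
  // Anything other than https (http:, file:, etc.) falls back to plain http.
  var protocol = loc.protocol === "https:" ? "https:" : "http:";
  return protocol + "//" + hostAndPath;
}

var secure = urlForPage({ protocol: "https:" }, "path.to.example.com/something.js");
var local = urlForPage({ protocol: "file:" }, "path.to.example.com/something.js");
```

In a page you would pass document.location as the first argument.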