Web Toolbox
red_spider
A spider based on Mark Nottingham's redbot: it will produce a nice HTML report of page cacheability and, optionally, HTML validation; since it uses the same nbhttp library it's pretty fast, too. There are a number of options for filtering and it allows you to save lists of page and media URLs for use with tools like wk-bench or tornado-bench.
log_replay
If you need to replay webserver log files at something approximating realtime, log_replay is your friend. It uses Tornado's non-blocking HTTP client (based on pycurl - at some point it would be good to refactor down to just that) to fetch all of the URLs but will sleep any time it gets too far ahead of the simulated time.
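The pacing rule described above (sleep whenever the replay gets too far ahead of the simulated clock) comes down to a little arithmetic. The tool itself is Python; the sketch below is written in JavaScript only because that is the language of the examples later on this page. The function name, its parameters and the `maxLead` threshold are all assumptions made for illustration, not taken from log_replay.

```javascript
// Hypothetical helper: how long should the replayer sleep before sending
// the next request? All values are in milliseconds.
//   entryTime      - timestamp of the log entry about to be replayed
//   firstEntryTime - timestamp of the first entry in the log
//   elapsed        - wall-clock time since the replay started
//   maxLead        - how far ahead of the simulated clock we may run (assumed knob)
function replayDelay(entryTime, firstEntryTime, elapsed, maxLead) {
    // Where this entry falls on the simulated timeline
    var offset = entryTime - firstEntryTime;
    // Positive when the replay is ahead of the simulated clock
    var lead = offset - elapsed;
    // Sleep only for the part of the lead which exceeds the allowance
    return lead > maxLead ? lead - maxLead : 0;
}
```

A replayer that is running behind the log gets a delay of zero, so it simply fetches as fast as it can until it catches up.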
tornado-bench
This program also uses Tornado's non-blocking HTTP client: it simply takes a big list of URLs and retrieves them as quickly as possible.
wk-bench
Mac OS X-specific tool which measures user-perceived page-load performance. It uses PyObjC to load a full WebKit browser, processes a list of URLs and reports the time taken from beginning the HTTP request until the browser fires the didFinishLoadForFrame event, which includes things like image loading, Flash, JavaScript, etc. This is also useful for reporting JavaScript errors as they are logged to the console and can very easily be extracted for verifying that you don't have on-load errors site-wide.
The view from our window
Deploying Django Sites using RPM
Why?
Django is a great framework for developing websites but as with most projects there isn’t a particular focus on the system administration side of running a real site. There are great instructions describing the source-level changes you’ll want to make and what you’ll need to configure your webserver to do but … what about afterwards?
The process of deploying any site has a few basic steps: update the code, apply any database changes and reload the running site. How people choose to do this varies wildly but in the Python world it tends to involve a lot of manual work setting up the Python environment and running commands by hand or using a tool such as Fabric or Buildout to run those commands for you.
This approach works but it has a few drawbacks:
- If you’re installing anything which uses a native library (images, databases, clients for things like memcache, etc.) you’ll need to install a ton of extra dependencies on your production servers: gcc, development headers, etc.
- It’s extremely slow compared to normal Linux software installation
- You’re exposed to failures in outside services such as PyPI, Sourceforge, Github, Bitbucket, etc. This can be a problem if you can’t download the package and disastrous if the upstream source has updated to a newer version which you haven’t tested
- Adding a second server requires you to duplicate everything, which is time consuming, and you then have to apply your changes in lockstep across every server to avoid requests being processed differently depending on which server handles them
- It’s hard to tell in advance what you’ll need to install an application unless you tediously compare the installed dependencies on a known-working server
- You can have silent failures which only show up in production - the classic example being the Python Imaging Library, which can install “successfully” without optional components like JPEG support if it fails to detect them, which will only show up the first time someone attempts to use a JPEG file with your site.
None of these are new problems - in fact, the BSD and Linux communities have been working on package managers for the last decade or two, which is why you can install a brand new Linux system with hundreds of applications in less time than it takes to bring a large Python website up on a new server.
How
UPDATE: This has turned into a Github project and the latest version of these instructions is on Github pages.
Structuring the application
A well-behaved application is going to do a few things:
- Install in a well-defined location not used by other applications - i.e. /opt/my_app rather than /var/www/html.
- Provide a clean way to customize your app with server-specific settings (e.g. file locations, database info, etc.) which doesn’t involve editing the packaged files - otherwise they’ll be overwritten on the next release. This also makes it easy to cleanly install the same package on development, test, and production servers.
For Django apps, I’m using the following conventions:
- Everything installs in /opt/my_app - this includes a virtualenv which is pre-loaded with our dependencies, avoiding the possibility of conflicts between projects. Want to have a Django 1.1 app installed on the same server as an older Django 1.0 app? This makes that easy.
- Apache configuration is split into a separate common configuration (e.g. WSGI config, media expiration, etc) designed to be included by a server-specific file which specifies things like hostnames, SSL config, etc.:

    <VirtualHost example.org:80>
        ServerName example.org

        ErrorLog logs/my_app.errors.log
        CustomLog logs/my_app.access.log combined
        LogLevel info

        SSLEngine on
        SSLProtocol all -SSLv2
        SSLCipherSuite ALL:!ADH:!EXPORT56:+HIGH:+SSLv3:+TLSv1
        SSLCertificateFile "/etc/httpd/ssl.crt/my_app.crt"
        SSLCertificateKeyFile "/etc/httpd/ssl.key/my_app.key"

        Include /opt/my_app/deploy/apache_common.conf
    </VirtualHost>

- Django customization is managed by a local_settings.py file which is imported by settings.py:

    import logging

    try:
        # Pull the server-specific overrides into the settings namespace
        from local_settings import *
    except ImportError:
        logging.warning("No local settings - running in development mode")

  This is where you put things like database username and password, production email contact addresses, etc.
Building an RPM
- Set up your RPM build environment
- Create your specfile (see below) in SPECS/my_site.spec
- Create source archives for everything you need to install: this can be as simple as downloading a tarfile from the library provider or creating your own from your version control system:
  - Git archive:
      git archive --format=tar --prefix=my_app-1.0/ my_app-1.0 | gzip -9 > ~/rpmbuild/SOURCES/my_app-1.0.tar.gz
  - Subversion:
      svn export . /tmp/my_package
      tar -C /tmp -cjf ~/rpmbuild/SOURCES/my_app.tar.bz2 my_package
- Now you’re ready to compile the actual RPM:
    rpmbuild -ba --clean SPECS/my_site.spec
- Install the RPM on your test server
If you want to see what files your RPM will install, use RPM’s query options: rpm -q --fileprovide -p RPMS/noarch/my_site.rpm
For future releases the process is simple: update the specfile if you’ve changed your dependencies (add, remove, change versions, etc.) and recompile.
Here’s an example project containing an RPM specfile and the general recommended site structure. There are a few key things you will want to customize:
- Set dependencies for any libraries which you need, particularly if there are version requirements - that way RPM won’t allow you to install the site if the installed Postgres client is older than you want.
- The %post and %install sections can contain arbitrary shell scripts, allowing you to do things like run `manage.py syncdb`, push updated schema to Solr, etc. If this gets too complicated I recommend writing a Python program or Django management command which does the actual work.
Site testing using RED Spider
Mark Nottingham recently released redbot, a modern replacement for the classic cacheability tester. I've been using it at work to audit website performance before releases since proper HTTP caching makes an enormous difference in perceived site performance.
redbot is a focused tool and provides a great deal of detail about at most one page and, optionally, its resources. I wanted to expand the scope to testing an entire site and performing content validation and with a little work came up with red_spider.py, which produces a consolidated report like this.
I have a few ideas for the future, which should involve splitting the code into a separate project rather than a fork of redbot as it acquires more validation capabilities. These include borrowing from something like collective.validator.css to validate CSS, RSS/Atom, etc.; using PIL to verify that images don't have things like wasteful embedded thumbnails; and borrowing from my wk-bench experiment to load pages using WebKit and report JavaScript errors.
Even friendlier shell prompts for version control
I've extended the earlier VCS-friendly shell prompt to add support for Mercurial and you can now get my current .bash_profile from GitHub:
    __has_parent_dir() {
        # Utility function so we can test for things like .git/.hg without firing
        # up a separate process
        test -d "$1" && return 0;

        current="."
        while [ ! "$current" -ef "$current/.." ]; do
            if [ -d "$current/$1" ]; then
                return 0;
            fi
            current="$current/..";
        done

        return 1;
    }

    __vcs_name() {
        if [ -d .svn ]; then
            echo " [svn]";
        elif [ -d RCS ]; then
            echo " [RCS]";
        elif __has_parent_dir ".git"; then
            echo "$(__git_ps1 ' [git %s]')";
        elif __has_parent_dir ".hg"; then
            echo " [hg $(hg branch)]"
        fi
    }

    PS1='\[\033]0;\u@\h:\w\007\]\u@\h:\w$(__vcs_name) $ '
Cleaning up the web with jQuery and a little help from Google
Recently the topic of enhancing web pages came up at work. It's a lot easier than it used to be thanks to two trends: the rise of modern JavaScript libraries and public CDNs hosting those libraries. This makes it a lot easier to enhance content which you can't easily alter (e.g. the forms used by various big companies with marginal web competency) or in situations where you're worried about compatibility with existing code (some squirrelly vertical apps in our case).
Updated 2009-04-03: Moved the template and example scripts to Gist for ease of copying/maintenance: bookmarklet-template.js, enable-autocomplete.js and resize-textareas.js
Updated 2008-10-14: there's a very similar jQuery-lovefest on Sam Ruby's weblog with plenty of useful tips.
To illustrate just how little code this can require, here's an example which uses jQuery to install a function which sanitizes input (we have a legacy app which chokes on smart-quotes, and people paste text in from Word), copies the submit buttons from the bottom of the form to the top and adds a graphical datepicker for every date field on the page:
    jQuery(":text,textarea").bind("change", sanitizer);
    jQuery("form").bind("submit", function () {
        jQuery(":text,textarea").each(sanitizer);
    });

    var submit_buttons = jQuery('input[type="submit"]');
    submit_buttons.parent().clone(true).prependTo(submit_buttons.parents().filter('form'));

    jQuery('input[id*="DATE"]').datepicker();
That's the complete, ready-to-go, “even works with crotchety old Internet Explorer” guts of the code (the take-home lesson is that jQuery is awesome for busy developers). The downside is that this requires a little bit of work: you need to have jQuery (and possibly dependencies like the UI plugin I used above) available and you need to jump through some hoops to load jQuery into an existing page efficiently and without conflicts.
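The sanitizer function referenced in the example is not shown in the post. As a rough sketch of what such a function might look like, assuming the goal is only to swap Word-style "smart" punctuation for plain ASCII, something like the following would do. The names `SMART_PUNCTUATION` and `sanitizeText` and the exact list of characters are illustrative assumptions, not the original code.

```javascript
// Hypothetical mapping from Word-style punctuation to ASCII equivalents
var SMART_PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   // curly single quotes
    "\u201C": '"', "\u201D": '"',   // curly double quotes
    "\u2013": "-", "\u2014": "-",   // en and em dashes
    "\u2026": "..."                 // horizontal ellipsis
};

// Pure string-to-string cleanup, easy to test without a browser
function sanitizeText(text) {
    return text.replace(/[\u2018\u2019\u201C\u201D\u2013\u2014\u2026]/g, function (ch) {
        return SMART_PUNCTUATION[ch];
    });
}

// jQuery sets `this` to the DOM element both in event handlers and in
// each() callbacks, so one function can serve both uses.
function sanitizer() {
    this.value = sanitizeText(this.value);
}
```

Keeping the string cleanup separate from the DOM wrapper means the mapping can be extended (or replaced with whatever the legacy app actually chokes on) without touching the event wiring.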
Didn't we used to pay for hosting?
One drawback to all of this is that you need somewhere to host your external libraries since you can't fit the core jQuery into a URL, much less UI components or the less svelte libraries. This meant setting up a server, getting an SSL certificate if you need to work on HTTPS sites, etc. Not that much work but it's now a lot easier and quite noticeably faster because Google makes it trivial to get the popular AJAX libraries from their CDN.
Developing with Bookmarklets
The deployment scenario for the major projects where I've used these techniques is a situation where you have some limited access to the page source: perhaps inserting a single script tag into a template or using something like MonkeyGrease or an Apache proxy with mod_substitute to rewrite the generated HTML as it passes through. This is great for making minimal changes but a bit cumbersome to develop and test with, particularly if you need to work on a production site or your instructions begin something like “Go change your browser's proxy settings…”
If I were only working in Firefox I could use GreaseMonkey but I need to test in Safari and Internet Explorer, too. The portable solution is a simple bookmarklet. I use a simple template (bookmarklet-template.js) which loads jQuery from the Google CDN and, after everything is ready to go, runs either a simple function or the external script of my choosing. This makes it easy to prepare an injector bookmarklet which can be used to pull my code into the current page, after which I can run and debug it using Firebug.
Useful Examples
This is also a useful technique for fixing other people's pages. Here are two bookmarklets and the commented source for tools which I use often:
- Enable autocomplete - changes autocomplete="off" to on throughout the page (Source: enable-autocomplete.js)
- Resizable Textareas - makes all textareas resizable (à la Safari 3) using jQuery UI Resizable (Source: resize-textareas.js)
I keep both of these in my Firefox & IE bookmark toolbar since they come in handy throughout the day and I've created more any time I find myself regularly needing to deal with a cranky legacy site. The process is simple: copy bookmarklet-template.js, add the code which does whatever fixups the target page needs, run the entire thing through JSLint and, finally, paste it into Ted Mielczarek's very handy Bookmarklet Crunchinator.
Good Code Injection Practices
Use Anonymous functions
What's the difference between this bit of code and the first example above?
    (function () {
        jQuery(":text,textarea").bind("change", sanitizer);
        jQuery("form").bind("submit", function () {
            jQuery(":text,textarea").each(sanitizer);
        });

        var submit_buttons = jQuery('input[type="submit"]');
        submit_buttons.parent().clone(true).prependTo(submit_buttons.parents().filter('form'));

        jQuery('input[id*="DATE"]').datepicker();
    })();
It looks almost identical but there's a key difference: this code is inside an anonymous function and that means that all of my variables are local to the function itself, which means that they won't be visible to other JavaScript on the page and I don't have to worry about conflicting variable or function names. Note that this is only true for variables declared using "var" - if you leave that out or do something like window.foo you can still touch the rest of the page if you need to - for example, replacing the broken validation logic on Comcast's forms.
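The isolation described above is easy to see in a stripped-down sketch with no jQuery involved. The variable and function names here (`counter`, `result`, `helper`) are made up purely for the demonstration.

```javascript
// Pretend this variable belongs to the host page
var counter = 0;

var result = (function () {
    // Declared with `var`, so this is a separate, function-local variable
    // which merely shadows the page's `counter`.
    var counter = 100;

    // Function declarations inside the wrapper are local, too
    function helper(n) {
        return n + 1;
    }

    counter = helper(counter);
    return counter;
})();

// The page's `counter` is untouched, and `helper` does not exist out here
var pageCounterUnchanged = (counter === 0);
var helperIsHidden = (typeof helper === "undefined");
```

Only the value explicitly returned from the wrapper (here `result`) escapes, which is what makes the pattern safe to drop into a page whose existing variable names are unknown.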
Reliably detecting when external code has loaded
When jQuery has loaded, it's easy to say "Load this .js file and run this function when it's ready" - here's how the text-area resizer works:
    jQuery.getScript(document.location.protocol + "//ajax.googleapis.com/ajax/libs/jqueryui/1.5.2/jquery-ui.js", function () {
        jQuery("textarea").resizable();
    });
Loading jQuery itself requires you to do this the hard way: generate a script tag on the fly, insert it into the document and listen for the load events to tell when it's safe to run code which depends on the library you're loading. This is easy for Safari, Firefox, etc. which support the standard W3C DOM addEventListener: simply run your code after the script tag fires a "load" event. Unfortunately, it's not that simple for Internet Explorer: in theory attachEvent("onload") would be equivalent but unfortunately load events are quite unreliable for script tags with IE and so we need to use an onReadyStateChange handler as seen below and check for either of two events which may be fired:
    var s = document.createElement('script');
    s.type = "text/javascript";
    s.setAttribute('src', document.location.protocol + '//ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js');

    if (s.addEventListener) {
        s.addEventListener("load", loader, false);
    } else if ("onreadystatechange" in s) {
        s.onreadystatechange = function () {
            if (this.readyState == 'complete' || this.readyState == 'loaded') {
                loader();
            }
        };
    } else {
        // Chances are if your browser is this old jQuery won't even work but just in case:
        window.setTimeout(loader, 2500);
    }

    document.getElementsByTagName('head')[0].appendChild(s);
It's conceivable that a buggy browser could fire the same event twice in an unusual scenario and if you have any sort of user-driven or timer-based code, you'll want to prevent your payload from being run multiple times using a guard like this which allows the function to check whether it has executed before without using the more common approach of relying on a global variable. Besides cleanliness, this also makes it easy if you might inject multiple things onto a page and don't want to have to rely only on a global variable naming convention to prevent chaos:
    // Avoid executing this function twice:
    if (arguments.callee._executed) return;
    arguments.callee._executed = true;
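One caveat with the guard above: `arguments.callee` throws a TypeError in strict-mode code, so it cannot be used in scripts which begin with "use strict". A minimal sketch of an alternative which keeps the same "no global flag" property is to hold the flag in a closure instead. This is a swapped-in technique rather than the approach used in the template, and `once` is a hypothetical helper name.

```javascript
// Wrap a function so that only the first call does any work; later calls
// return the first call's result without running the payload again.
function once(fn) {
    var executed = false;   // private to this wrapper, invisible to the page
    var result;
    return function () {
        if (executed) {
            return result;
        }
        executed = true;
        result = fn.apply(this, arguments);
        return result;
    };
}

// Example: a payload which must not run twice even if the load event
// fires more than once.
var runs = 0;
var loader = once(function () {
    runs += 1;
    return "loaded";
});

loader();
loader();   // ignored: `runs` is still 1
```

Each wrapped function gets its own flag, so several injected payloads can coexist on a page without agreeing on a naming convention.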
Avoid HTTP/HTTPS conflicts
If you're injecting code into pages which may or may not use SSL, you have a problem: if you hard-code a URL in your code and the protocol doesn't match, you'll either incur the extra overhead of starting an SSL session by using https even when you don't need to (which isn't a major problem) or trigger Internet Explorer's popular mixed-mode security warning. This is easy to avoid by using the current page's protocol for your scripts as long as you're using a server which can handle either protocol (Google's CDN does; Yahoo's does not):
    document.location.protocol + '//path.to.example.com/something.js'
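If the same expression turns up in several places it may be worth wrapping it in a small helper. The sketch below is one way to do that; `matchProtocol` is a hypothetical name, and the fallback to https for pages which are neither http nor https (for example a bookmarklet run on a local file: page) is an added assumption rather than something the template does.

```javascript
// Build a script URL which uses the same protocol as the current page.
// `protocol` is passed in (normally document.location.protocol) so the
// function can be exercised outside a browser.
function matchProtocol(protocol, hostAndPath) {
    // Assumption: for anything other than http/https, fall back to https
    var scheme = (protocol === "http:" || protocol === "https:") ? protocol : "https:";
    return scheme + "//" + hostAndPath;
}
```

In a page this would be called as `matchProtocol(document.location.protocol, 'path.to.example.com/something.js')`.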



