- Total page hits/unique visits snippet
3 years 46 weeks ago - Busy IRL, but happier than ever
3 years 47 weeks ago - Drupal page titles like breadcrumbs
4 years 4 weeks ago - Theming the Akismet spam counter
4 years 5 weeks ago - Akismet module v1.1.2 for Drupal 4.7
4 years 5 weeks ago
Web Design and Development
JavaScript Minification Part II
SVG with a little help from Raphaël
Prefix or Posthack
Supersize that Background, Please!
Taking Advantage of HTML5 and CSS3 with Modernizr
Stop Forking with CSS3
Getting married and going travelling
It's been a busy month. On Saturday the 5th of June I married the wonderful Natalie Downe in a beautiful ceremony at Roedean School in Brighton. The reception had owls, cheese, a ferret, a golden eagle, amazing Turkish food, Jewish chair dancing and lovely guests. It was the happiest day of my life.
The official wedding photos were taken by Drew McLellan, and there's a Flickr group pool as well. The day after the wedding Natalie's sister Louise took some fun photos of us running around Brighton in our wedding clothes.
Thanks to everyone who helped out with the preparations, and also to everyone who came along to share the special day with us. And a big thanks to Tom Coates, my best man.
Yesterday afternoon, we set out on our honeymoon. I'm writing this from the beach in Nice, on the south coast of France. Tomorrow we take the ferry to Corsica for a week in relative luxury. After that, we're backpacking around Europe, then Africa, then the rest of the world. We've given up our flat and put our stuff in to storage, and the plan is to keep on travelling until we get fed up or run out of money. We expect to be gone for at least 18 months.
Since we're both web developers, we're lucky to be able to take some of our work with us. I'll still be doing some work for the Guardian and Natalie is available for freelance work. If you have something you think we can help you with, drop us a line.
Naturally we'll be blogging, tweeting and Flickring our adventures. You can follow our updates at http://sparkabout.net/.
-->Web Fonts at the Crossing
Quick and Dirty Remote User Testing
Responsive Web Design
Crowdsourced document analysis and MP expenses
As you may have heard, the UK government released a fresh batch of MP expenses documents a week ago on Thursday. I spent that week working with a small team at Guardian HQ to prepare for the release. Here's what we built:
http://mps-expenses2.guardian.co.uk/
It's a crowdsourcing application that asks the public to help us dig through and categorise the enormous stack of documents - around 30,000 pages of claim forms, scanned receipts and hand-written letters, all scanned and published as PDFs.
This is the second time we've tried this - the first was back in June, and can be seen at mps-expenses.guardian.co.uk. Last week's attempt was an opportunity to apply the lessons we learnt the first time round.
Writing crowdsourcing applications in a newspaper environment is a fascinating challenge. Projects have very little notice - I heard about the new document release the Thursday before giving less than a week to put everything together. In addition to the fast turnaround for the application itself, the 48 hours following the release are crucial. The news cycle moves fast, so if the application launches but we don't manage to get useful data out of it quickly the story will move on before we can impact it.
ScaleCamp on the Friday meant that development work didn't properly kick off until Monday morning. The bulk of the work was performed by two server-side developers, one client-side developer, one designer and one QA on Monday, Tuesday and Wednesday. The Guardian operations team deftly handled our EC2 configuration and deployment, and we had some extra help on the day from other members of the technology department. After launch we also had a number of journalists helping highlight discoveries and dig through submissions.
The system was written using Django, MySQL (InnoDB), Redis and memcached.
Asking the right questionThe biggest mistake we made the first time round was that we asked the wrong question. We tried to get our audience to categorise documents as either "claims" or "receipts" and to rank them as "not interesting", "a bit interesting", "interesting but already known" and "someone should investigate this". We also asked users to optionally enter any numbers they saw on the page as categorised "line items", with the intention of adding these up later.
The line items, with hindsight, were a mistake. 400,000 documents makes for a huge amount of data entry and for the figures to be useful we would need to confirm their accuracy. This would mean yet more rounds of crowdsourcing, and the job was so large that the chance of getting even one person to enter line items for each page rapidly diminished as the news story grew less prominent.
The categorisations worked reasonably well but weren't particularly interesting - knowing if a document is a claim or receipt is useful only if you're going to collect line items. The "investigate this" button worked very well though.
We completely changed our approach for the new system. We dropped the line item task and instead asked our users to categories each page by applying one or more tags, from a small set that our editors could control. This gave us a lot more flexibility - we changed the tags shortly before launch based on the characteristics of the documents - and had the potential to be a lot more fun as well. I'm particularly fond of the "hand-written" tag, which has highlighted some lovely examples of correspondence between MPs and the expenses office.
Sticking to an editorially assigned set of tags provided a powerful tool for directing people's investigations, and also ensured our users didn't start creating potentially libellous tags of their own.
Breaking it up in to assignmentsFor the first project, everyone worked together on the same task to review all of the documents. This worked fine while the document set was small, but once we had loaded in 400,000+ pages the progress bar become quite depressing.
This time round, we added a new concept of "assignments". Each assignment consisted of the set of pages belonging to a specified list of MPs, documents or political parties. Assignments had a threshold, so we could specify that a page must be reviewed by at least X people before it was considered reviewed. An editorial tool let us feature one "main" assignment and several alternative assignments right on the homepage.
Clicking "start reviewing" on an assignment sets a cookie for that assignment, and adds the assignment's progress bar to the top of the review interface. New pages are selected at random from the set of unreviewed pages in that assignment.
The assignments system proved extremely effective. We could use it to direct people to the highest value documents (our top hit list of interesting MPs, or members of the shadow cabinet) while still allowing people with specific interests to pick an alternative task.
Get the button right!Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big "start reviewing" buttons. Both were broken in different ways.
The first time round, the mistakes were around scalability. I used a SQL "ORDER BY RAND()" statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn't matter since the button would only be clicked occasionally.
Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slow downs and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.
This solved the performance problem, but meant that our user activity wasn't nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page - and a random distribution is almost certainly the easiest way to achieve that.
The second time round I turned to my new favourite in-memory data structure server, redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages had been reviewed. It then uses redis set intersection (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.
This is where the bug crept in. Redis was just being used as an optimisation - the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some - the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.
This meant the "next page" button would occasionally turn up a page that didn't exist. I had some very poorly considered fallback logic for that - if the random page didn't exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page - which subsequently attracted well over a thousand individual reviews.
Next time, I'm going to try and make the "next" button completely bullet proof! I'm also going to maintain a "denormalisation dictionary" documenting every denormalisation in the system in detail - such a thing would have saved me several hours of confused debugging.
Exposing the resultsThe biggest mistake I made last time was not getting the data back out again fast enough for our reporters to effectively use it. It took 24 hours from the launch of the application to the moment the first reporting feature was added - mainly because we spent much of the intervening time figuring out the scaling issues.
This time we handled this a lot better. We provided private pages exposing all recent activity on the site. We also provided public pages for each of the tags, as well as combination pages for party + tag, MP + tag, document + tag, assignment + tag and user + tag. Most of these pages were ordered by most-tagged, with the hope that the most interesting pages would quickly bubble to the top.
This worked pretty well, but we made one key mistake. The way we were ordering pages meant that it was almost impossible to paginate through them and be sure that you had seen everything under a specific tag. If you're trying to keep track of everything going on in the site, reliable pagination is essential. The only way to get reliable pagination on a fast moving site is to order by the date something was first added to a set in ascending order. That way you can work through all of the pages, wait a bit, hit "refresh" and be able to continue paginating where you left off. Any other order results in the content of each page changing as new content comes in.
We eventually added an undocumented /in-order/ URL prefix to address this issue. Next time I'll pay a lot more attention to getting the pagination options right from the start.
Rewarding our contributorsThe reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn't want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time.
For the new version, we tried to provide a much better feeling of activity around the site. We added "top reviewer" tables to every assignment, MP and political party as well as a "most active reviewers in the past 48 hours" table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.
Most importantly, we added a concept of discoveries - editorially highlighted pages that were shown on the homepage and credited to the user that had first highlighted them. These discoveries also added valuable editorial interest to the site, showing up on the homepage and also the index pages for political parties and individual MPs.
Light-weight registrationFor both projects, we implemented an extremely light-weight form of registration. Users can start reviewing pages without going through any signup mechanism, and instead are assigned a cookie and an anon-454 style username the first time they review a document. They are then encouraged to assign themselves a proper username and password so they can log in later and take credit for their discoveries.
It's difficult to tell how effective this approach really is. I have a strong hunch that it dramatically increases the number of people who review at least one document, but without a formal A/B test it's hard to tell how true that is. The UI for this process in the first project was quite confusing - we gave it a solid makeover the second time round, which seems to have resulted in a higher number of conversions.
Overall lessonsNews-based crowdsourcing projects of this nature are both challenging and an enormous amount of fun. For the best chances of success, be sure to ask the right question, ensure user contributions are rewarded, expose as much data as possible and make the "next thing to review" behaviour rock solid. I'm looking forward to the next opportunity to apply these lessons, although at this point I really hope it involves something other than MPs' expenses.
-->Node.js is genuinely exciting
I gave a talk on Friday at Full Frontal, a new one day JavaScript conference in my home town of Brighton. I ended up throwing away my intended topic (JSONP, APIs and cross-domain security) three days before the event in favour of a technology which first crossed my radar less than two weeks ago.
That technology is Ryan Dahl's Node. It's the most exciting new project I've come across in quite a while.
At first glance, Node looks like yet another take on the idea of server-side JavaScript, but it's a lot more interesting than that. It builds on JavaScript's excellent support for event-based programming and uses it to create something that truly plays to the strengths of the language.
Node describes itself as "evented I/O for V8 javascript". It's a toolkit for writing extremely high performance non-blocking event driven network servers in JavaScript. Think similar to Twisted or EventMachine but for JavaScript instead of Python or Ruby.
Evented I/O?As I discussed in my talk, event driven servers are a powerful alternative to the threading / blocking mechanism used by most popular server-side programming frameworks. Typical frameworks can only handle a small number of requests simultaneously, dictated by the number of server threads or processes available. Long-running operations can tie up one of those threads - enough long running operations at once and the server runs out of available threads and becomes unresponsive. For large amounts of traffic, each request must be handled as quickly as possible to free the thread up to deal with the next in line.
This makes certain functionality extremely difficult to support. Examples include handling large file uploads, combining resources from multiple backend web APIs (which themselves can take an unpredictable amount of time to respond) or providing comet functionality by holding open the connection until a new event becomes available.
Event driven programming takes advantage of the fact that network servers spend most of their time waiting for I/O operations to complete. Operations against in-memory data are incredibly fast, but anything that involves talking to the filesystem or over a network inevitably involves waiting around for a response.
With Twisted, EventMachine and Node, the solution lies in specifying I/O operations in conjunction with callbacks. A single event loop rapidly switches between a list of tasks, firing off I/O operations and then moving on to service the next request. When the I/O returns, execution of that particular request is picked up again.
(In the talk, I attempted to illustrate this with a questionable metaphor involving hamsters, bunnies and a hyperactive squid).
What makes Node exciting?If systems like this already exist, what's so exciting about Node? Quite a few things:
- JavaScript is extremely well suited to programming with callbacks. Its anonymous function syntax and closure support is perfect for defining inline callbacks, and client-side development in general uses event-based programming as a matter of course: run this function when the user clicks here / when the Ajax response returns / when the page loads. JavaScript programmers already understand how to build software in this way.
- Node represents a clean slate. Twisted and EventMachine are hampered by the existence of a large number of blocking libraries for their respective languages. Part of the difficulty in learning those technologies is understanding which Python or Ruby libraries you can use and which ones you have to avoid. Node creator Ryan Dahl has a stated aim for Node to never provide a blocking API - even filesystem access and DNS lookups are catered for with non-blocking callback based APIs. This makes it much, much harder to screw things up.
- Node is small. I read through the API documentation in around half an hour and felt like I had a pretty comprehensive idea of what Node does and how I would achieve things with it.
- Node is fast. V8 is the fast and keeps getting faster. Node's event loop uses Marc Lehmann's highly regarded libev and libeio libraries. Ryan Dahl is himself something of a speed demon - he just replaced Node's HTTP parser implementation (already pretty speedy due to it's Ragel / Mongrel heritage) with a hand-tuned C implementation with some impressive characteristics.
- Easy to get started. Node ships with all of its dependencies, and compiles cleanly on Snow Leopard out of the box.
With both my JavaScript and server-side hats on, Node just feels right. The APIs make sense, it fits a clear niche and despite its youth (the project started in February) everything feels solid and well constructed. The rapidly growing community is further indication that Ryan is on to something great here.
What does Node look like?Here's how to get Hello World running in Node in 7 easy steps:
- git clone git://github.com/ry/node.git (or download and extract a tarball)
- ./configure
- make (takes a while, it needs to compile V8 as well)
- sudo make install
- Save the below code as helloworld.js
- node helloworld.js
- Visit http://localhost:8080/ in your browser
Here's helloworld.js:
var sys = require('sys'), http = require('http'); http.createServer(function(req, res) { res.sendHeader(200, {'Content-Type': 'text/html'}); res.sendBody('<h1>Hello World</h1>'); res.finish(); }).listen(8080); sys.puts('Server running at http://127.0.0.1:8080/');If you have Apache Bench installed, try running ab -n 1000 -c 100 'http://127.0.0.1:8080/' to test it with 1000 requests using 100 concurrent connections. On my MacBook Pro I get 3374 requests a second.
So Node is fast - but where it really shines is concurrency with long running requests. Alter the helloworld.js server definition to look like this:
http.createServer(function(req, res) { setTimeout(function() { res.sendHeader(200, {'Content-Type': 'text/html'}); res.sendBody('<h1>Hello World</h1>'); res.finish(); }, 2000); }).listen(8080);We're using setTimeout to introduce an artificial two second delay to each request. Run the benchmark again - I get 49.68 requests a second, with every single request taking between 2012 and 2022 ms. With a two second delay, the best possible performance for 1000 requests 100 at a time is 1000 requests / (1000 / 100) * 2 seconds = 50 requests a second. Node hits it pretty much bang on the nose.
The most important line in the above examples is res.finish(). This is the mechanism Node provides for explicitly signalling that a request has been fully processed and should be returned to the browser. By making it explicit, Node makes it easy to implement comet patterns like long polling and streaming responses - stuff that is decidedly non trivial in most server-side frameworks.
djangodeNode's core APIs are pretty low level - it has HTTP client and server libraries, DNS handling, asynchronous file I/O etc, but it doesn't give you much in the way of high level web framework APIs. Unsurprisingly, this has lead to a cambrian explosion of lightweight web frameworks based on top of Node - the projects using node page lists a bunch of them. Rolling a framework is a great way of learning a low-level API, so I've thrown together my own - djangode - which brings Django's regex-based URL handling to Node along with a few handy utility functions. Here's a simple djangode application:
var dj = require('./djangode'); var app = dj.makeApp([ ['^/$', function(req, res) { dj.respond(res, 'Homepage'); }], ['^/other$', function(req, res) { dj.respond(res, 'Other page'); }], ['^/page/(\\d+)$', function(req, res, page) { dj.respond(res, 'Page ' + page); }] ]); dj.serve(app, 8008);djangode is currently a throwaway prototype, but I'll probably be extending it with extra functionality as I explore more Node related ideas.
nodecastMy main demo in the Full Frontal talk was nodecast, an extremely simple broadcast-oriented comet application. Broadcast is my favourite "hello world" example for comet because it's both simpler than chat and more realistic - I've been involved in plenty of projects that could benefit from being able to broadcast events to their audience, but few that needed an interactive chat room.
The source code for the version I demoed can be found on GitHub in the no-redis branch. It's a very simple application - the client-side JavaScript simply uses jQuery's getJSON method to perform long-polling against a simple URL endpoint:
function fetchLatest() { $.getJSON('/wait?id=' + last_seen, function(d) { $.each(d, function() { last_seen = parseInt(this.id, 10) + 1; ul.prepend($('<li></li>').text(this.text)); }); fetchLatest(); }); }Doing this recursively is probably a bad idea since it will eventually blow the browser's JavaScript stack, but it works OK for the demo.
The more interesting part is the server-side /wait URL which is being polled. Here's the relevant Node/djangode code:
var message_queue = new process.EventEmitter(); var app = dj.makeApp([ // ... ['^/wait$', function(req, res) { var id = req.uri.params.id || 0; var messages = getMessagesSince(id); if (messages.length) { dj.respond(res, JSON.stringify(messages), 'text/plain'); } else { // Wait for the next message var listener = message_queue.addListener('message', function() { dj.respond(res, JSON.stringify(getMessagesSince(id)), 'text/plain' ); message_queue.removeListener('message', listener); clearTimeout(timeout); }); var timeout = setTimeout(function() { message_queue.removeListener('message', listener); dj.respond(res, JSON.stringify([]), 'text/plain'); }, 10000); } }] // ... ]);The wait endpoint checks for new messages and, if any exist, returns immediately. If there are no new messages it does two things: it hooks up a listener on the message_queue EventEmitter (Node's equivalent of jQuery/YUI/Prototype's custom events) which will respond and end the request when a new message becomes available, and also sets a timeout that will cancel the listener and end the request after 10 seconds. This ensures that long polls don't go on too long and potentially cause problems - as far as the browser is concerned it's just talking to a JSON resource which takes up to ten seconds to load.
When a message does become available, calling message_queue.emit('message') will cause all waiting requests to respond with the latest set of messages.
Talking to databasesnodecast keeps track of messages using an in-memory JavaScript array, which works fine until you restart the server and lose everything. How do you implement persistent storage?
For the moment, the easiest answer lies with the NoSQL ecosystem. Node's focus on non-blocking I/O makes it hard (but not impossible) to hook it up to regular database client libraries. Instead, it strongly favours databases that speak simple protocols over a TCP/IP socket - or even better, databases that communicate over HTTP. So far I've tried using CouchDB (with node-couch) and redis (with redis-node-client), and both worked extremely well. nodecast trunk now uses redis to store the message queue, and provides a nice example of working with a callback-based non-blocking database interface:
var db = redis.create_client(); var REDIS_KEY = 'nodecast-queue'; function addMessage(msg, callback) { db.llen(REDIS_KEY, function(i) { msg.id = i; // ID is set to the queue length db.rpush(REDIS_KEY, JSON.stringify(msg), function() { message_queue.emit('message', msg); callback(msg); }); }); }Relational databases are coming to Node. Ryan has a PostgreSQL adapter in the works, thanks to that database already featuring a mature non-blocking client library. MySQL will be a bit tougher - Node will need to grow a separate thread pool to integrate with the official client libs - but you can talk to MySQL right now by dropping in DBSlayer from the NY Times which provides an HTTP interface to a pool of MySQL servers.
Mixed environmentsI don't see myself switching all of my server-side development over to JavaScript, but Node has definitely earned a place in my toolbox. It shouldn't be at all hard to mix Node in to an existing server-side environment - either by running both behind a single HTTP proxy (being event-based itself, nginx would be an obvious fit) or by putting Node applications on a separate subdomain. Node is a tempting option for anything involving comet, file uploads or even just mashing together potentially slow loading web APIs. Expect to hear a lot more about it in the future.
Further reading- Ryan's JSConf.eu presentation is the best discussion I've seen anywhere of the design philosophy behind Node.
- Node's API documentation is essential reading.
- Streaming file uploads with node.js illustrates how well suited Node is to accepting large file uploads.
- The nodejs Google Group is the hub of the Node community.
Habit Fields
A Brief History of Markup
Comprehensive notes from my three hour Redis tutorial
Last week I presented two talks at the inaugural NoSQL Europe conference in London. The first was presented with Matthew Wall and covered the ways in which we have been exploring NoSQL at the Guardian. The second was a three hour workshop on Redis, my favourite piece of software to have the NoSQL label applied to it.
I've written about Redis here before, and it has since earned a place next to MySQL/PostgreSQL and memcached as part of my default web application stack. Redis makes write-heavy features such as real-time statistics feasible for small applications, while effortlessly scaling up to handle larger projects as well. If you haven't tried it out yet, you're sorely missing out.
For the workshop, I tried to give an overview of each individual Redis feature along with detailed examples of real-world problems that the feature can help solve. I spent the past day annotating each slide with detailed notes, and I think the result makes a pretty good stand-alone tutorial. Here's the end result:
Redis tutorial slides and notes
In unrelated news, Nat and I both completed the first ever Brighton Marathon last weekend, in my case taking 4 hours, 55 minutes and 17 seconds. Sincere thanks to everyone who came out to support us - until the race I had never appreciated how important the support of the spectators is to keep going to the end. We raised £757 for the Have a Heart children's charity. Thanks in particular to Clearleft who kindly offered to match every donation.
-->Better JavaScript Minification
Design Patterns: Faceted Navigation
WildlifeNearYou talk at £5 app, and being Wired (not Tired)
Two quick updates about WildlifeNearYou. First up, I gave a talk about the site at £5 app, my favourite Brighton evening event which celebrates side projects and the joy of Making Stuff. I talked about the site's genesis on a fort, crowdsourcing photo ratings, how we use Freebase and DBpedia and how integrating with Flickr's machine tags gave us a powerful location API for free. Here's the video of the talk, courtesy of Ian Oszvald:
£5 App #22 WildLifeNearYou by Simon Willison and Natalie Downe from IanProCastsCoUk on Vimeo.
Secondly, I'm excited to note that WildlifeNearYou spin-off OwlsNearYou.com is featured in UK Wired magazine's Wired / Tired / Expired column... and we're Wired!
-->

Recent comments
3 years 27 weeks ago
3 years 30 weeks ago
3 years 30 weeks ago
3 years 31 weeks ago
3 years 32 weeks ago
3 years 32 weeks ago
3 years 35 weeks ago
3 years 35 weeks ago
3 years 38 weeks ago
3 years 38 weeks ago