Character encodings

APR 2012

This morning a 24 hr MTB race in Portugal was listed on RiderHQ - a cause for celebration - but notice the garbled address under location:

The location should read: Estádio

Tracking down (and fixing) the cause of this took a bit of investigation, so if you're interested in how websites handle 'non-standard' characters (or why £-symbols sometimes come out wrong when you send an email), here's what happened:

As you may know, computers represent everything with numbers. In order to communicate text, they need to assign a number to each character we might want to draw (e.g. the letter 'a' or the punctuation mark 'comma'). Mapping characters to numbers is called 'encoding', and there are many encodings around. Most of them agree on which numbers represent the letters on a US keyboard because the internet's early infrastructure was built in English. However, if you need to represent characters outside that set, reading the text with a different encoding from the one it was written in will lead to garbled text.
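To make that concrete, here is a small Python sketch of the same text being written in one encoding and read in another. The choice of UTF-8 and Latin-1 is an assumption made for illustration (they are a very common mismatched pair); it is not a statement about which encodings were involved on RiderHQ.

```python
# The same text, encoded one way and decoded another.
# UTF-8 and Latin-1 are illustrative choices, not necessarily the
# encodings involved in the RiderHQ bug.
text = "Estádio"

utf8_bytes = text.encode("utf-8")      # 'á' becomes two bytes: 0xC3 0xA1
latin1_bytes = text.encode("latin-1")  # 'á' becomes one byte: 0xE1

# Plain US-keyboard letters get the same numbers in both encodings...
assert "Est".encode("utf-8") == "Est".encode("latin-1")

# ...but reading UTF-8 bytes as if they were Latin-1 garbles the accent.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # EstÃ¡dio

# The same thing happens to a pound sign.
pound_garbled = "£5".encode("utf-8").decode("latin-1")
print(pound_garbled)  # Â£5
```

Note that nothing is lost at this point: the bytes are still correct, they are just being read with the wrong mapping.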

In our scenario there are several computers involved: event details begin life in your browser, running on your computer, and are sent to a RiderHQ web server, which passes them to a RiderHQ database server. Later they are read back from database to web server and finally arrive in someone else's browser. At each of these steps the receiving computer needs to know not just the characters themselves but also how they are encoded, and it has to pass both on correctly for the text to be readable end-to-end.

It turns out that in this case the issue was between the web browser and the web server - although the browser submitted the text in the correct encoding, it wasn't informing the web server which encoding had been used, and the web server was making an incorrect assumption. This is now fixed, so if you want to list a Spanish, French or Japanese event on RiderHQ, language issues shouldn't stop you!
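For the curious, this kind of browser-to-server mismatch can be sketched in a few lines of Python. This is not RiderHQ's code: the form body, the read_location helper and the Latin-1 default are all hypothetical, chosen only to show what happens when a server has to guess the encoding instead of being told.

```python
from urllib.parse import parse_qs

# Hypothetical form body as a browser might send it: the 'á' was
# encoded as UTF-8 (bytes 0xC3 0xA1) and then percent-escaped.
form_body = "location=Est%C3%A1dio"


def read_location(body, declared_charset=None, assumed_charset="latin-1"):
    """Decode the 'location' field (hypothetical helper).

    If the sender declared no charset, fall back to an assumed one.
    The Latin-1 default is an illustrative assumption.
    """
    charset = declared_charset or assumed_charset
    return parse_qs(body, encoding=charset)["location"][0]


# No encoding declared: the server guesses, and guesses wrong.
wrong = read_location(form_body)
print(wrong)  # EstÃ¡dio

# Encoding declared: the same bytes decode correctly.
right = read_location(form_body, declared_charset="utf-8")
print(right)  # Estádio
```

The bytes sent are identical in both calls; the only difference is whether the receiver knows how to interpret them.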