So you think you know what a number is…

I was recently surprised by a 404 which I noticed in Google's Webmaster Tools, which pointed to a truncated URL (http://www.wdl.org/en/item/) which is never actually linked to. This happens frequently but in this case the failure was interesting because the source link was an unlinked URL on a Persian blog and the source link actually worked: http://www.wdl.org/en/item/۲۶۷۹/.
The canonical URL for that page is http://www.wdl.org/en/item/2679/ and the author had presumably pasted the URL into a page where their software had helpfully converted the latin digits (e.g. 0123456789) into what are known as Eastern Arabic, Hindi or Indic-Arabic numerals: ٠١٢٣٤٥٦٧٨٩ (a closer look shows that this is the Persian variant as the 6 is actually ۶ instead of ٦).

Presumably Googlebot scans HTML for text which look like URLs but uses a limited parser which breaks as soon as it finds a character which isn't in a limited set of characters, presumably only ISO-8859-1, causing it to break the URL while other services (e.g. Facebook, Google+, Github, etc.) extract the full URL.

So … mystery solved, we're done, right?

Wait, why does that URL work in the first place? We never added that as a supported feature and, being security-aware, all of our URLs are carefully validated to ensure that the IDs are valid numeric values – both as part of Django's URL dispatching and when the item is retrieved from the database. Besides, since the item IDs aren't assigned using Eastern-Arabic digits how does it actually match the record?

Python's documentation for int() doesn't mention anything about this but isdecimal() gives us a clue:

Decimal characters include digit characters, and all characters that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.

What's happening here is that Python uses the Unicode definition of “digit”, which is actually a fairly long list. The specification uses the “Nd = Number, decimal digit” classification and a quick look at the Unidata script extensions list shows quite a few different characters flagged with Nd. Python should treat all of those as valid digits for the purposes of calling int(), isdecimal(), and even the regular expressions used to validate our URLs. ۲۶۷۹ will match \d+ and will have been converted to the number 2679 by the time the database sees it.

If you want to see this in action, I created a quick test script which allows you to test numeric characters from the full list. Try it interactively in your browser: http://repl.it/X3G.

Left as an exercise for the reader is building a tool which will translate URLs to a specific script, such as this Burmese variant for the Ramayana: http://www.wdl.org/en/item/၁၄၂၈၅/

comments powered by Disqus