I was recently surprised by a 404 which I noticed in Google's Webmaster Tools,
which pointed to a truncated URL (
http://www.wdl.org/en/item/) which is
never actually linked to. This happens frequently but in this case the failure was
interesting because the source link was an unlinked URL on a Persian blog and the
source link actually worked:
The canonical URL for that page is http://www.wdl.org/en/item/2679/ and the author had presumably pasted the URL into a page where their software had helpfully converted the latin digits (e.g.
0123456789) into what are
Eastern Arabic, Hindi or Indic-Arabic numerals:
٠١٢٣٤٥٦٧٨٩ (a closer look shows that this is the Persian variant as the
6 is actually ۶ instead of
Presumably Googlebot scans HTML for text which look like URLs but uses a limited parser which breaks as soon as it finds a character which isn't in a limited set of characters, presumably only ISO-8859-1, causing it to break the URL while other services (e.g. Facebook, Google+, Github, etc.) extract the full URL.
So … mystery solved, we're done, right?
Wait, why does that URL work in the first place? We never added that as a supported feature and, being security-aware, all of our URLs are carefully validated to ensure that the IDs are valid numeric values – both as part of Django's URL dispatching and when the item is retrieved from the database. Besides, since the item IDs aren't assigned using Eastern-Arabic digits how does it actually match the record?
Python's documentation for
int() doesn't mention anything about this but
gives us a clue:
Decimal characters include digit characters, and all characters that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.
What's happening here is that Python uses the Unicode definition of “digit”, which is actually a fairly long list.
The specification uses the
“Nd = Number, decimal digit” classification
and a quick look at the
Unidata script extensions list
shows quite a few different characters flagged with Nd. Python should treat all of those
as valid digits for the purposes of calling
and even the regular expressions used to validate our URLs. ۲۶۷۹ will match
and will have been converted to the number
2679 by the time the database
If you want to see this in action, I created a quick test script which allows you to test numeric characters from the full list. Try it interactively in your browser: http://repl.it/X3G.
Left as an exercise for the reader is building a tool which will translate URLs to a specific script, such as this Burmese variant for the Ramayana: http://www.wdl.org/en/item/၁၄၂၈၅/