December 1, 2020
Internationalised Domain Names
I have recently launched a new website (wajad.art) with an internationalised domain name (IDN). The domain name and all the page paths are in Arabic, which makes things more fun given Arabic is written right to left (RTL).
The domain name is وجد.موقع, and موقع is the top-level domain (TLD), which is the equivalent of the site TLD.
An example of a full page URL:
وجد.موقع/م/أساطير-خليجية
How do IDNs work?
Because the Domain Name System (DNS) is limited to ASCII characters, IDNs are stored as ASCII strings using Punycode, which is:
a representation of Unicode with the limited ASCII character subset used for Internet hostnames
So while my newly launched website's domain name is وجد.موقع, in the DNS it is stored in the Punycode equivalent:
xn--rgbg7e.xn--4gbrim
Fortunately, there are Punycode converters such as:
- punycoder.com
- punycode.io
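The same conversion can be sketched in Python using only the standard library. This is a minimal sketch, assuming Python 3.7+ and its built-in punycode and idna codecs; the helper names are made up for illustration, and the hand-rolled version skips the normalisation and validation a real IDNA implementation performs. The built-in idna codec follows the older IDNA 2003 rules, so for some names it may disagree with browsers and registries.

```python
def label_to_ascii(label: str) -> str:
    """Encode one DNS label: ASCII passes through unchanged, anything else
    becomes 'xn--' followed by its Punycode form."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def domain_to_ascii(domain: str) -> str:
    """Encode each dot-separated label (no normalisation or validation)."""
    return ".".join(label_to_ascii(label) for label in domain.split("."))


if __name__ == "__main__":
    domain = "وجد.موقع"
    print(domain_to_ascii(domain))   # xn--rgbg7e.xn--4gbrim
    # The built-in idna codec performs the same conversion in one call
    print(domain.encode("idna"))     # b'xn--rgbg7e.xn--4gbrim'
    # ...and decodes the ASCII form back to Unicode
    print(b"xn--rgbg7e.xn--4gbrim".decode("idna"))
```

The xn-- prefix is what marks a label as Punycode-encoded; labels that are already ASCII, such as com, are left untouched.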
Domain Name vs Page Path
IDNs use Punycode to work around the DNS limitation of only supporting ASCII characters. This limitation does not apply to the rest of the URL, so you can use Unicode characters in the page path.
MDN's What is a URL? is a good resource to learn the different parts of a URL.
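To make the split concrete, here is a rough Python sketch that converts a Unicode URL into an ASCII-only one: the host goes through Punycode, while the path is percent-encoded as UTF-8 bytes. The function name is hypothetical, it assumes an absolute URL with a scheme and no user info, and it relies on Python's built-in idna codec (IDNA 2003), so treat it as an approximation of what browsers do rather than a faithful reimplementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Return an ASCII-only form of a URL: Punycode host, percent-encoded path."""
    parts = urlsplit(url)
    # Host: every non-ASCII label becomes 'xn--...'
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Path: non-ASCII characters become %XX escapes of their UTF-8 bytes;
    # '%' is kept as-is so an already-encoded path is not encoded twice
    path = quote(parts.path, safe="/%")
    # Query and fragment are passed through untouched in this sketch
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_ascii_url("https://وجد.موقع/م/أساطير-خليجية"))
```

Note that the two halves use different encodings: only the host is Punycode, while the path uses the same percent-encoding as any other non-ASCII URL path.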
Browsers
Unicode vs Punycode (ASCII)
When navigating to a website with an IDN via the browser address bar, both the Unicode (e.g. وجد.موقع) and the Punycode (e.g. xn--rgbg7e.xn--4gbrim) work.
Even if you type the Punycode into the address bar, web browsers may automatically convert the URL to the (human-friendly) Unicode equivalent if the URL meets the browser's IDN policy:
- Google Chrome's IDN policy
- Firefox's IDN Display Algorithm
The goal of these policies is to protect users from IDN homograph attacks, where a domain uses look-alike characters from another script to impersonate a familiar one. There are also browser extensions that alert users if they are on a site that uses Punycode in its domain name.
Earlier this year, HTTP 203, a show on the Google Chrome Developers YouTube channel, released an episode titled "Humans can't read URLs. How can we fix it?" In it, Jake and Surma briefly discuss how Chrome analyses the URL and when it may choose to display the Punycode over the Unicode.
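To give a feel for the kind of check an IDN policy might include, here is a deliberately crude Python sketch that flags labels mixing letters from more than one script. It is not Chrome's or Firefox's actual algorithm: it guesses the script from the first word of each character's Unicode name, which is only an approximation, and the function names are made up for illustration.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the scripts used in a label from Unicode character names.

    The first word of a letter's name ('LATIN', 'CYRILLIC', 'ARABIC', ...)
    stands in for its script; digits and hyphens are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


if __name__ == "__main__":
    spoof = "\u0430pple"  # Cyrillic 'а' followed by Latin 'pple'
    print(scripts_in_label(spoof), looks_mixed_script(spoof))
    print(scripts_in_label("وجد"), looks_mixed_script("وجد"))
```

A label written entirely in Arabic passes this check, while a label that swaps in a single Cyrillic letter among Latin ones does not, which is roughly why a browser might show the former in Unicode and fall back to Punycode for the latter.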
RTL vs LTR
If you have ever mixed RTL and LTR languages when typing on a digital device, you will certainly have experienced the frustration of getting words to flow correctly. The browser address bar doesn't handle this too well either.
In the case of وجد.موقع, it is read RTL. However, adding the https protocol at the start means the address starts LTR. So you end up with:
https://وجد.موقع
This happens even if you set the browser's language to Arabic, which converts the UI to RTL.
This is not a huge pain point, but it does look odd. As a developer I know the start is https://. However, an Arabic speaker who is not familiar with the protocol and its uses may interpret this as a URL that ends in https://.
This may be slightly off-topic, but it is worth noting that things become even more confusing if you use an ASCII domain name (LTR) with an RTL page path and vice versa:
وجد.موقع/path/to/page
xn--rgbg7e.xn--4gbrim/الصفحة-1/الصفحة-2
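One way to see why such URLs render confusingly is to look at the directionality of each character in the order it is stored. The following Python sketch groups a string into runs by Unicode bidirectional class using the standard unicodedata module. The function names and the three coarse categories are simplifications chosen for illustration; the real Unicode Bidirectional Algorithm that renderers apply is considerably more involved.

```python
import unicodedata
from itertools import groupby


def direction(ch: str) -> str:
    """Map a character's Unicode bidi class to a coarse direction."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):   # right-to-left letters, including Arabic
        return "RTL"
    if bidi == "L":           # left-to-right letters
        return "LTR"
    return "other"            # digits, punctuation, separators, ...


def directional_runs(text: str) -> list:
    """Split text into (direction, substring) runs in logical (typed) order."""
    return [(d, "".join(chars)) for d, chars in groupby(text, key=direction)]


if __name__ == "__main__":
    for run in directional_runs("https://وجد.موقع/path/to/page"):
        print(run)
```

The separators (the colon, slashes and dots) have no strong direction of their own, so the renderer has to decide which neighbouring run they belong to, and that is where the visual order starts to diverge from the order in which the URL was typed.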
Copying the URL
When you are on a website with an IDN and copy the URL directly from the address bar, what gets copied into your clipboard varies across browsers.
Firefox (83.0) copies the Unicode:
https://وجد.موقع
Chrome's (87.0.4280.66) behaviour is more sophisticated. If you include the https protocol when you copy the URL from the address bar, it copies the Punycode into your clipboard:
https://xn--rgbg7e.xn--4gbrim
If you exclude the https protocol, it copies the Unicode:
وجد.موقع
The above behaviour only applies to the domain name. When it comes to the page path, the behaviour is also inconsistent across browsers.
Firefox (83.0) percent-encodes the UTF-8 bytes of the page path when the URL is copied (think JavaScript's encodeURI(), or PHP's urlencode()), which is a huge UX pain for me in general and not only with IDNs. Receiving a URL in a chat app that fills up half my phone's screen with %s and a mix of meaningless digits and English characters is pointless to me as a user.
https://وجد.موقع/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9
On Chrome (87.0.4280.66), if the https protocol is included, it copies the Punycode domain name and the encoded page path:
https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9
If the https protocol is excluded, it copies the whole URL in Unicode:
وجد.موقع/م/أساطير-خليجية
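If you end up with the encoded form in your clipboard and want something readable to share, the conversion can be reversed. Below is a minimal Python sketch, assuming an absolute URL with a scheme; the function name is hypothetical, and the result is meant for display only, since decoding escapes such as %2F can change what a path means.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def to_readable_url(url: str) -> str:
    """Turn a Punycode host and percent-encoded path back into Unicode."""
    parts = urlsplit(url)
    host = parts.hostname
    if host.isascii():
        # 'xn--...' labels are decoded; plain ASCII labels pass through
        host = host.encode("ascii").decode("idna")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # %XX escapes are interpreted as UTF-8 bytes
    path = unquote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    copied = (
        "https://xn--rgbg7e.xn--4gbrim/%D9%85/"
        "%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9"
    )
    print(to_readable_url(copied))
```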
Sharing the URL
Web browsers on smartphones and tablets offer a built-in sharing option, which gives you the choice to copy the URL or share directly to native apps. The behaviour across browsers here is also inconsistent.
The same browser may not behave consistently when copying the URL from the address bar versus copying or sharing it using the built-in share option. Samsung Internet (13.0.1.64), for instance, copies the Unicode (domain and page path) if you copy the URL directly from the address bar:
https://وجد.موقع/م/أساطير-خليجية
However, it copies the Punycode and the encoded page path when using the built-in share option:
https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9
JavaScript
The Location API returns the domain name in Punycode and encodes page paths:
{
  "ancestorOrigins": {},
  "href": "https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9",
  "origin": "https://xn--rgbg7e.xn--4gbrim",
  "protocol": "https:",
  "host": "xn--rgbg7e.xn--4gbrim",
  "hostname": "xn--rgbg7e.xn--4gbrim",
  "port": "",
  "pathname": "/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9",
  "search": "",
  "hash": ""
}
The wilderness
I have used a number of services in which I had to enter the IDN for Wajad or on which the domain name is displayed.
Domain registration
Registering an IDN under a common TLD like .com is not an obstacle. Some domain registrars allow you to use Unicode when searching for domains, e.g. وجد.com.
But I was looking for the internationalised TLD موقع. It was not easy finding a domain registrar that sold موقع domains. I ended up on multiple scammy-looking sites during my search. Eventually I bought the domain via maracaria.com.
Cloudflare
I had no issues using Unicode when adding the site to Cloudflare. The domain is also displayed in Unicode in the dashboard.
However, Cloudflare has used the Punycode in the email notifications they have sent me so far.
Netlify
Before launching the site, I set up a "coming soon" landing page on Netlify. Unlike Cloudflare, Netlify did not allow me to add the domain name with Unicode, and I had to enter the Punycode equivalent. Netlify's dashboard displays the domain in Punycode.
Their email notifications also display the domain in Punycode.
Cloudways
Wajad's current PHP-based site is hosted on DigitalOcean via Cloudways. The experience on Cloudways is similar to Netlify: I had to enter the Punycode.
Google Search Console
I was able to add the site to Google Search Console with the Unicode version of the domain. Oddly, some subsequent forms did not accept Unicode.
So I had to enter the Punycode equivalent, but Google Search Console displayed the URL in Unicode after submitting the form.
Fortunately, email notifications use Unicode.
Google Search results
Google Search results display the domain name in Unicode. I already knew it displayed Arabic correctly for breadcrumbs, but it is really nice to see the domain name displayed in a human-friendly manner.
Both Unicode and Punycode are supported when using search operators such as the site: operator:
site:وجد.موقع
site:xn--rgbg7e.xn--4gbrim
Bing Webmaster Tools
Bing Webmaster Tools allows you to import verified sites from Google Search Console. Upon an import attempt it displayed an error message saying the site addition was unsuccessful.
I attempted to enter the URL manually as suggested, but the Unicode was not accepted.
Then when I went to check the list of sites under my account, Wajad was actually listed! I'm not entirely sure which of the above attempts was the successful one.
Bing Webmaster Tools lists the domain in Unicode, but when you open the dashboard for the site it lists the Punycode.
I had the opposite experience to Google Search Console when submitting the sitemap. The form accepted the Unicode, but the sitemap list displays the Punycode.
Bing search results
I have only recently submitted the sitemap via Bing Webmaster Tools, so I still do not know the full picture. From what I can tell so far Bing search results also display the domain in Unicode.
However, it seems only Punycode is supported when using search operators such as the site: operator:
site:xn--rgbg7e.xn--4gbrim
Fathom Analytics
I had no issue using the Unicode version of the domain when adding the site to Fathom Analytics. The domain is always displayed in Unicode (dashboard and email notifications).
Their recently-launched tool Phantom Analyzer also allowed me to enter the URL in Unicode, but the results page displayed the domain in Punycode.
Zoho Mail
Neither Unicode nor Punycode is supported when signing up to Zoho Mail.
Emails
G Suite (now Google Workspace) allowed me to sign up with my IDN. I sent test emails to Gmail, Yahoo and Outlook. Gmail was the only one to display the domain name in Unicode.
I have also sent HTML email tests with images. Yahoo Mail and Windows Mail did not load images whose src had the domain in Unicode, but Gmail did:
<img src="https://وجد.موقع/path/to/image.jpg" alt="">
Auto-linking
When sending messages via chat apps, adding the https protocol to the URL (with the domain name in Unicode) seems to be enough for most apps to auto-link it.
Although email clients are known for linking text in HTML emails when you don't want them to, I found Gmail and Windows Mail don't auto-link:
https://وجد.موقع
In-app browsers
The behaviour of in-app browsers is inconsistent too. On iOS, Instagram's in-app browser displays the domain in Punycode, while Twitter's in-app browser displays the domain in Unicode.
I understand, but..
I understand why I'm seeing very different behaviour across browsers and apps, but as a developer and a user I would love to see a better user experience overall.
Wajad is still a young side project, but it is clear to me that I'll run into more interesting IDN-related scenarios as it grows and I'll try my best to document them.