August 21, 2021

Arabic script on the Web

A few months ago I wondered whether it was possible to use CSS to control the positioning of diacritics in Arabic script. That led me to a very interesting read: Text Layout Requirements for the Arabic Script (W3C working draft).

Even as a native Arabic speaker, there are many points in that document (or other documents it links to) I had not really thought about. The Arabic and Persian Gap Analysis also highlights gaps in the support for Arabic script on the web.

It is worth noting that this affects more than just "websites". The document specifically says "on the Web and in eBooks". It also covers in-browser apps and apps built on browser engines, such as Electron apps. The current gaps limit how well in-browser design apps in particular can support Arabic script (this doesn't excuse Figma for not supporting RTL/Arabic at all!).

Unicode BIDI Algorithm

An important piece of RTL text support in browsers is the Unicode Bidirectional Algorithm:

It is important to understand from the outset that, in all major web browsers, the order of characters in memory (logical) is not the same as the order in which they are displayed (visual).

The set of rules applied by the browser to produce the correct order at the time of display are described by the Unicode Bidirectional Algorithm, or 'bidi algorithm' for short.

If you have ever worked on sentences that mix RTL-flowing and LTR-flowing characters, you know that authoring such content is painful in many apps, and that it takes a bit of extra effort to get it to render correctly in browsers. When you have control over the HTML, you can get the text flowing correctly. Understanding how the BIDI algorithm works may help you handle such sentences in HTML even if you don't read the languages involved. It can also help you craft sentences that render fine when you have no control over the HTML (e.g. the subject line in email clients).
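One option for the "no control over the HTML" case is to wrap an embedded fragment in Unicode directional isolate characters, which are part of the bidi algorithm itself. The following is a minimal sketch in Python, not a recommendation from the documents above: the helper name `isolate` is made up for illustration, and whether a given app or email client honors these characters is an assumption you would need to test.

```python
# Unicode directional isolate controls (defined by the bidi algorithm).
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE (direction taken from the fragment itself)
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE (closes any of the three above)


def isolate(fragment: str, direction: str = "auto") -> str:
    """Wrap `fragment` so the surrounding text does not affect its ordering.

    `isolate` is a hypothetical helper; `direction` is "ltr", "rtl" or "auto".
    """
    openers = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    return openers[direction] + fragment + PDI


# An LTR product name embedded in an Arabic sentence.
subject = "مرحبا بكم في " + isolate("CodePen 2.0", "ltr") + "!"
```

When you do control the HTML, the `dir` attribute and the `bdi` element do the same job and are the better choice.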

I'm not going to attempt to explain the algorithm here, but I'd like to highlight some interesting aspects.

Unicode characters have different directional types: strong directional characters (e.g. letters), weak directional characters (e.g. common number separators) or neutral characters (e.g. whitespace). You can find a table containing examples of each type in section 3.2 of this document.
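To see which type a character has, you can query the Unicode Character Database. Here is a small sketch using Python's standard `unicodedata` module. The grouping of bidi classes into strong, weak and neutral follows the usual presentation of the algorithm; the function name `bidi_category` is my own and only for illustration.

```python
import unicodedata

# Bidi classes grouped by type.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch: str) -> str:
    """Return "strong", "weak", "neutral" or "other" for one character."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    # Explicit formatting characters (e.g. isolates) or unassigned code points.
    return "other"


for ch in ["a", "\u0628", "1", ",", " "]:  # U+0628 is ARABIC LETTER BEH
    print(repr(ch), unicodedata.bidirectional(ch), bidi_category(ch))
```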

What's even more interesting is that the Unicode Standard interprets some characters logically. The U+0028 character (HTML entity: &#40;, AKA: "(") is not treated as a "left parenthesis", but rather as an "opening parenthesis", so it is displayed mirrored in RTL text. If you're familiar with CSS logical properties, this concept of logical directions will not be alien to you.
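The same `unicodedata` module in Python exposes this "mirrored" property, so you can check which characters get this logical treatment. This is a minimal sketch; `is_mirrored` is just an illustrative wrapper name.

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """True if the character is flagged as mirrored in bidirectional text."""
    return bool(unicodedata.mirrored(ch))


for ch in ["(", ")", "[", "<", "a"]:
    # The formal name still says LEFT/RIGHT, but the mirrored flag tells
    # renderers to flip the glyph in right-to-left runs.
    print(repr(ch), unicodedata.name(ch), is_mirrored(ch))
```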

See the Pen Logical Unicode characters by Hussein Al Hammad (@hus_hmd) on CodePen.