Unicode Syntax
Elixir supports Unicode throughout the language.
Quoted identifiers, such as strings ("olá"
) and charlists ('olá'
), support Unicode since Elixir v1.0. Strings are UTF-8 encoded. Charlists are lists of Unicode codepoints. In such cases, the contents are kept as written by developers, without any transformation.
Elixir also supports Unicode in identifiers since Elixir v1.5, as defined in the Unicode Annex #31. The focus of this document is to describe how Elixir implements the requirements outlined in the Unicode Annex. These requirements are refered to as R1, R6 and so on.
To check the Unicode version of your current Elixir installation, run String.Unicode.version()
.
R1. Default Identifiers
Elixir identifiers are specified as:
<Identifier> := <Start> <Continue>* <Ending>?
where <Start>
is:
characters derived from the Unicode General Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus
Other_ID_Start
, minusPattern_Syntax
andPattern_White_Space
code pointsIn set notation:
[[:L:][:Nl:][:Other_ID_Start:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]
and <Continue>
is:
ID_Start characters, plus characters having the Unicode General Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus
Other_ID_Continue
, minusPattern_Syntax
andPattern_White_Space
code points.In set notation:
[[:ID_Start:][:Mn:][:Mc:][:Nd:][:Pc:][:Other_ID_Continue:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]
<Ending>
is an addition specific to Elixir that includes the codepoints ? (003F) and ! (0021).
Elixir also implements requirements R1a for security and R1b for backwards compatibility.
Atoms
Atoms in Elixir follow the identifier rule above with the following modifications:
-
<Start>
includes the codepoint _ (005F) -
<Continue>
includes the codepoint @ (0040)
Variables
Variables in Elixir follow the identifier rule above with the following modifications:
-
<Start>
includes the codepoint _ (005F) -
<Start>
must not include Lu (letter uppercase) and Lt (letter titlecase) characters
R3. Pattern_White_Space and Pattern_Syntax Characters
Elixir supports only codepoints \t (0009), \n (000A), \r (000D) and \s (0020) as whitespace and therefore does not follow requirement R3. R3 requires a wider variety of whitespace and syntax characters to be supported.
R6. Filtered Normalized Identifiers
Identifiers in Elixir are case sensitive.
Elixir requires all atoms and variables to be in NFC form. Any other form will fail with a relevant error message. Quoted-atoms and variables can, however, be in any form and are not verified by the parser.
In other words, the atom :josé
can only be written with the codepoints 006A 006F 0073 00E9. Using another normalization form will lead to a tokenizer error. On the other hand, :"josé"
may be written as 006A 006F 0073 00E9 or 006A 006F 0073 0065 0301, since it is written in quotes.
Choosing requirement R6 automatically excludes requirements R4, R5 and R7.
© 2012 Plataformatec
Licensed under the Apache License, Version 2.0.
https://hexdocs.pm/elixir/1.6.6/unicode-syntax.html