Character Properties
A character property is a named attribute of a character that specifies how the character behaves and how it should be handled during text processing and display. Thus, character properties are an important part of specifying the character’s semantics.
On the whole, Emacs follows the Unicode Standard in its implementation of character properties. In particular, Emacs supports the Unicode Character Property Model, and the Emacs character property database is derived from the Unicode Character Database (UCD). See the Character Properties chapter of the Unicode Standard, for a detailed description of Unicode character properties and their meaning. This section assumes you are already familiar with that chapter of the Unicode Standard, and want to apply that knowledge to Emacs Lisp programs.
In Emacs, each property has a name, which is a symbol, and a set of possible values, whose types depend on the property; if a character does not have a certain property, the value is nil. As a general rule, the names of character properties in Emacs are produced from the corresponding Unicode properties by downcasing them and replacing each ‘_’ character with a dash ‘-’. For example, Canonical_Combining_Class becomes canonical-combining-class. However, sometimes we shorten the names to make their use easier.
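The general naming rule can be sketched in a few lines of Emacs Lisp. The function name my-unicode-prop-to-emacs-symbol is hypothetical and not part of Emacs; it applies only the downcase-and-dash rule, so it cannot predict the shortened names mentioned above.

```elisp
(defun my-unicode-prop-to-emacs-symbol (unicode-name)
  "Return the symbol the general naming rule gives for UNICODE-NAME.
Hypothetical helper: it only downcases the name and replaces `_'
with `-', so it does not know about names that Emacs shortens."
  (intern (replace-regexp-in-string "_" "-" (downcase unicode-name))))

(my-unicode-prop-to-emacs-symbol "Canonical_Combining_Class")
;; ⇒ canonical-combining-class
```

A result from this helper is only a candidate name; check it against the list of properties in this section before using it.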
Some codepoints are left unassigned by the UCD—they don’t correspond to any character. The Unicode Standard defines default values of properties for such codepoints; they are mentioned below for each property.
Here is the full list of value types for all the character properties that Emacs knows about:
name
Corresponds to the Name Unicode property. The value is a string consisting of upper-case Latin letters A to Z, digits, spaces, and hyphen ‘-’ characters. For unassigned codepoints, the value is nil.
general-category
Corresponds to the General_Category Unicode property. The value is a symbol whose name is a 2-letter abbreviation of the character’s classification. For unassigned codepoints, the value is Cn.
canonical-combining-class
Corresponds to the Canonical_Combining_Class Unicode property. The value is an integer. For unassigned codepoints, the value is zero.
bidi-class
Corresponds to the Unicode Bidi_Class property. The value is a symbol whose name is the Unicode directional type of the character. Emacs uses this property when it reorders bidirectional text for display (see Bidirectional Display). For unassigned codepoints, the value depends on the code blocks to which the codepoint belongs: most unassigned codepoints get the value of L (strong L), but some get values of AL (Arabic letter) or R (strong R).
decomposition
Corresponds to the Unicode properties Decomposition_Type and Decomposition_Value. The value is a list, whose first element may be a symbol representing a compatibility formatting tag, such as small; the other elements are characters that give the compatibility decomposition sequence of this character. For characters that don’t have decomposition sequences, and for unassigned codepoints, the value is a list with a single member, the character itself.
decimal-digit-value
Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Decimal’. The value is an integer, or nil if the character has no decimal digit value. For unassigned codepoints, the value is nil, which means NaN, or “not a number”.
digit-value
Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Digit’. The value is an integer. Examples of such characters include compatibility subscript and superscript digits, for which the value is the corresponding number. For characters that don’t have any numeric value, and for unassigned codepoints, the value is nil, which means NaN.
numeric-value
Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Numeric’. The value of this property is a number. Examples of characters that have this property include fractions, subscripts, superscripts, Roman numerals, currency numerators, and encircled numbers. For example, the value of this property for the character U+2155 VULGAR FRACTION ONE FIFTH is 0.2. For characters that don’t have any numeric value, and for unassigned codepoints, the value is nil, which means NaN.
mirrored
Corresponds to the Unicode Bidi_Mirrored property. The value of this property is a symbol, either Y or N. For unassigned codepoints, the value is N.
mirroring
Corresponds to the Unicode Bidi_Mirroring_Glyph property. The value of this property is a character whose glyph represents the mirror image of the character’s glyph, or nil if there’s no defined mirroring glyph. All the characters whose mirrored property is N have nil as their mirroring property; however, some characters whose mirrored property is Y also have nil for mirroring, because no appropriate characters exist with mirrored glyphs. Emacs uses this property to display mirror images of characters when appropriate (see Bidirectional Display). For unassigned codepoints, the value is nil.
paired-bracket
Corresponds to the Unicode Bidi_Paired_Bracket property. The value of this property is the codepoint of a character’s paired bracket, or nil if the character is not a bracket character. This establishes a mapping between characters that are treated as bracket pairs by the Unicode Bidirectional Algorithm; Emacs uses this property when it decides how to reorder parentheses, braces, and other similar characters for display (see Bidirectional Display).
bracket-type
Corresponds to the Unicode Bidi_Paired_Bracket_Type property. For characters whose paired-bracket property is non-nil, the value of this property is a symbol, either o (for opening bracket characters) or c (for closing bracket characters). For characters whose paired-bracket property is nil, the value is the symbol n (None). Like paired-bracket, this property is used for bidirectional display.
old-name
Corresponds to the Unicode Unicode_1_Name property. The value is a string. For unassigned codepoints, and characters that have no value for this property, the value is nil.
iso-10646-comment
Corresponds to the Unicode ISO_Comment property. The value is either a string or nil. For unassigned codepoints, the value is nil.
uppercase
Corresponds to the Unicode Simple_Uppercase_Mapping property. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself.
lowercase
Corresponds to the Unicode Simple_Lowercase_Mapping property. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself.
titlecase
Corresponds to the Unicode Simple_Titlecase_Mapping property. Title case is a special form of a character used when the first character of a word needs to be capitalized. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself.
special-uppercase
Corresponds to Unicode language- and context-independent special upper-casing rules. The value of this property is a string (which may be empty). For example, the mapping for U+00DF LATIN SMALL LETTER SHARP S is "SS". For characters with no special mapping, the value is nil, which means the uppercase property needs to be consulted instead.
special-lowercase
Corresponds to Unicode language- and context-independent special lower-casing rules. The value of this property is a string (which may be empty). For example, the mapping for U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is "i\u0307" (i.e., a 2-character string consisting of LATIN SMALL LETTER I followed by U+0307 COMBINING DOT ABOVE). For characters with no special mapping, the value is nil, which means the lowercase property needs to be consulted instead.
special-titlecase
Corresponds to Unicode unconditional special title-casing rules. The value of this property is a string (which may be empty). For example, the mapping for U+FB01 LATIN SMALL LIGATURE FI is "Fi". For characters with no special mapping, the value is nil, which means the titlecase property needs to be consulted instead.
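The fallback order described for the case-mapping properties (a special mapping first, then the simple mapping, then the character itself) can be sketched as follows. The function my-char-full-uppercase is a hypothetical helper written for this illustration, not an Emacs function, and it ignores language- and context-dependent rules.

```elisp
(defun my-char-full-uppercase (char)
  "Return the upper-case form of CHAR as a string.
Hypothetical helper: use `special-uppercase' when it is non-nil,
otherwise the `uppercase' character, otherwise CHAR itself."
  (or (get-char-code-property char 'special-uppercase)
      ;; A nil `uppercase' value means the character maps to itself.
      (string (or (get-char-code-property char 'uppercase) char))))

(my-char-full-uppercase ?\N{LATIN SMALL LETTER SHARP S})
;; ⇒ "SS"
(my-char-full-uppercase ?a)
;; ⇒ "A"
(my-char-full-uppercase ?1)
;; ⇒ "1"
```

Returning a string in every case keeps the result type uniform, since a special mapping may be longer than one character or empty.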
- Function: get-char-code-property char propname
This function returns the value of char’s propname property.
(get-char-code-property ?\s 'general-category) ⇒ Zs
(get-char-code-property ?1 'general-category) ⇒ Nd
;; U+2084
(get-char-code-property ?\N{SUBSCRIPT FOUR} 'digit-value) ⇒ 4
;; U+2155
(get-char-code-property ?\N{VULGAR FRACTION ONE FIFTH} 'numeric-value) ⇒ 0.2
;; U+2163
(get-char-code-property ?\N{ROMAN NUMERAL FOUR} 'numeric-value) ⇒ 4
(get-char-code-property ?\( 'paired-bracket) ⇒ 41 ;; closing parenthesis
(get-char-code-property ?\) 'bracket-type) ⇒ c
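When several properties of one character are needed at once, the calls can be gathered into an association list. This is a minimal sketch; my-char-properties is a hypothetical helper name, not something Emacs provides.

```elisp
(defun my-char-properties (char props)
  "Return an alist of (PROP . VALUE) pairs for CHAR.
PROPS is a list of property name symbols.  Hypothetical helper."
  (mapcar (lambda (prop)
            (cons prop (get-char-code-property char prop)))
          props))

(my-char-properties ?A '(name general-category bidi-class mirrored))
;; ⇒ ((name . "LATIN CAPITAL LETTER A") (general-category . Lu)
;;    (bidi-class . L) (mirrored . N))
```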
- Function: char-code-property-description prop value
This function returns the description string of property prop’s value, or nil if value has no description.
(char-code-property-description 'general-category 'Zs) ⇒ "Separator, Space"
(char-code-property-description 'general-category 'Nd) ⇒ "Number, Decimal Digit"
(char-code-property-description 'numeric-value '1/5) ⇒ nil
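The two functions combine naturally: look up a property value, then ask for its description. The sketch below does this for the general category. The name my-describe-general-category is hypothetical, and falling back to the symbol's name when no description exists is a choice made for this example.

```elisp
(defun my-describe-general-category (char)
  "Return a readable description of CHAR's general category.
Hypothetical helper: when `char-code-property-description' returns
nil, fall back to the name of the category symbol."
  (let ((category (get-char-code-property char 'general-category)))
    (or (char-code-property-description 'general-category category)
        (symbol-name category))))

(my-describe-general-category ?\s)
;; ⇒ "Separator, Space"
(my-describe-general-category ?1)
;; ⇒ "Number, Decimal Digit"
```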
- Function: put-char-code-property char propname value
This function stores value as the value of the property propname for the character char.
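A short round trip shows the intended use. The property name my-demo-note is made up for this sketch, and the example assumes that Emacs accepts a property name that is not one of the Unicode-derived properties listed above, which this section does not state explicitly.

```elisp
;; `my-demo-note' is a hypothetical property name used only here;
;; storing under it leaves the Unicode-derived properties untouched.
(put-char-code-property ?x 'my-demo-note "reviewed")

(get-char-code-property ?x 'my-demo-note)
;; ⇒ "reviewed"
```

Using a private property name avoids overwriting values that come from the Unicode Character Database, which other code may rely on.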
- Variable: unicode-category-table
The value of this variable is a char-table (see Char-Tables) that specifies, for each character, its Unicode General_Category property as a symbol.
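Because the value is a char-table, a character's category can be read with aref. As an illustrative sketch, the hypothetical helper my-count-uppercase-letters below counts the characters of a string whose category is Lu (uppercase letter).

```elisp
(defun my-count-uppercase-letters (string)
  "Count the characters in STRING whose general category is Lu.
Hypothetical helper that reads `unicode-category-table' directly."
  (let ((count 0))
    (dolist (char (string-to-list string) count)
      (when (eq (aref unicode-category-table char) 'Lu)
        (setq count (1+ count))))))

(my-count-uppercase-letters "GNU Emacs")
;; ⇒ 4
```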
- Variable: char-script-table
The value of this variable is a char-table that specifies, for each character, a symbol whose name is the script to which the character belongs, according to the Unicode Standard classification of the Unicode code space into script-specific blocks. This char-table has a single extra slot whose value is the list of all script symbols.
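As a sketch of how this table might be used, the hypothetical helper my-string-scripts below collects the distinct scripts of the characters in a string, in order of first appearance. The list of all known scripts is read from extra slot 0, on the assumption that the single extra slot mentioned above has index 0.

```elisp
(defun my-string-scripts (string)
  "Return the distinct script symbols of the characters in STRING.
Hypothetical helper; scripts appear in order of first occurrence."
  (let (scripts)
    (dolist (char (string-to-list string) (nreverse scripts))
      (let ((script (aref char-script-table char)))
        (unless (memq script scripts)
          (push script scripts))))))

(my-string-scripts "a\N{GREEK SMALL LETTER ALPHA}")
;; ⇒ (latin greek)

;; Check that a symbol names a known script (assumes extra slot 0).
(car (memq 'greek (char-table-extra-slot char-script-table 0)))
;; ⇒ greek
```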
- Variable: char-width-table
The value of this variable is a char-table that specifies the width of each character in columns that it will occupy on the screen.
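The following sketch sums the table entries for the characters of a string. The helper name my-naive-string-columns is hypothetical, and the result is only an approximation of what appears on screen: it ignores tabs, display tables, and composed characters, and it treats any non-integer entry as one column (an assumption made for this example).

```elisp
(defun my-naive-string-columns (string)
  "Sum the `char-width-table' entries for the characters of STRING.
Hypothetical helper; an entry that is not an integer counts as 1."
  (let ((columns 0))
    (dolist (char (string-to-list string) columns)
      (let ((width (aref char-width-table char)))
        (setq columns (+ columns (if (integerp width) width 1)))))))

(my-naive-string-columns "abc")
;; ⇒ 3
;; U+4E2D is a CJK ideograph, which occupies two columns.
(my-naive-string-columns "a\u4e2d")
;; ⇒ 3
```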
- Variable: printable-chars
The value of this variable is a char-table that specifies, for each character, whether it is printable or not. That is, if evaluating (aref printable-chars char) results in t, the character is printable, and if it results in nil, it is not.
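One possible use is to filter a string down to its printable characters. The helper my-strip-nonprintable is a hypothetical name for this sketch.

```elisp
(defun my-strip-nonprintable (string)
  "Return a copy of STRING without its non-printable characters.
Hypothetical helper: keep a character only when its entry in
`printable-chars' is non-nil."
  (apply #'string
         (delq nil (mapcar (lambda (char)
                             (and (aref printable-chars char) char))
                           string))))

;; Character 7 is the ASCII BEL control character.
(my-strip-nonprintable (string ?a 7 ?b))
;; ⇒ "ab"
```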
Copyright © 1990-1996, 1998-2021 Free Software Foundation, Inc.
Licensed under the GNU GPL license.
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html