icuSetCollate
Setup Collation by ICU
Description
Controls the way collation is done by ICU (an optional part of the R build).
Usage
icuSetCollate(...)
icuGetCollate(type = c("actual", "valid"))
Arguments
- ...: Named arguments, see ‘Details’.
- type: character string: can be abbreviated. Either the actual locale in use for collation or the most specific locale which would be valid.
Details
Optionally, R can be built to collate character strings by ICU (http://site.icu-project.org). For such systems, icuSetCollate can be used to tune the way collation is done. On other builds calling this function does nothing, with a warning.
Possible arguments are
- locale: A character string such as "da_DK" giving the language and country whose collation rules are to be used. If present, this should be the first argument.
- case_first: "upper", "lower" or "default", asking for upper- or lower-case characters to be sorted first. The default is usually lower-case first, but not in all languages (not under the default settings for Danish, for example).
- alternate_handling: Controls the handling of ‘variable’ characters (mainly punctuation and symbols). Possible values are "non_ignorable" (primary strength) and "shifted" (quaternary strength).
- strength: Which components should be used? Possible values "primary", "secondary", "tertiary" (default), "quaternary" and "identical".
- french_collation: In a French locale the way accents affect collation is from right to left, whereas in most other locales it is from left to right. Possible values "on", "off" and "default".
- normalization: Should strings be normalized? Possible values are "on" and "off" (default). This affects the collation of composite characters.
- case_level: An additional level between secondary and tertiary, used to distinguish large and small Japanese Kana characters. Possible values "on" and "off" (default).
- hiragana_quaternary: Possible values "on" (sort Hiragana first at quaternary level) and "off".
Only the first three are likely to be of interest except to those with a detailed understanding of collation and specialized requirements.
Some special values are accepted for locale:
- "none": ICU is not used for collation: the OS's collation services are used instead.
- "ASCII": ICU is not used for collation: the C function strcmp is used instead, which should sort byte-by-byte in (unsigned) numerical order.
- "default": obtains the locale from the OS as is done at the start of the session. If environment variable R_ICU_LOCALE is set to a non-empty value, its value is used rather than consulting the OS, unless environment variable LC_ALL is set to 'C' (or unset but LC_COLLATE is set to 'C').
- "", "root": the ‘root’ collation: see https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation.
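As a minimal sketch of how two of these special values differ, the following compares byte order ("ASCII") with the root collation on a small vector. The vector and the variable names by_bytes and by_root are illustrative only, and the code does nothing on builds without ICU.

```r
## Illustrative data: mixed-case ASCII strings.
x <- c("banana", "Apple", "apple", "Banana")
by_bytes <- by_root <- character()

if (capabilities("ICU")) {
    icuSetCollate(locale = "ASCII")    # strcmp: byte-by-byte order
    by_bytes <- sort(x)                # upper-case letters precede lower-case in byte order
    icuSetCollate(locale = "root")     # ICU root collation
    by_root <- sort(x)
    icuSetCollate(locale = "default")  # reset to the default, as in the Examples
    print(by_bytes)
    print(by_root)
}
```

Both results contain the same strings; only their order is expected to differ.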
For the specifications of ‘real’ ICU locales, see http://userguide.icu-project.org/locale. Note that ICU does not report that a locale is not supported, but falls back to its idea of ‘best fit’ (which could be rather different and is reported by icuGetCollate("actual"), often "root"). Most English locales fall back to "root" as although e.g. "en_GB" is a valid locale (at least on some platforms), it contains no special rules for collation. Note that "C" is not a supported ICU locale and hence R_ICU_LOCALE should never be set to "C".
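Because ICU falls back silently, it can be worth checking what was actually selected after requesting a locale. The sketch below requests "en_GB" and prints both reports; the variable names actual and valid are illustrative, and the values printed depend on the platform and ICU version.

```r
actual <- valid <- NA_character_

if (capabilities("ICU")) {
    icuSetCollate(locale = "en_GB")
    actual <- icuGetCollate("actual")  # locale whose collation rules are really in use
    valid  <- icuGetCollate("valid")   # most specific locale which would be valid
    icuSetCollate(locale = "default")  # reset to the default
    cat("requested: en_GB\nactual:   ", actual, "\nvalid:    ", valid, "\n")
}
```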
Some examples are case_level = "on", strength = "primary" to ignore accent differences and alternate_handling = "shifted" to ignore space and punctuation characters.
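A rough sketch of these two settings follows, using the root collation as a neutral starting point (an assumption made for the illustration, not a requirement). order() is used rather than sort() because it leaves unresolved ties in their original order, so strings that compare equal should keep their input order. The data and the list res are illustrative.

```r
## "\u00f4" is o with a circumflex accent.
x <- c("r\u00f4le", "role", "co-op", "coop")
res <- list()

if (capabilities("ICU")) {
    icuSetCollate(locale = "root")
    res$default <- x[order(x)]          # default (tertiary) strength

    ## Ignore accent differences.
    icuSetCollate(case_level = "on", strength = "primary")
    res$primary <- x[order(x)]

    ## Naming the locale again resets the earlier customizations.
    icuSetCollate(locale = "root", alternate_handling = "shifted")
    res$shifted <- x[order(x)]          # punctuation such as "-" ignored

    icuSetCollate(locale = "default")   # reset to the default
    print(res)
}
```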
Initially ICU will not be used for collation if the OS is set to use the C locale for collation and R_ICU_LOCALE is not set. Once this function is called with a value for locale, ICU will be used until it is called again with locale = "none". ICU will not be used once Sys.setlocale is called with a "C" value for LC_ALL or LC_COLLATE, even if R_ICU_LOCALE is set. ICU will be used again honoring R_ICU_LOCALE once Sys.setlocale is called to set a different collation order. Environment variables LC_ALL (or LC_COLLATE) take precedence over R_ICU_LOCALE if and only if they are set to 'C'. Due to the interaction with other ways of setting the collation order, R_ICU_LOCALE should be used with care and only when needed.
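The interaction with Sys.setlocale described above can be explored with a sketch like the following. It only prints what icuGetCollate() reports at each step and makes no claim about the exact strings; the variable old is illustrative, and restoring the previous LC_COLLATE value is assumed to succeed on the platform in use.

```r
old <- Sys.getlocale("LC_COLLATE")

if (capabilities("ICU")) {
    Sys.setlocale("LC_COLLATE", "C")   # per the text above, ICU should now be unused
    print(icuGetCollate())

    icuSetCollate(locale = "da_DK")    # an explicit locale switches ICU on again
    print(icuGetCollate())

    icuSetCollate(locale = "none")     # hand collation back to the OS
    print(icuGetCollate())

    Sys.setlocale("LC_COLLATE", old)   # restore the previous collation category
}
```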
All customizations are reset to the default for the locale if locale is specified: the collation engine is reset if the OS collation locale category is changed by Sys.setlocale.
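To illustrate the reset, this sketch sets case_first = "upper" together with the root locale and then names the locale again without any customization. The choice of "root" and the variable names customised and reset are assumptions made for the illustration.

```r
x <- c("apple", "Apple")
customised <- reset <- character()

if (capabilities("ICU")) {
    icuSetCollate(locale = "root", case_first = "upper")
    customised <- sort(x)              # upper case sorted first
    icuSetCollate(locale = "root")     # naming the locale drops case_first = "upper"
    reset <- sort(x)                   # back to the locale's default case ordering
    icuSetCollate(locale = "default")  # reset to the default
    print(customised)
    print(reset)
}
```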
Value
For icuGetCollate, a character string describing the ICU locale in use (which may be reported as "ICU not in use"). The ‘actual’ locale may be simpler than the requested locale: for example "da" rather than "da_DK". English locales are likely to report "root".
Note
ICU is used by default wherever it is available: this includes macOS, Solaris and many Linux installations. As it works internally in UTF-8, it will be most efficient in UTF-8 locales.
It is optional on Windows: if R has been built against ICU, it will only be used if environment variable R_ICU_LOCALE is set or once icuSetCollate is called to select the locale (as ICU and Windows differ in their idea of locale names). Note that icuSetCollate(locale = "default") should work reasonably well for R >= 3.2.0 and Windows Vista/Server 2008 and later (but finds the system default ignoring environment variables such as LC_COLLATE).
See Also
capabilities for whether ICU is available; extSoftVersion for its version.
The ICU user guide chapter on collation (http://userguide.icu-project.org/collation).
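A short sketch of how the two functions mentioned above might be combined to report on ICU support in the running session. The variable names has_icu and icu_version are illustrative.

```r
has_icu <- capabilities("ICU")[["ICU"]]   # TRUE if this build can collate by ICU
icu_version <- extSoftVersion()[["ICU"]]  # "" when ICU is not in use

cat("ICU available:", has_icu, "\n")
cat("ICU version:  ", if (nzchar(icu_version)) icu_version else "(none)", "\n")
```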
Examples
## These examples depend on having ICU available, and on the locale.
## As we don't know the current settings, we can only reset to the default.
if(capabilities("ICU")) withAutoprint({
    icuGetCollate()
    icuGetCollate("valid")
    x <- c("Aarhus", "aarhus", "safe", "test", "Zoo")
    sort(x)
    icuSetCollate(case_first = "upper"); sort(x)
    icuSetCollate(case_first = "lower"); sort(x)
    ## Danish collates upper-case-first and with 'aa' as a single letter
    icuSetCollate(locale = "da_DK", case_first = "default"); sort(x)
    ## Estonian collates Z between S and T
    icuSetCollate(locale = "et_EE"); sort(x)
    icuSetCollate(locale = "default"); icuGetCollate("valid")
})
Copyright (©) 1999–2012 R Foundation for Statistical Computing.
Licensed under the GNU General Public License.