charClass
Character Classification
Description
An interface to the (C99) wide character classification functions in use.
Usage
charClass(x, class)
Arguments
x | Either a UTF-8-encoded length-1 character vector or an integer vector of Unicode points (or a vector coercible to integer). |
class | A character string, one of those given in the ‘Details’ section. |
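The two accepted forms of x can be compared directly. The following is a minimal sketch, assuming R 4.1.0 or later (where charClass is available in the utils package); the test string is arbitrary and chosen to be pure ASCII so that the result does not depend on the platform.

```r
# One string versus the equivalent vector of Unicode points.
s <- "aZ9 "

from_string <- utils::charClass(s, "alpha")             # length-1 character vector
from_points <- utils::charClass(utf8ToInt(s), "alpha")  # integer code points

print(from_string)
print(identical(from_string, from_points))
```

Passing code points is the convenient form when the characters of interest come from a numeric range rather than from existing text.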
Details
The classification into character classes is platform-dependent. The classes are determined by internal tables on Windows and (optionally but by default) on macOS and AIX.
The character classes are interpreted as follows:
"alnum"
-
Alphabetic or numeric.
"alpha"
-
Alphabetic.
"blank"
-
Space or tab.
"cntrl"
-
Control characters.
"digit"
-
Digits 0-9.
"graph"
-
Graphical characters (printable characters except whitespace).
"lower"
-
Lower-case alphabetic.
"print"
-
Printable characters.
"punct"
-
Punctuation characters. Some platforms treat all non-alphanumeric graphical characters as punctuation.
"space"
-
Whitespace, including tabs, form and line feeds and carriage returns. Some OSes include non-breaking spaces, some exclude them.
"upper"
-
Upper-case alphabetic.
"xdigit"
-
Hexadecimal character, one of 0-9A-Fa-f.
Alphabetic characters contain all lower- and upper-case ones and some others (for example, those in ‘title case’).
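As an illustration of a title-case character, the sketch below queries U+01C5 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON) against three classes. The choice of code point is an assumption made for this example, and no expected output is shown because the answers depend on the platform's classification tables.

```r
# U+01C5 is a title-case letter: neither plainly upper nor plainly lower case.
pt <- 0x01C5
classes <- c("alpha", "upper", "lower")

# One logical value per class, named by class.
res <- vapply(classes, function(cl) utils::charClass(pt, cl), logical(1))
print(res)
```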
Whether a character is printable is used to decide whether to escape it when printing – see the help for print.default.
If x is a character string it should either be ASCII or declared as UTF-8 – see Encoding.
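A string held in another encoding can be converted before classification. This is a hedged sketch: it builds a Latin-1 copy of a short string with iconv purely for demonstration, then uses enc2utf8 to obtain a string declared as UTF-8. How the accented letters are classified may vary by platform and locale.

```r
s <- "\u00e9t\u00e9"                      # \u escapes give a UTF-8 string
latin <- iconv(s, "UTF-8", "latin1")      # demonstration input in Latin-1
print(Encoding(latin))

s2 <- enc2utf8(latin)                     # convert back and declare as UTF-8
print(Encoding(s2))

res <- utils::charClass(s2, "alpha")      # one value per character
print(res)
```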
charClass was added in R 4.1.0. A less direct way to examine character classes, which also worked in earlier versions, is to use something like grepl("[[:print:]]", intToUtf8(x)). However, the regular-expression code might not use the same classification functions as printing, and on macOS it used not to.
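The two approaches can be compared code point by code point. The sketch below does so for the printable ASCII range only, where agreement is expected; the range is an assumption chosen for the example, and wider ranges may reveal platform-specific differences. Note that multiple = TRUE is needed so that each code point becomes its own string for grepl.

```r
pts <- 32:126                                   # space through tilde

via_charClass <- utils::charClass(pts, "print")
via_regex <- grepl("[[:print:]]", intToUtf8(pts, multiple = TRUE))

# Cross-tabulate the two answers; off-diagonal counts would be disagreements.
print(table(charClass = via_charClass, regex = via_regex))
```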
Value
A logical vector whose length is the number of characters or integers in x.
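For a character string, the result has one element per character rather than per string. A small sketch with an arbitrary ASCII string:

```r
s <- "R 4.1.0"
res <- utils::charClass(s, "digit")

print(res)
# The result length matches the character count of the single input string.
print(length(res) == nchar(s, type = "chars"))
```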
Note
Non-ASCII digits are excluded by the C99 standard from the class "digit": most platforms will have them as alphabetic.
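This can be checked on a given platform with a sketch like the following, which compares the ASCII digit 1 with two non-ASCII digits (U+0661, ARABIC-INDIC DIGIT ONE, and U+FF11, FULLWIDTH DIGIT ONE). The code points are chosen for illustration only, and the rows for the non-ASCII digits may differ between platforms.

```r
pts <- as.integer(c(0x31, 0x0661, 0xFF11))

res <- data.frame(
  digit = utils::charClass(pts, "digit"),
  alpha = utils::charClass(pts, "alpha"),
  alnum = utils::charClass(pts, "alnum"),
  row.names = sprintf("U+%04X", pts)   # label rows by code point
)
print(res)
```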
It is an assumption that the system's wide character classification functions are coded in Unicode points, but this is known to be true for all recent platforms.
In principle the classification could depend on the locale even on one platform, but this no longer seems to occur in practice.
See Also
Character classes are used in regular expressions.
The OS's man pages for iswctype and wctype.
Examples
x <- c(48:70, 32, 0xa0) # Last is non-breaking space
cl <- c("alnum", "alpha", "blank", "digit", "graph", "punct", "upper", "xdigit")
X <- lapply(cl, function(y) charClass(x,y)); names(X) <- cl
X <- as.data.frame(X); row.names(X) <- sQuote(intToUtf8(x, multiple = TRUE))
X

charClass("ABC123", "alpha")
## Some accented capital Greek characters
(x <- "\u0386\u0388\u0389")
charClass(x, "upper")

## How many printable characters are there? (Around 280,000 in Unicode 13.)
## There are 2^21-1 possible Unicode points (most not yet assigned).
pr <- charClass(1:0x1fffff, "print")
table(pr)
Copyright (©) 1999–2012 R Foundation for Statistical Computing.
Licensed under the GNU General Public License.