iconv
Convert Character Vector between Encodings
Description
This uses system facilities to convert a character vector between encodings: the ‘i’ stands for ‘internationalization’.
Usage
iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE)
iconvlist()
Arguments
x | A character vector, or an object to be converted to a character vector by |
from | A character string describing the current encoding. |
to | A character string describing the target encoding. |
sub | character string. If not |
mark | logical, for expert use. Should encodings be marked? |
toRaw | logical. Should a list of raw vectors be returned rather than a character vector? |
Details
The names of encodings and which ones are available are platform-dependent. All R platforms support "" (for the encoding of the current locale), "latin1" and "UTF-8". Generally case is ignored when specifying an encoding.
On most platforms iconvlist provides an alphabetical list of the supported encodings. On others, the information is on the man page for iconv(5) or elsewhere in the man pages (but beware that the system command iconv may not support the same set of encodings as the C functions R calls). Unfortunately, the names are rarely supported across all platforms.
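Because the available names differ between platforms, it can help to check a name before relying on it. The following is a minimal sketch only: encoding_available is a hypothetical helper (not part of R), it assumes that iconvlist() signals an error where it is unavailable, and it compares names case-insensitively. A FALSE result is not proof that iconv will reject the name, since the listing need not include every alias that is accepted.

```r
# Hypothetical helper: is `name` among the encodings reported by iconvlist()?
# Returns NA when iconvlist() is not available on this platform.
encoding_available <- function(name) {
  known <- tryCatch(iconvlist(), error = function(e) NULL)
  if (is.null(known)) return(NA)
  # Encoding names are generally matched without regard to case.
  toupper(name) %in% toupper(known)
}

encoding_available("UTF-8")
encoding_available("latin-9")
```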
Elements of x which cannot be converted (perhaps because they are invalid or because they cannot be represented in the target encoding) will be returned as NA unless sub is specified.
Most versions of iconv will allow transliteration by appending //TRANSLIT to the to encoding: see the examples.
Encoding "ASCII" is accepted, and on most systems "C" and "POSIX" are synonyms for ASCII.
Any encoding bits (see Encoding) on elements of x are ignored: they will always be translated as if from encoding from even if declared otherwise. enc2native and enc2utf8 provide alternatives which do take declared encodings into account.
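To illustrate the difference, the sketch below converts the same Latin-1 string in two ways. With iconv the source encoding must be supplied through from, whereas enc2utf8 reads the encoding declared on each element. The variable names are illustrative only.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"        # declare the bytes as Latin-1

# iconv() relies on `from`, not on the declared encoding of x
via_iconv <- iconv(x, from = "latin1", to = "UTF-8")

# enc2utf8() uses the declared encoding of each element
via_enc2utf8 <- enc2utf8(x)

charToRaw(via_iconv)
charToRaw(via_enc2utf8)
identical(charToRaw(via_iconv), charToRaw(via_enc2utf8))
```

Both calls should yield the same UTF-8 bytes here because from matches the declared encoding; had from been given incorrectly, iconv would have used it regardless of the declaration.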
Note that implementations of iconv typically do not do much validity checking and will often mis-convert inputs which are invalid in encoding from.
If sub = "Unicode" is used for a non-UTF-8 input it is the same as sub = "byte".
Value
If toRaw = FALSE (the default), the value is a character vector of the same length and the same attributes as x (after conversion to a character vector).
If mark = TRUE (the default) the elements of the result have a declared encoding if to is "latin1" or "UTF-8", or if to = "" and the current locale's encoding is detected as Latin-1 (or its superset CP1252 on Windows) or UTF-8.
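As a small sketch of what marking means in practice, the following compares the declared encoding of the result with and without mark. The bytes produced are the same in both cases; only the declaration differs. The variable names are illustrative.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

marked   <- iconv(x, "latin1", "UTF-8")                # mark = TRUE is the default
unmarked <- iconv(x, "latin1", "UTF-8", mark = FALSE)

Encoding(marked)      # declared encoding of the marked result
Encoding(unmarked)    # no declared encoding
identical(charToRaw(marked), charToRaw(unmarked))
```

An unmarked result is treated as being in the native encoding when printed, so it may display incorrectly in a locale that is not UTF-8.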
If toRaw = TRUE, the value is a list of the same length and the same attributes as x whose elements are either NULL (if conversion fails) or a raw vector.
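The following sketch shows one way to work with the list form: convert to raw vectors, find any elements that failed, and pass the list back to iconv to recover a character vector. The names raw_list, failed and round_trip are illustrative only.

```r
x <- c("abc", "fa\xE7ile")
Encoding(x) <- "latin1"

# One list element per element of x
raw_list <- iconv(x, "latin1", "UTF-8", toRaw = TRUE)
str(raw_list)

# Elements that could not be converted are NULL
failed <- vapply(raw_list, is.null, logical(1))
failed

# A list of raw vectors can itself be used as input
round_trip <- iconv(raw_list, "UTF-8", "latin1")
round_trip
```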
For iconvlist(), a character vector (typically of a few hundred elements) of known encoding names.
Implementation Details
There are three main implementations of iconv in use. Linux's C runtime glibc contains one. Several platforms supply GNU libiconv, including macOS, FreeBSD and Cygwin, in some cases with additional encodings. On Windows we use a version of Yukihiro Nakadaira's win_iconv, which is based on Windows' codepages. (We have added many encoding names for compatibility with other systems.) All three have iconvlist, ignore case in encoding names and support //TRANSLIT (but with different results, and for win_iconv currently a ‘best fit’ strategy is used except for to = "ASCII").
Most commercial Unixes contain an implementation of iconv but none we have encountered have supported the encoding names we need: the ‘R Installation and Administration’ manual recommends installing GNU libiconv on Solaris and AIX, for example.
There are other implementations, e.g. NetBSD has used one from the Citrus project (which does not support //TRANSLIT) and there is an older FreeBSD port (libiconv is usually used there): it has not been reported whether or not these work with R.
Note that you cannot rely on invalid inputs being detected, especially for to = "ASCII" where some implementations allow 8-bit characters and pass them through unchanged or with transliteration.
Some of the implementations have interesting extra encodings: for example GNU libiconv allows to = "C99" to use \uxxxx escapes for non-ASCII characters.
Byte Order Marks
Byte order marks are most commonly known as ‘BOMs’.
Encodings using character units which are more than one byte in size can be written on a file in either big-endian or little-endian order: this applies most commonly to UCS-2, UTF-16 and UTF-32/UCS-4 encodings. Some systems will write the Unicode character U+FEFF at the beginning of a file in these encodings and perhaps also in UTF-8. In that usage the character is known as a BOM, and should be handled during input (see the ‘Encodings’ section under connection: re-encoded connections have some special handling of BOMs). The rest of this section applies when this has not been done so x starts with a BOM.
Implementations will generally interpret a BOM for from given as one of "UCS-2", "UTF-16" and "UTF-32". Implementations differ in how they treat BOMs in x in other from encodings: they may be discarded, returned as character U+FEFF or regarded as invalid.
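A small sketch of the two cases, built from raw bytes so that the BOM is explicit. The input is supplied as a list of raw vectors, as in the last example on this page. No output is shown because, as described above, the treatment of the BOM in the second call depends on the implementation.

```r
bom_le  <- as.raw(c(0xFF, 0xFE))               # U+FEFF in little-endian byte order
text_le <- as.raw(c(0x61, 0x00, 0x62, 0x00))   # "ab" encoded as UTF-16LE
with_bom <- list(c(bom_le, text_le))

# `from` without an explicit byte order: the BOM is generally interpreted
res_generic <- iconv(with_bom, "UTF-16", "UTF-8")
res_generic

# `from` with an explicit byte order: the BOM may be discarded,
# returned as U+FEFF or regarded as invalid
res_explicit <- iconv(with_bom, "UTF-16LE", "UTF-8")
res_explicit
```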
Note
The only reasonably portable name for the ISO 8859-15 encoding, commonly known as ‘Latin 9’, is "latin-9": some platforms support "latin9" but GNU libiconv does not.
Encoding names "utf8", "mac" and "macroman" are not portable. "utf8" is converted to "UTF-8" for from and to by iconv, but not for e.g. fileEncoding arguments. "macintosh" is the official (and most widely supported) name for ‘Mac Roman’ (https://en.wikipedia.org/wiki/Mac_OS_Roman).
See Also
Examples
## In principle, as not all systems have iconvlist
try(utils::head(iconvlist(), n = 50))

## Not run: 
## convert from Latin-2 to UTF-8: two of the glibc iconv variants.
iconv(x, "ISO_8859-2", "UTF-8")
iconv(x, "LATIN2", "UTF-8")

## End(Not run)

## Both x below are in latin1 and will only display correctly in a
## locale that can represent and display latin1.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
charToRaw(xx <- iconv(x, "latin1", "UTF-8"))
xx

iconv(x, "latin1", "ASCII")             # NA
iconv(x, "latin1", "ASCII", "?")        # "fa?ile"
iconv(x, "latin1", "ASCII", "")         # "faile"
iconv(x, "latin1", "ASCII", "byte")     # "fa<e7>ile"
iconv(xx, "UTF-8", "ASCII", "Unicode")  # "fa<U+00E7>ile"

## Extracts from old R help files (they are nowadays in UTF-8)
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x
try(iconv(x, "latin1", "ASCII//TRANSLIT"))  # platform-dependent
iconv(x, "latin1", "ASCII", sub = "byte")

## and for Windows' 'Unicode'
str(xx <- iconv(x, "latin1", "UTF-16LE", toRaw = TRUE))
iconv(xx, "UTF-16LE", "UTF-8")
Copyright (©) 1999–2012 R Foundation for Statistical Computing.
Licensed under the GNU General Public License.