[Java] Class CharsetToolkit
- groovy.util.CharsetToolkit
public class CharsetToolkit extends Object
Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian) or UTF-8 files with a Byte Order Marker are correctly discovered. For UTF-8 files with no BOM, if the buffer is large enough, the charset should also be discovered.
A 4 KB byte buffer is read from the file in order to guess the encoding.
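The BOM-less UTF-8 case can be illustrated with a strict decoder applied to a bounded buffer. This is a minimal sketch using only the JDK, not the code CharsetToolkit itself runs; the class name `Utf8Probe` and the method `looksLikeUtf8` are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class Utf8Probe {

    /** Returns true if the first {@code length} bytes of the buffer decode as strict UTF-8. */
    static boolean looksLikeUtf8(byte[] buffer, int length) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(buffer, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Two limits of this sketch are worth noting: pure US-ASCII content also passes, since it is valid UTF-8, and a multi-byte sequence cut off by the end of the buffer is rejected as malformed.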
Usage:
    CharsetToolkit toolkit = new CharsetToolkit(file);
    // guess the encoding
    Charset guessedCharset = toolkit.getCharset();
    // create a reader with the correct charset
    BufferedReader reader = toolkit.getReader();
    // read the file content
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
Constructor Summary
Constructor and description |
---|
CharsetToolkit(File file) Constructor of the CharsetToolkit utility class. |
Methods Summary
| Type Params | Return Type | Name and description |
|---|---|---|
| | public static Charset[] | getAvailableCharsets() Retrieves all the available Charsets on the platform, including the default charset. |
| | public Charset | getCharset() |
| | public Charset | getDefaultCharset() Retrieves the default Charset. |
| | public static Charset | getDefaultSystemCharset() Retrieves the default charset of the system. |
| | public boolean | getEnforce8Bit() Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding. |
| | public BufferedReader | getReader() Gets a BufferedReader (actually a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered or the default charset if an 8-bit Charset is encountered. |
| | public boolean | hasUTF16BEBom() Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2). |
| | public boolean | hasUTF16LEBom() Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
| | public boolean | hasUTF8Bom() Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors). |
| | public void | setDefaultCharset(Charset defaultCharset) Defines the default Charset used in case the buffer represents an 8-bit Charset. |
| | public void | setEnforce8Bit(boolean enforce) If US-ASCII is recognized, return the default encoding rather than US-ASCII. |
Inherited Methods Summary
Methods inherited from class | Name |
---|---|
class Object | wait, wait, wait, equals, toString, hashCode, getClass, notify, notifyAll |
Constructor Detail
public CharsetToolkit(File file)
Constructor of the CharsetToolkit utility class.
- Parameters:
- file - the file whose encoding we want to know.
Method Detail
public static Charset[] getAvailableCharsets()
Retrieves all the available Charsets on the platform, including the default charset.
- Returns:
- an array of Charsets.
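For orientation, an equivalent array can be built with the standard JDK alone. The sketch below relies on `java.nio.charset.Charset.availableCharsets()`; the class name `AvailableCharsets` and the method `list` are hypothetical, and this is not necessarily how CharsetToolkit implements the method.

```java
import java.nio.charset.Charset;
import java.util.Collection;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class AvailableCharsets {

    /** Returns every charset the running JVM supports, as an array. */
    static Charset[] list() {
        // availableCharsets() maps canonical charset names to Charset instances.
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }
}
```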
public Charset getCharset()
public Charset getDefaultCharset()
Retrieves the default Charset.
public static Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
- Returns:
- the default Charset.
public boolean getEnforce8Bit()
Gets the enforce8Bit flag, which is set when we never want a US-ASCII encoding to be returned.
- Returns:
- the current value of the enforce8Bit flag.
public BufferedReader getReader()
Gets a BufferedReader (actually a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered or the default charset if an 8-bit Charset is encountered.
- Throws:
- FileNotFoundException if the file is not found.
- Returns:
- a BufferedReader
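The shape of such a reader can be sketched with standard `java.io` classes. This is an illustration under assumptions, not the toolkit's actual implementation: the class name `CharsetReaders` and the method `open` are hypothetical, and the charset is passed in explicitly instead of being guessed.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class CharsetReaders {

    /** Opens the file with an explicit charset; LineNumberReader is a BufferedReader subclass. */
    static BufferedReader open(File file, Charset charset) throws FileNotFoundException {
        return new LineNumberReader(new InputStreamReader(new FileInputStream(file), charset));
    }
}
```

This sketch does not skip a leading BOM: the JDK's UTF-8 decoder passes it through, so the first character read from such a file would be U+FEFF.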
public boolean hasUTF16BEBom()
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
- Returns:
- true if the buffer has a BOM for UTF-16 Big Endian.
public boolean hasUTF16LEBom()
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
- Returns:
- true if the buffer has a BOM for UTF-16 Little Endian.
public boolean hasUTF8Bom()
Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).
- Returns:
- true if the buffer has a BOM for UTF-8.
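The three BOM checks come down to comparing the first bytes of the buffer against fixed signatures: EF BB BF for UTF-8, FE FF for UTF-16 Big Endian, and FF FE for UTF-16 Little Endian. The following is a minimal sketch of that comparison; the class `BomSniffer` and its method names are hypothetical and only mirror the methods documented above.

```java
// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class BomSniffer {

    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 big-endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    }

    /** UTF-16 little-endian BOM: FF FE. */
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE;
    }
}
```

Masking with `0xFF` matters because Java bytes are signed; without it, a byte such as `0xEF` would compare as a negative number.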
public void setDefaultCharset(Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
- Parameters:
- defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.
public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, return the default encoding rather than US-ASCII. The file might not contain any special character in the range 128-255, but it may be, or may later become, a file encoded with the default charset rather than US-ASCII.
- Parameters:
- enforce - a boolean specifying whether to return the default charset instead of US-ASCII.
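The effect of the flag can be expressed as a small decision rule. The sketch below is an assumption about the behavior described above, not the toolkit's source; the class `Enforce8BitSketch` and the method `resolve` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class Enforce8BitSketch {

    /** Substitutes the default charset for US-ASCII when the flag is set. */
    static Charset resolve(Charset detected, Charset defaultCharset, boolean enforce8Bit) {
        if (enforce8Bit && StandardCharsets.US_ASCII.equals(detected)) {
            // The sampled bytes were all below 128, but the file may still be
            // written in the 8-bit default charset, so prefer that.
            return defaultCharset;
        }
        return detected;
    }
}
```

Any other detected charset, such as UTF-8, passes through unchanged regardless of the flag.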
© 2003-2020 The Apache Software Foundation
Licensed under the Apache license.
https://docs.groovy-lang.org/3.0.7/html/gapi/groovy/util/CharsetToolkit.html