Fahri NH

Plan the Future, Do the Day, Learn the Past

Nov 8, 2017 - 3 minute read - Comments - Java

Java: How To Autodetect The Charset Encoding of A Text File and Remove Byte Order Mark (BOM)

TL;DR

Required dependencies (pom.xml):

<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>60.1</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>

Autodetect the charset encoding of a text file or input stream, then 'remove' (skip) the Byte Order Mark (BOM) while reading the content with the detected charset:

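import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

File inputFile = new File("/Users/fahri/Downloads/UNKNOWN_TEXT.txt");

// BOMInputStream skips a leading BOM; the BufferedInputStream underneath
// provides the mark/reset support that CharsetDetector needs.
BOMInputStream bomInputStream = new BOMInputStream(new BufferedInputStream(new FileInputStream(inputFile)),
        ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);

System.out.println("HAS BOM : " + bomInputStream.hasBOM());

// The detector samples the start of the stream and then resets it.
CharsetDetector detector = new CharsetDetector();
detector.setText(bomInputStream);

CharsetMatch charsetMatch = detector.detect();
System.out.println("CHARSET MATCH : " + charsetMatch.getName());

// Closing the reader also closes the wrapped streams.
try (BufferedReader br = new BufferedReader(new InputStreamReader(bomInputStream, charsetMatch.getName()))) {
    for (String line = br.readLine(); line != null; line = br.readLine()) {
        System.out.println(line);
    }
}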

Problem

You have a text file, and you have no idea why your application cannot process (parse) it.

This is a sample of that file: UNKNOWN_TEXT.txt

Analyze The Content

cat

If we inspect the content using cat:

[Screenshot: ss-cat-unknown]

The content looks normal. But wait.
What are those weird characters at the beginning and end? (�� and %)

more

Display the content using another tool, more:

[Screenshot: ss-more-unknown]

It looks even weirder: <FF><FE> at the beginning and ^@ between the characters.

file

What does the file command say?

$ file UNKNOWN_TEXT.txt
UNKNOWN_TEXT.txt: Little-endian UTF-16 Unicode text  

The text file is Little-Endian UTF-16, while our viewer (iTerm2 on OSX, in this case) decodes its input as UTF-8. That is why the output looks garbled, and why an application that does not expect UTF-16 cannot process (parse) the file.
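To make the mismatch concrete, here is a minimal sketch that decodes the first six bytes of such a file in two ways. The class name MisdecodeDemo and its helper methods are hypothetical, chosen only for illustration; the byte values are the start of the sample file (ff fe followed by "1|").

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // First bytes of the sample file: ff fe, then "1|" in UTF-16LE.
    static final byte[] SAMPLE = {(byte) 0xFF, (byte) 0xFE, 0x31, 0x00, 0x7C, 0x00};

    // Decode with the wrong charset, roughly what a UTF-8 terminal does.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode with the charset reported by the file command.
    static String decodeAsUtf16Le(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    // Print each char as a code point so invisible characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // ff and fe are invalid in UTF-8, so each becomes U+FFFD; every 00 becomes a NUL char.
        System.out.println("as UTF-8    : " + codePoints(decodeAsUtf8(SAMPLE)));
        // The explicit UTF-16LE decoder keeps the leading ff fe as the character U+FEFF.
        System.out.println("as UTF-16LE : " + codePoints(decodeAsUtf16Le(SAMPLE)));
    }
}
```

The two U+FFFD replacement characters correspond to the �� seen in the cat output, and the NUL characters correspond to the ^@ shown by more.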

hexdump

Let's investigate in more detail using hexdump:

$ hexdump UNKNOWN_TEXT.txt
0000000 ff fe 31 00 7c 00 43 00 65 00 6c 00 6c 00 32 00
0000010 31 00 7c 00 31 00 32 00 33 00 34 00 7c 00 32 00
0000020 30 00 31 00 37 00 2f 00 31 00 30 00 2f 00 32 00
0000030 30 00 0a 00 32 00 7c 00 43 00 65 00 6c 00 6c 00
0000040 32 00 32 00 7c 00 35 00 36 00 37 00 38 00 7c 00
0000050 32 00 30 00 31 00 37 00 2f 00 31 00 30 00 2f 00
0000060 32 00 30 00
0000064

We can conclude that <FF><FE> and ^@ represent the hex values ff fe and 00, respectively.
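As a rough cross-check, the sketch below turns the first row of the dump back into bytes and decodes them. The class HexdumpDecodeDemo and its methods are hypothetical names used only for this illustration. It relies on Java's built-in "UTF-16" charset, which reads a leading ff fe or fe ff to choose the byte order and leaves it out of the decoded string.

```java
import java.nio.charset.StandardCharsets;

public class HexdumpDecodeDemo {
    // Turn space-separated hex pairs (as printed by hexdump) back into bytes.
    static byte[] parseHex(String hex) {
        String[] parts = hex.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    // The "UTF-16" decoder uses the leading ff fe / fe ff to pick the byte order.
    static String decodeUtf16WithBom(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // First row of the hexdump output, without the offset column.
        byte[] row = parseHex("ff fe 31 00 7c 00 43 00 65 00 6c 00 6c 00 32 00");
        System.out.println(decodeUtf16WithBom(row));
    }
}
```

Each pair such as 31 00 decodes to one character, which is why the 00 bytes show up between the visible characters in the raw output.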

Endianness

UTF-16 stores text as 16-bit code units, so a character such as Z takes two bytes (2 x 8 bits = 16, makes sense, huh?); characters outside the Basic Multilingual Plane take two code units, i.e. four bytes. There are two ways to order the bytes of a code unit [1]:

  • Big Endian : the most significant byte is stored at the lowest address
  • Little Endian : the least significant byte is stored at the lowest address

As a concrete example, here is the character Z in various encodings:

Character                     : Z
ASCII (in hex)                : 5A
UTF-16 Little Endian (in hex) : 5A 00
UTF-16 Big Endian (in hex)    : 00 5A
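The values above can be reproduced with the JDK's standard charsets. This is a small sketch; the class EndiannessDemo and the helper toHex are hypothetical names introduced for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EndiannessDemo {
    // Encode the text with the given charset and render the bytes as hex pairs.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("ASCII    : " + toHex("Z", StandardCharsets.US_ASCII));
        System.out.println("UTF-16LE : " + toHex("Z", StandardCharsets.UTF_16LE));
        System.out.println("UTF-16BE : " + toHex("Z", StandardCharsets.UTF_16BE));
    }
}
```

The explicit UTF_16LE and UTF_16BE charsets are used here because they encode only the character itself, so the output shows just the two bytes in each order.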

Byte Order Mark (BOM)

You already know the meaning of 00. So what exactly is the ff fe in the first two bytes?