How to Read UTF-8 Encoded Files

Background

This article demonstrates how to read UTF-8 encoded files that contain logographic characters such as Kanji (Japan), Hanja (Korea), and Hanzi (China).

Hardware Environment

n/a

Software Environment

Windows 7 Professional SP1
Eclipse – Kepler Release
Java 1.7 (1.7.0_67 – Windows x86)

First Things First: Eclipse

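By default, Eclipse on Windows uses the platform encoding (typically Cp1252) rather than UTF-8 for text files. With that setting, logographic characters are displayed as “????”. To avoid this, change the Eclipse text file encoding to UTF-8 (Window > Preferences > General > Workspace, under "Text file encoding"), as shown in the image below.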
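To see where the “????” placeholders come from, here is a minimal sketch that encodes a string with a charset that cannot represent it and then decodes it again. The class name EncodingLossDemo and the method roundTrip are hypothetical names chosen for this illustration, and ISO-8859-1 stands in for whatever non-UTF-8 default your system uses. The sample text is written with \u escapes so that the example compiles regardless of the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original article's code.
class EncodingLossDemo {

    // Encodes the text to bytes with the given charset, then decodes it back.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Three Kanji characters, written as Unicode escapes.
        String kanji = "\u65E5\u672C\u8A9E";

        // ISO-8859-1 has no mapping for these characters, so each one
        // is replaced with '?' when the string is encoded.
        String lossy = roundTrip(kanji, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 round trip: " + lossy);

        // UTF-8 can represent every Unicode character, so nothing is lost.
        String intact = roundTrip(kanji, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip preserved the text: " + intact.equals(kanji));

        // The platform default is what applies when no charset is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

The same kind of replacement happens whenever text passes through an encoding that does not cover its characters, which is why the charset should be stated explicitly instead of relying on the platform default.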

The Code to Read UTF-8 Encoded Files

/*
 * Copyright (C) 2014 www.turreta.com
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.turreta.io.file;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class UTF8FileReader {
	/**
	 * Reads the given file as UTF-8 text and prints each line to standard output.
	 */
	public void read(File file) throws IOException {
		// try-with-resources (Java 7+) closes the reader automatically and
		// avoids a NullPointerException when the file cannot be opened.
		try (BufferedReader in = new BufferedReader(
				new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
			String str;
			while ((str = in.readLine()) != null) {
				System.out.println("Processing line: " + str);
			}
		}
	}
}
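
To try the reading approach end to end, the following sketch writes a small UTF-8 sample file and reads it back. The names Utf8ReadDemo, readLines, and writeSample are hypothetical and exist only for this illustration. Unlike the class above, readLines returns the lines as a list instead of printing them, which makes the result easier to check. The sample text uses \u escapes, and the file is a temporary one that is removed when the JVM exits.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original article's code.
class Utf8ReadDemo {

    // Reads the file as UTF-8 text and returns its lines in order.
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Writes a two-line sample file encoded as UTF-8 and returns it.
    static File writeSample() throws IOException {
        File file = File.createTempFile("utf8-sample", ".txt");
        file.deleteOnExit();
        // Line 1: two Hanzi characters; line 2: three Kanji characters.
        String content = "\u6F22\u5B57\n\u65E5\u672C\u8A9E\n";
        Files.write(file.toPath(), content.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = readLines(writeSample());
        System.out.println("Lines read: " + lines.size());
        for (String line : lines) {
            // Print the character count so the result is visible even when
            // the console itself cannot render the characters.
            System.out.println(line.length() + " characters: " + line);
        }
    }
}
```

If the console still shows “????” while the character counts are correct, the file was read properly and only the console encoding needs to be changed.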