Character Encodings in Java (Unicode, UTF-8, UTF-16): The Ins and Outs (Part 2)

2014-11-24 09:04:27
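import java.io.UnsupportedEncodingException;

public class Test {
    // Lookup table mapping a 4-bit value to its lowercase hex digit.
    private final static char[] HEX = "0123456789abcdef".toCharArray();

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "中国";
        // Legacy "Unicode*" charset names alongside the standard UTF-16 names.
        String[] encoding = { "Unicode", "UnicodeBig", "UnicodeLittle", "UnicodeBigUnmarked",
                "UnicodeLittleUnmarked", "UTF-16", "UTF-16BE", "UTF-16LE" };

        // Print the bytes produced by each encoding, one encoding per line.
        for (int i = 0; i < encoding.length; i++) {
            System.out.printf("%-22s %s%n", encoding[i],
                    bytes2HexString(str.getBytes(encoding[i])));
        }
    }

    // Formats a byte array as space-separated two-digit hex values, e.g. "fe ff 4e 2d".
    public static String bytes2HexString(byte[] bys) {
        if (bys.length == 0) {
            return ""; // avoid a negative array size for empty input
        }
        // Two hex digits per byte plus one space between adjacent bytes.
        char[] chs = new char[bys.length * 3 - 1];
        for (int i = 0, offset = 0; i < bys.length; i++) {
            if (i > 0) {
                chs[offset++] = ' ';
            }
            chs[offset++] = HEX[(bys[i] >> 4) & 0xf]; // high nibble
            chs[offset++] = HEX[bys[i] & 0xf];        // low nibble
        }
        return new String(chs);
    }
}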
The output is as follows:

Unicode                fe ff 4e 2d 56 fd
UnicodeBig             fe ff 4e 2d 56 fd
UnicodeLittle          ff fe 2d 4e fd 56
UnicodeBigUnmarked     4e 2d 56 fd
UnicodeLittleUnmarked  2d 4e fd 56
UTF-16                 fe ff 4e 2d 56 fd
UTF-16BE               4e 2d 56 fd
UTF-16LE               2d 4e fd 56


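As the output shows, the various Unicode and UTF-16 encodings differ in byte order and in whether they write a byte order mark (BOM): some start with fe ff (the big-endian BOM), some start with ff fe (the little-endian BOM), and some have no BOM at all.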
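The role of the BOM can be checked with a short sketch. The class name `BomDemo` and the helper `detectBom` are hypothetical names introduced only for this illustration; the charset constants come from `java.nio.charset.StandardCharsets`. The string is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    /** Reports the byte order implied by a leading UTF-16 BOM, or "none" if there is no BOM. */
    static String detectBom(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xff; // mask to treat the byte as unsigned
            int b1 = data[1] & 0xff;
            if (b0 == 0xfe && b1 == 0xff) {
                return "big-endian";
            }
            if (b0 == 0xff && b1 == 0xfe) {
                return "little-endian";
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        String str = "\u4e2d\u56fd"; // the same two characters used in Test above

        // Encoding with UTF-16 writes a big-endian BOM; UTF-16BE and UTF-16LE write none.
        System.out.println(detectBom(str.getBytes(StandardCharsets.UTF_16)));   // big-endian
        System.out.println(detectBom(str.getBytes(StandardCharsets.UTF_16BE))); // none
        System.out.println(detectBom(str.getBytes(StandardCharsets.UTF_16LE))); // none

        // When decoding, the UTF-16 charset reads the BOM to choose the byte order
        // and does not include the BOM in the resulting string.
        byte[] littleWithBom = { (byte) 0xff, (byte) 0xfe, 0x2d, 0x4e, (byte) 0xfd, 0x56 };
        System.out.println(new String(littleWithBom, StandardCharsets.UTF_16).equals(str)); // true
    }
}
```

Without a BOM the byte order cannot be recovered from the data itself, so data encoded as UTF-16BE or UTF-16LE has to be decoded with the matching charset name.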


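In summary:

Unicode (in Java, a legacy name for UTF-16) and UTF-16: a character in the Basic Multilingual Plane takes 2 bytes, regardless of language; a supplementary character (above U+FFFF, such as many emoji) takes 4 bytes as a surrogate pair. A BOM, when present, adds 2 more bytes at the start of the data.

UTF-8: a character takes 1 to 4 bytes. An English (ASCII) character takes 1 byte, and a commonly used Chinese character (likewise most Japanese and Korean characters) takes 3 bytes.

A char in Java is a UTF-16 code unit, so a char takes 2 bytes; a supplementary character needs two chars.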
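The byte counts per character can be verified with a minimal sketch like the one below. The class name `ByteCountDemo` and its two helpers are hypothetical names used only for this illustration. UTF-16BE is used for counting so that no BOM is included, and U+1F600 is added as an example of a character outside the Basic Multilingual Plane, which takes two chars in Java.

```java
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies when encoded as UTF-16 without a BOM. */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // One character per sample, written as escapes to stay independent of file encoding:
        // U+0041 (Latin A), U+00E9 (e with acute), U+4E2D (a Chinese character), U+1F600 (an emoji).
        String[] samples = { "A", "\u00e9", "\u4e2d", "\ud83d\ude00" };
        for (String s : samples) {
            System.out.printf("U+%04X  chars=%d  UTF-8=%d  UTF-16=%d%n",
                    s.codePointAt(0), s.length(), utf8Length(s), utf16Length(s));
        }
        // Prints:
        // U+0041  chars=1  UTF-8=1  UTF-16=2
        // U+00E9  chars=1  UTF-8=2  UTF-16=2
        // U+4E2D  chars=1  UTF-8=3  UTF-16=2
        // U+1F600  chars=2  UTF-8=4  UTF-16=4
    }
}
```

Because `String.length()` counts chars (UTF-16 code units), not characters, `codePointCount` is the safer choice when text may contain supplementary characters.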


Also, as a side note: 1 byte is 8 bits.


Author: tianjf0514