FWIDTH_AND_FULLWIDTH_FORMS 判断中文的,号
private static final boolean isChinese(char c) {
Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
|| ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
|| ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
|| ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
|| ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
|| ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
return true;
}
return false;
}
总结一下,一般来说用第一种方法就足够了,扩展字符集用的比较少,另外从HanLP的github星数来说,这种方案的通用度还是可信的。
好了,先到这里。这里也留一个坑,mysql数据库里边的VARCHAR类型和Java的char类型是一种处理方式么?下一篇也会从String源代码的角度对这里的分析进行一个佐证。