Thoroughly Understanding UTF-8, Unicode, Wide Characters, and locale - C++ Programming Basics


Thoroughly Understanding UTF-8, Unicode, Wide Characters, and locale (Part 2)

2019-07-08 14:10:48

#include <locale.h>
#include <stdio.h>
#include <time.h>

int main () {
    time_t currtime;
    struct tm *timer;
    char buffer[80];

    time( &currtime );
    timer = localtime( &currtime );

    /* setlocale() returns the name of the locale that is now in effect */
    printf("Locale is: %s\n", setlocale(LC_TIME, "en_US.iso88591"));
    strftime(buffer, 80, "%c", timer);
    printf("Date is: %s\n", buffer);

    printf("Locale is: %s\n", setlocale(LC_TIME, "zh_CN.UTF-8"));
    strftime(buffer, 80, "%c", timer);
    printf("Date is: %s\n", buffer);

    /* "" selects the locale configured in the environment */
    printf("Locale is: %s\n", setlocale(LC_TIME, ""));
    strftime(buffer, 80, "%c", timer);
    printf("Date is: %s\n", buffer);

    return(0);
}

Compiling and running the program produces the following output:

Locale is: en_US.iso88591
Date is: Sun 07 Jul 2019 04:08:39 PM CST
Locale is: zh_CN.UTF-8
Date is: 2019年07月07日 星期日 16时08分39秒
Locale is: zh_CN.UTF-8
Date is: 2019年07月07日 星期日 16时08分39秒

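As the output shows, strftime() produces different results once LC_TIME is set to different values.
char* setlocale (int category, const char* locale); sets the locale of the current program.
category: selects which part of the locale the call affects. LC_CTYPE affects character classification and character conversion, LC_TIME affects date and time formatting, and LC_ALL affects everything.
locale: the name of the locale to install for that category. The example above uses "en_US.iso88591", "zh_CN.UTF-8", and the empty string "". The empty string means "use the default locale configured in the operating system environment".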
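The behaviour described above can be tried in isolation with a small helper. The following is only a sketch: `weekday_in_locale` is a hypothetical name that does not appear in the original article. It relies on two standard properties of setlocale(): passing NULL queries the current locale without changing it, and an unknown locale name makes the call return NULL and leave the locale untouched.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not from the article): format the weekday name of
 * *when under the LC_TIME locale `name`, then restore the previous locale.
 * Returns the number of bytes written to buf, or 0 if the locale is not
 * available on this system or the buffer is too small. */
size_t weekday_in_locale(const char *name, const struct tm *when,
                         char *buf, size_t size)
{
    char saved[128];

    /* A NULL locale argument only queries the current setting. The returned
     * string may be overwritten by later calls, so copy it first. */
    const char *current = setlocale(LC_TIME, NULL);
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    /* setlocale() returns NULL and changes nothing if `name` is unknown. */
    if (setlocale(LC_TIME, name) == NULL)
        return 0;

    size_t written = strftime(buf, size, "%A", when);

    setlocale(LC_TIME, saved);   /* put the old LC_TIME back */
    return written;
}
```

Calling it with "C" and a struct tm whose tm_wday is 0 yields "Sunday". With "zh_CN.UTF-8" the result depends on whether that locale is installed, which is why the helper checks the return value of setlocale() instead of assuming success.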

References:
setlocale()

Why wide character types are needed

The two characters of “你好” have the Unicode code points U+4F60 and U+597D. Their UTF-8 encodings are “0xe4 0xbd 0xa0” and “0xe5 0xa5 0xbd” respectively.
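To see where those bytes come from, here is a minimal sketch of the standard UTF-8 bit layout. The function name `utf8_encode` is an illustrative assumption and is not part of the original article or of the C library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the article): encode one Unicode code point
 * as UTF-8. Writes up to 4 bytes to out and returns the byte count, or 0 if
 * cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates cannot be encoded */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x4F60 the function returns 3 and fills in e4 bd a0; for 0x597D it returns 3 and fills in e5 a5 bd, matching the values quoted above.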

Multibyte strings are stored in the compiled executable as UTF-8

#include <stdio.h>
#include <string.h>

int main(void) {
    char s[] = "你好";
    size_t len = strlen(s);
    printf("len = %d\n", (int)len);
    printf("%s\n", s);
    return 0;
}

Compiling and running it gives:

len = 6
你好

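Running od on the compiled executable shows that "你好" is stored in UTF-8, that is, as the six bytes “0xe4 0xbd 0xa0” and “0xe5 0xa5 0xbd”. (This assumes the source file is saved as UTF-8 and the compiler does not re-encode string literals, which is GCC's default behaviour.)
strlen() only looks for the terminating 0 byte and does not care what the string contains, so len is 6, the number of bytes in the UTF-8 encoding of “你好”.
printf("%s\n", s); amounts to writing the six bytes “0xe4 0xbd 0xa0” and “0xe5 0xa5 0xbd” to the device file of the current terminal. If the terminal (its driver or emulator) understands UTF-8, the Chinese characters are displayed; if it does not, they are not displayed correctly.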
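Because strlen() counts bytes, getting the number of characters out of a UTF-8 string takes extra work. One common approach, sketched here under the assumption that the input is well-formed UTF-8, is to count every byte that is not a continuation byte (continuation bytes have the form 10xxxxxx). `utf8_count` is a hypothetical name, not a standard function.

```c
#include <stddef.h>

/* Hypothetical helper (not from the article): count the code points in a
 * NUL-terminated string, assuming it holds well-formed UTF-8. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx. Every other byte starts
         * a new character, so only those are counted. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the six bytes e4 bd a0 e5 a5 bd, strlen() reports 6 while utf8_count() reports 2. This kind of manual bookkeeping is part of what wide characters are meant to avoid.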

Wide strings are stored in the compiled executable as Unicode code points

#include <wchar.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "zh_CN.UTF-8");   // set the locale
    wchar_t s[] = L"你好";
    size_t len = wcslen(s);
    printf("len = %d\n", (int)len);
    printf("%ls\n", s);
    return 0;
}

Compiling and running it gives:

len = 2
你好

Running od on the compiled executable reveals the following bytes:

0003020 001  \0 002  \0   `   O  \0  \0   }   Y  \0  \0  \n  \0  \0  \0
               00020001        00004f60        0000597d        0000000a

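00004f60 is exactly the Unicode code point of “你”, and 0000597d is the code point of “好”. So a wide string is stored in the executable as Unicode code points, here as 4-byte values.
wchar_t is the wide character type. Prefixing a character constant or a string literal with L makes it a wide character constant or a wide string. Each character occupies one wchar_t element, so len is 2.
wcslen() differs from strlen(): it does not stop at the first 0 byte, but only at a wide character whose UCS code is 0.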
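The difference between the two scans can be made concrete with a short sketch. Both function names below are hypothetical and are not taken from the article: `wide_length` mimics what wcslen() does, and `bytes_before_zero` shows what a byte-oriented scan in the style of strlen() would see in the same memory.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical re-implementation of wcslen() (illustration only): count
 * wchar_t elements up to, but not including, the terminating L'\0'. */
size_t wide_length(const wchar_t *s)
{
    size_t n = 0;
    while (s[n] != L'\0')
        ++n;
    return n;
}

/* Hypothetical helper: count the bytes that precede the first zero byte,
 * which is what a strlen()-style scan over the same memory would report. */
size_t bytes_before_zero(const wchar_t *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (p[n] != 0)
        ++n;
    return n;
}
```

With a 4-byte little-endian wchar_t, as in the od dump above, “你” is laid out as 60 4f 00 00. `bytes_before_zero(L"\u4f60\u597d")` therefore stops after 2 bytes, in the middle of the first character, while `wide_length` correctly returns 2 characters. The byte-level result differs on platforms with another wchar_t size or byte order.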
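Wide characters are held in memory as Unicode code points, but writing them to the terminal still requires a multibyte encoding that the terminal can interpret. printf therefore converts the wide string to a multibyte string internally and then writes it out. This conversion depends on the locale: setlocale(LC_ALL, "zh_CN.UTF-8"); sets LC_ALL of the current process to zh_CN.UTF-8, so printf converts the Unicode code points to multibyte UTF-8 and writes the result to the terminal device. If setlocale(LC_ALL, "zh_CN.UTF-8"); is changed to setloc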
