MySQL”海量数据”查询性能分析 - 数据库编程

最近做了一次MySQL所谓的”海量数据”查询性能分析.

表结构

dt	dt2	dt3	it	it2	it3
id	id	id	id	id	id	int PK
		ext1			ext1	varchar(256)
time	time	time	time	time	time	int/datetime KEY
	ext2	ext2		ext2	ext2	varchar(128)

说明, MyISAM引擎, dt表示时间字段使用datetime类型, it表示时间字段使用int类型.

初始数据

首先生成100K个UNIX时间戳(int), 然后随机选取10M次, 每一次往6个表里插入一条记录(当time字段是datetime类型时, 做类型转换). 所以每一个表都有10M条记录. ext1和ext2字段会用随机的字符串填充.

SQL查询

使用的查询SQL语句如:

select SQL_NO_CACHE count(*) from it where time>10000;
select SQL_NO_CACHE count(*) from dt where time>from_unixtime(10000);
select SQL_NO_CACHE * from it where time>10000 order by time limit 1;
select SQL_NO_CACHE * from it use key(PRIMARY) where time>10000 order by id limit 1;

SQL_NO_CACHE用于消除查询结果缓存的影响. use key用于指定查询时使用的索引. 统计每一条SQL的执行时间(单位s)和满足WHERE条件的记录总数(total), it-tm表示在dt表上执行SQL的耗时, 并explain得到key和extra, 结果如下.

where	total	select	orderby	key	it-tm	dt-tm	it2-tm	dt2-tm	it3-tm	dt3-tm	extra
time>10000	8999050	count(*)		time	3.52	4.28	3.74	4.49	3.53	4.47	where; index
		count(time)		time	3.44	4.00	3.69	4.36	3.56	4.26	where; index
		count(id)		NULL	1.44	1.92	4.30	4.60	4.79	4.98	where
		*	time	time	0.00	0.00	0.00	0.00	0.00	0.00	where
		*	id	time	14.81	15.38	19.37	20.30	20.94	21.42	where; filesort
		*	id	PK	0.00	0.03	0.00	0.02	0.00	0.04	where

time>50000	4987990	count(*)			1.90	2.36	2.02	2.41	1.99	2.42
		count(time)			1.90	2.23	2.01	2.32	1.96	2.29
		count(id)			1.48	1.91	4.25	4.61	4.80	5.12
		*	time		0.00	0.00	0.00	0.00	0.00	0.00
		*	id		8.15	8.77	10.74	11.36	11.59	11.79
		*	id		0.00	0.00	0.00	0.00	0.00	0.00

time>80000	1991982	count(*)			0.76	0.95	0.83	0.98	0.80	1.00
		count(time)			0.77	0.91	0.81	0.91	0.83	0.92
		count(id)			1.38	1.86	4.17	4.49	4.71	5.02
		*	time		0.00	0.00	0.00	0.00	0.00	0.00
		*	id		3.26	3.44	4.26	4.51	4.56	4.76
		*	id		0.00	0.00	0.00	0.00	0.00	0.00

time>99900	10871	count(*)			0.00	0.00	0.00	0.01	0.01	0.00
		count(time)			0.01	0.01	0.01	0.00	0.01	0.01
		count(id)			0.01	0.01	0.02	0.03	0.02	0.02
		*	time		0.00	0.00	0.00	0.00	0.00	0.00
		*	id		0.02	0.02	0.03	0.03	0.03	0.03
		*	id		0.00	0.00	0.00	0.00	0.00	0.00

分析和结论

count(*), count(time)和count(id)的对比. 结果有较大变化. 当表的字段只有2个且查询条件较宽松(即符合条件的记录数较多)时, count(id)比count(*)快很多, 但是, 当表中还有其它的字段时, count(id)反而更慢了. 虽然id是主键, time是索引列, 但是select count(id) where time并没有用到索引, 而是进行全表扫描. 当对count(*)进行ignore key(time)时, 查询时间和count(id)相同.证明当结果集较大时索引导致查询变慢,应该是全表扫描进行的是连续的磁盘IO和内存操作, 而使用索引是进行随机的磁盘IO和内存操作, 并且MyISAM存储索引的BTree结构占用更多的空间. 当WHERE条件约束更严格, total的值小到一定程度时, 全表扫描比使用索引慢, 因为索引极大减少了磁盘IO和内存操作.

排序字段和索引的使用. 当有排序且LIMIT(偏移为0)时,如果查询时使用的索引不是排序字段的索引, 那么速度非常慢. 当偏移不为0时, 如果使用排序列的索引, 要考虑偏移可能导致扫描的记录数, 所以应该根据情况选取合适的索引.

判断符合条件的记录是否存在, 使用select * limit 1速度要比select count(*)计数快得多.

时间字段类型的选择. int比datetime快, 但差距不是很明显.

无论如何, 条件限制得越严格, 查询就会越快.

另外, 根据随机id更新时, 大约能达到5K行/s.

列的先后顺序对查询性能的影响也非常大.