Resolving CPU exhaustion caused by heavy latch free waits on a production database

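Around noon, the CPU on one of our production databases stayed persistently high.

We used the following SQL statement to look at the waits and contention in the database at that time: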

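-- Active sessions with their OS process, current SQL and wait event,
-- longest-running call first
select s.SID,
       s.SERIAL#,
       'kill -9 ' || p.SPID as kill_cmd,  -- OS command to kill the server process if needed
       s.MACHINE,
       s.OSUSER,
       s.PROGRAM,
       s.USERNAME,
       s.last_call_et,                    -- seconds since the current call started
       a.SQL_ID,
       s.LOGON_TIME,
       a.SQL_TEXT,
       a.SQL_FULLTEXT,
       w.EVENT,                           -- what the session is waiting on
       a.DISK_READS,
       a.BUFFER_GETS                      -- logical reads for this SQL
  from v$process p, v$session s, v$sqlarea a, v$session_wait w
 where p.ADDR = s.PADDR
   and s.SQL_ID = a.sql_id
   and s.sid = w.SID
   and s.STATUS = 'ACTIVE'
 order by s.last_call_et desc;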

The event column shows that latch contention is the cause.

The following SQL shows which latch is involved:

select * from v$session_wait
where event = 'latch free';

P2 is the latch number (not the name); looking it up as latch# in the v$latchname view gives the specific latch:

1:45:55 PM SQL> select * from v$latchname where latch#=164;
 
    LATCH# NAME                                                                   HASH
---------- ---------------------------------------------------------------- ----------
       164 simulator hash latch                                             2233208730
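
The two lookups above can also be combined into one query. This is a minimal sketch, assuming the sessions are waiting on the generic 'latch free' event, where P1 is the latch address, P2 the latch number and P3 the number of tries; the column aliases are illustrative only:

```sql
-- Resolve the latch name for every session currently waiting on 'latch free'
select w.sid,
       w.p1raw as latch_addr,   -- address of the latch being waited on
       w.p2    as latch_num,    -- latch number, matches v$latchname.latch#
       n.name  as latch_name,
       w.p3    as tries         -- number of times the session has tried so far
  from v$session_wait w, v$latchname n
 where w.event = 'latch free'
   and w.p2 = n.latch#
 order by n.name, w.sid;
```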

Check the historical latch statistics (cumulative since instance startup):

2:11:59 PM SQL> select name,gets,misses,sleeps from v$latch where sleeps >0 order by sleeps desc;
 
NAME                                                                   GETS     MISSES     SLEEPS
---------------------------------------------------------------- ---------- ---------- ----------
simulator hash latch                                             4827860212  135426899   10890947
cache buffers chains                                             1619822817 2850976006    4747728
gc element                                                       4660052091   25748270     175073
resmgr:schema config                                               91872524     153968      95708
ges resource hash list                                            174151449    1070556      55459
Real-time plan statistics latch                                    40953155     651496      44527
call allocation                                                     3301878     265908      43501
row cache objects                                                 336300485    4970324      19366

The simulator hash latch clearly stands out, with by far the most sleeps.

eygle has an article on his site about this simulator:
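
Raw counts are easier to compare as ratios. The query below is a sketch that adds a miss percentage and a sleeps-per-miss percentage to the same v$latch query; the aliases miss_pct and sleep_pct are illustrative names, and the figures are still cumulative since instance startup:

```sql
-- Same ranking as above, with ratios to make the contention easier to compare
select name,
       gets,
       misses,
       sleeps,
       round(misses / nullif(gets, 0) * 100, 2)   as miss_pct,   -- misses per 100 gets
       round(sleeps / nullif(misses, 0) * 100, 2) as sleep_pct   -- sleeps per 100 misses
  from v$latch
 where sleeps > 0
 order by sleeps desc;
```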

http://www.eygle.com/archives/2011/11/simulator_lru_latch.html

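Simulator means simulation. When Oracle processes data blocks in memory, it also records related information, such as the DBA (data block address), in a pre-allocated buffer. After a block has been aged out, if the requested data still exists in the simulator memory on the next read, Oracle concludes that keeping that block cached would have been worthwhile. By monitoring and simulating these operations and weighting the results, Oracle can produce memory sizing advice.
This simulation is itself protected by latches; the relevant ones include simulator lru latch and simulator hash latch.

For the buffer cache, if this kind of contention is severe, consider turning off db_cache_advice to eliminate the performance impact of this internal bookkeeping.
Below is a related bug, in which having DB_CACHE_ADVICE enabled caused severe simulator lru latch contention: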

Bug 5918642 Heavy latch contention with DB_CACHE_ADVICE on
This note gives a brief overview of bug 5918642.
The content was last updated on: 01-APR-2008
Affects:
  Product (Component): Oracle Server (Rdbms)
  Range of versions believed to be affected: Versions < 11.2
  Versions confirmed as being affected: 10.2.0.3
  Platforms affected: Generic (all / most platforms affected)

Fixed:
  This issue is fixed in:
    11.2 (Future Release)
    10.2.0.4 (Server Patch Set)
    11.1.0.7 (Server Patch Set)

Symptoms:
  Latch Contention
  Waits for "latch free"

Related To:
  Performance Monitoring
  DB_CACHE_ADVICE

Description:
  High simulator lru latch contention can occur when db_cache_advice is
  set to ON if there is a large buffer cache.


Workaround:
  Set db_cache_advice to OFF
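
Applying the workaround could look like the sketch below. It assumes you want to change only the running instance (scope = memory) and leave the persisted setting alone; on RAC or with an spfile you may want a different scope or a sid clause, so treat these choices as assumptions rather than a recommendation:

```sql
-- Check the current setting (valid values are ON, READY and OFF)
select name, value
  from v$parameter
 where name = 'db_cache_advice';

-- Turn the buffer cache advisory off on the running instance only
-- (assumption: scope = memory, so the change is lost at the next restart)
alter system set db_cache_advice = off scope = memory;
```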

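Of course, this only treats the symptom, not the cause. The latch contention is the visible surface of the problem; the root cause is still the problematic SQL statement.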

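When a data block is read into the SGA, its buffer header is placed on a linked list (hash chain) hanging off a hash bucket. This memory structure is protected by a set of cache buffers chains child latches (also known as hash latches or CBC latches). Any select, update, insert or delete against a block in the buffer cache must first acquire the corresponding cache buffers chains child latch to guarantee exclusive access to the chain. If contention occurs during this step, the session waits on the latch: cache buffers chains event.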
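
To see whether the cache buffers chains contention is concentrated on a few child latches (which usually points to a few hot chains), the per-child statistics can be ranked. This is a sketch only; the top-10 cutoff is an arbitrary choice:

```sql
-- Top 10 cache buffers chains child latches by sleeps
select *
  from (select addr, child#, gets, misses, sleeps
          from v$latch_children
         where name = 'cache buffers chains'
         order by sleeps desc)
 where rownum <= 10;
```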

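Causes: 1. Inefficient SQL statements (mainly showing up as excessive logical reads). In some environments the application opens multiple concurrent sessions that execute the same inefficient SQL, all trying to get at the same data set. SQL statements with high BUFFER_GETS (logical reads) per execution are the main cause. Conversely, fewer logical reads mean fewer latch gets, which reduces latch contention and improves ...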
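
One way to find candidate statements is to rank the shared pool by logical reads. The query below is a minimal sketch, assuming the offending SQL is still cached in v$sqlarea; the alias gets_per_exec and the top-10 cutoff are illustrative:

```sql
-- Top 10 cached statements by total logical reads, with reads per execution
select *
  from (select sql_id,
               executions,
               buffer_gets,
               round(buffer_gets / nullif(executions, 0)) as gets_per_exec,
               sql_text
          from v$sqlarea
         order by buffer_gets desc)
 where rownum <= 10;
```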
