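these critical Oracle background processes from spinning or being blocked.
The process periodically writes the information it monitors to a trace file, which makes diagnosis easier;
this is one of the highlights of 11gR2. When the LMHB process finds that another core process is behaving abnormally, it will attempt some kill actions if a process is blocked.
If the problem still cannot be resolved within a certain time, a protection mechanism is triggered and the instance is forcibly terminated. This is, of course, done to protect the integrity and consistency of the data on the RAC nodes.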
Frustratingly, this diag dump did not produce anything: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc
Let's first look at the information for the LMON process on Node1:
*** ACTION NAME:() 2014-04-22 17:26:56.052
*** 2014-04-22 17:26:49.401
==============================
LMON (ospid: 27573) has not moved for 105 sec (1398158808.1398158703)
kjfmGCR_HBCheckAll: LMON (ospid: 27573) has status 6
==================================================
=== LMON (ospid: 27573) Heartbeat Report
==================================================
LMON (ospid: 27573) has no heartbeats for 105 sec. (threshold 70 sec)
  : Not in wait; last wait ended 89 secs ago.    ------------- waited 89 seconds
  : last wait_id 165313538 at 'enq: CF - contention'.
==============================
Dumping PROCESS LMON (ospid: 27573) States
==============================
===[ System Load State ]===
  CPU Total 16 Core 16 Socket 16
  Load normal: Cur 988 Highmark 20480 (3.85 80.00)
===[ Latch State ]===
  Not in Latch Get
===[ Session State Object ]===
  ----------------------------------------
  SO: 0x52daba340, type: 4, owner: 0x52f5d8330, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x52f5d8330, name=session, file=ksu.h LINE:12624 ID:, pg=0
  (session) sid: 1057 ser: 1 trans: 0x0, creator: 0x52f5d8330
            flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
            flags2: (0x409) -/-/INC
            DID: , short-term DID:
            txn branch: 0x0
            oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
  ksuxds FALSE at location: 0
  service name: SYS$BACKGROUND
  Current Wait Stack:
    Not in wait; last wait ended 1 min 29 sec ago
  There are 1 sessions blocked by this session.
  Dumping one waiter:
    inst: 1, sid: 297, ser: 6347
    wait event: 'name-service call wait'
      p1: 'waittime'=0x32
      p2: ''=0x0
      p3: ''=0x0
    row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
    min_blocked_time: 0 secs, waiter_cache_ver: 30272
  Wait State:
    fixed_waits=0 flags=0x20 boundary=0x0/-1
  Session Wait History:
      elapsed time of 1 min 29 sec since last wait    --- LMON was waiting on enq: CF - contention for 1 min 29 sec, i.e. 89 seconds
   0: waited for 'enq: CF - contention'
      name|mode=0x43460006, 0=0x0, operation=0x3
      wait_id=165313538 seq_num=35946 snap_id=1
      wait times: snap=1.027254 sec, exc=1.027254 sec, total=1.027254 sec
      wait times: max=1.000000 sec
      wait counts: calls=1 os=1
      occurred after 0.000109 sec of elapsed time
......
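Two numbers in this trace are worth checking by hand. The Python sketch below recomputes the heartbeat gap from the pair of timestamps in parentheses and decodes the name|mode value from the wait history. It assumes the commonly described enqueue encoding, in which the top two bytes of name|mode are the ASCII lock type and the low 16 bits are the requested mode; the function names are made up for this illustration and are not part of any Oracle tool.

```python
import re

# Lines copied verbatim from the LMON trace above.
TRACE = """\
LMON (ospid: 27573) has not moved for 105 sec (1398158808.1398158703)
LMON (ospid: 27573) has no heartbeats for 105 sec. (threshold 70 sec)
name|mode=0x43460006, 0=0x0, operation=0x3
"""


def heartbeat_gap(trace):
    """Return (seconds without heartbeat, threshold in seconds)."""
    # The two numbers in parentheses are treated as "now" and "last heartbeat".
    match = re.search(r"not moved for \d+ sec \((\d+)\.(\d+)\)", trace)
    now, last = (int(x) for x in match.groups())
    threshold = int(re.search(r"\(threshold (\d+) sec\)", trace).group(1))
    return now - last, threshold


def decode_name_mode(value):
    """Split an enqueue name|mode word into (lock type, mode).

    Assumption: bits 31-16 hold two ASCII characters, bits 15-0 the mode.
    """
    lock_type = chr((value >> 24) & 0xFF) + chr((value >> 16) & 0xFF)
    mode = value & 0xFFFF
    return lock_type, mode


gap, threshold = heartbeat_gap(TRACE)
raw = re.search(r"name\|mode=(0x[0-9a-fA-F]+)", TRACE).group(1)
lock_type, mode = decode_name_mode(int(raw, 16))

print(f"no heartbeat for {gap}s, threshold {threshold}s, exceeded: {gap > threshold}")
print(f"enqueue type {lock_type}, requested mode {mode}")
```

With the values from this trace, the gap is 105 seconds against a 70 second threshold, and the enqueue decodes to CF requested in mode 6 (exclusive).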
The resource usage of this process is shown below:
*** 2014-04-22 17:26:57.229
loadavg : 3.94 3.80 3.99
swap info: free_mem = 36949.53M rsv = 24548.22M
           alloc = 23576.62M avail = 45643.61M swap_free = 46615.21M
  F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
  0 O oa 27573 1 6 79 20 799187 Jan 23 1589:29 ora_lmon_authorcl1
As we can see, the system load was not high at that point in time, and there was plenty of memory.
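To put rough numbers on that, the following Python sketch divides the 1-minute load average by the CPU count reported in the LMON dump and converts the free memory figure to GiB. It assumes the M suffix means mebibytes and uses the common rule of thumb that a per-CPU load well below 1 indicates no CPU saturation; the helper names are hypothetical.

```python
import re

# Figures copied from the two dumps above.
LOADAVG_LINE = "loadavg : 3.94 3.80 3.99"
CPU_LINE = "CPU Total 16 Core 16 Socket 16"
SWAP_INFO = (
    "free_mem = 36949.53M rsv = 24548.22M "
    "alloc = 23576.62M avail = 45643.61M swap_free = 46615.21M"
)


def load_per_cpu(loadavg_line, cpu_line):
    """1-minute load average divided by the total CPU count."""
    one_minute = float(loadavg_line.split(":")[1].split()[0])
    cpus = int(re.search(r"CPU Total (\d+)", cpu_line).group(1))
    return one_minute / cpus


def mem_mib(swap_info, key):
    """Extract one 'key = <number>M' field (assumed to be MiB)."""
    return float(re.search(rf"{key} = ([\d.]+)M", swap_info).group(1))


ratio = load_per_cpu(LOADAVG_LINE, CPU_LINE)
free_gib = mem_mib(SWAP_INFO, "free_mem") / 1024

print(f"load per CPU: {ratio:.2f}")
print(f"free memory: {free_gib:.1f} GiB")
```

A per-CPU load of roughly a quarter and about 36 GiB of free memory are consistent with the reading that neither CPU nor memory pressure explains the hang.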
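This raises a question: why did the LMON process on this node hang? Judging from the log, it was unable to obtain the CF enqueue (wait event enq: CF - contention).
We know that the CKPT process periodically updates the controlfile, and it needs this enqueue to do so. So my bold assumption here is that CKPT was holding the CF enqueue
without releasing it, which left LMON unable to obtain it. With that in mind I searched MOS and found a bug, but unfortunately it is listed as already fixed in 11.2.0.3.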
Bug 10276173 LMON hang possible while trying to get access to CKPT progress record
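According to the bug description, during a reconfiguration LMON tries to access the CKPT progress record and waits on enq: CF - contention, which can lead to a hang.
Going by that note, this clearly does not match our actual situation.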
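Since the known bug does not apply, the CKPT hypothesis would have to be checked directly. On a live system, while the wait is still in progress, one way to do that is to list the holders and requesters of the CF enqueue from gv$lock. The Python sketch below only shows the classification step on already-fetched rows; the sample rows are hypothetical, made up for illustration, and do not come from this incident.

```python
# Query one might run while the wait is still in progress (not executed here).
CF_LOCK_SQL = """
SELECT inst_id, sid, type, lmode, request, ctime
  FROM gv$lock
 WHERE type = 'CF'
"""


def split_holders_waiters(rows):
    """Separate sessions holding the enqueue from those requesting it.

    In gv$lock, LMODE > 0 means the session holds the lock in that mode,
    and REQUEST > 0 means the session is waiting for that mode.
    """
    holders = [r for r in rows if r["lmode"] > 0]
    waiters = [r for r in rows if r["request"] > 0]
    return holders, waiters


# Hypothetical rows, for illustration only (not from the incident).
SAMPLE_ROWS = [
    {"inst_id": 1, "sid": 10, "type": "CF", "lmode": 6, "request": 0, "ctime": 120},
    {"inst_id": 1, "sid": 20, "type": "CF", "lmode": 0, "request": 6, "ctime": 89},
]

holders, waiters = split_holders_waiters(SAMPLE_ROWS)
for row in holders:
    print(f"holder: inst {row['inst_id']} sid {row['sid']} mode {row['lmode']}")
for row in waiters:
    print(f"waiter: inst {row['inst_id']} sid {row['sid']} wants mode {row['request']}")
```

Mapping a holder's sid back to a background process such as CKPT would need an additional lookup, for example against gv$session, which is left out of this sketch.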
Next, let's bring in the Node2 logs for a combined analysis: