11.2.0.3 RAC (VCS) node crash and hang analysis (Part 5)
ue to high load. Previously detected and output hangs are not displayed again.
Instead, the 'Victim Information' section will indicate that the victim is from
an 'Existing Hang' under the 'Previous Hang' column.

'Verified Hangs' below indicate one or more hangs that were found and identify
the final blocking session and instance on which they occurred. Since the
current hang resolution state is 'PROCESS', any hangs requiring session or
process termination will be automatically resolved.

Any hangs with a 'Hang Resolution Action' of 'Unresolvable' will be ignored.
These types of hangs will either be resolved by another layer in the RDBMS or
cannot be resolved as they may require external intervention.

Deadlocks (also named cycles) are currently NOT resolved even if hang
resolution is enabled. The 'Hang Type' of DLCK in the 'Verified Hangs' output
identifies these hangs.

Below that are the complete hang chains from the time the hang was detected.

The following information will assist Oracle Support Services in further
analysis of the root cause of the hang.

*** 2014-04-22 17:22:01.537
Verified Hangs in the System
                     Root       Chain Total Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
     2 HANG VICSELTD    2   833     2     2   HIGH  LOCAL IGNRD:InstKillNotA
Hang Ignored Reason: Since instance termination is not allowed, automatic
              hang resolution cannot be performed.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      2    291 59157     10646  M000 DFS lock handle                 ---- note the sid, ser# and PrcNm here
      2    833     1     13788  CKPT control file sequential read
The M000 process that appears here should be familiar: it is the MMON slave associated with AWR snapshots. This process is in fact being blocked by CKPT, and we can take a closer look at its state, as follows:
*** 2014-04-22 17:27:00.778
Process diagnostic dump for oracle@xhdb-server4 (M000), OS id=10646,
pid: 57, proc_ser: 143, sid: 291, sess_ser: 59157
-------------------------------------------------------------------------------
current sql: select tablespace_id, rfno, allocated_space, file_size,
  file_maxsize, changescn_base, changescn_wrap, flag, inst_id
  from sys.ts$, GV$FILESPACE_USAGE
  where ts# = tablespace_id and online$ != 3
    and (changescn_wrap > PITRSCNWRP
         or (changescn_wrap = PITRSCNWRP and changescn_base >= PITRSCNBAS))
    and inst_id != :inst
    and (changescn_wrap > :w or (changescn_wrap = :w and changescn_base >= :b))
Current Wait Stack:
 0: waiting for 'DFS lock handle'
    type|mode=0x43490005, id1=0xa, id2=0x2
    wait_id=6 seq_num=7 snap_id=1
    wait times: snap=6 min 12 sec, exc=6 min 12 sec, total=6 min 12 sec
    wait times: max=infinite, heur=6 min 12 sec
    wait counts: calls=818 os=818
    in_wait=1 iflags=0x15a2
There is at least one session blocking this session.
  Dumping 2 direct blocker(s):
    inst: 2, sid: 833, ser: 1
    inst: 1, sid: 482, ser: 1
  Dumping final blocker:
    inst: 2, sid: 833, ser: 1          ----- the final blocker is sid 833, i.e. the CKPT process on Node2
There are 1 sessions blocked by this session.
  Dumping one waiter:
    inst: 1, sid: 581, ser: 36139
    wait event: 'DFS lock handle'
      p1: 'type|mode'=0x43490005
      p2: 'id1'=0xa
      p3: 'id2'=0x5
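A side note on what this 'DFS lock handle' wait actually represents: p1 (type|mode = 0x43490005) encodes the lock type in its two high-order bytes and the requested mode in the low-order word. Below is a minimal sketch of the commonly used decode, applied to the literal value from this trace; the hex literal, the inline view and the column aliases are mine, not part of the trace output.

-- Decode the 'DFS lock handle' p1 value seen in the trace (0x43490005).
-- High-order two bytes = lock type in ASCII, low-order word = mode.
select chr(bitand(p1, -16777216) / 16777215) ||
       chr(bitand(p1, 16711680) / 65535)      as lock_type,   -- expected: 'CI'
       bitand(p1, 65535)                       as lock_mode    -- expected: 5
  from (select to_number('43490005', 'XXXXXXXX') as p1 from dual);

0x43 and 0x49 are the ASCII codes for 'C' and 'I', so the handle here should decode to a CI (cross-instance call) enqueue in mode 5, which would be consistent with M000 issuing a cross-instance request that a background process, in this case CKPT, has to service.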
From this output, the root session is again 833, i.e. the CKPT process on our Node2. At this point some readers might conclude that the cause of the problem is now clear: an anomaly in Node2's CKPT led to problems with Node2's LMON process, and since LMON must communicate with Node1's LMON, Node1's LMON ended up reporting the IPC send timeout.
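Had the instances still been responsive enough to accept queries at the time, the same chain could in principle have been cross-checked from SQL rather than only from the DIA0 trace. A minimal sketch, assuming GV$SESSION was accessible and that the FINAL_BLOCKING_* columns are available in this 11.2 release, reusing the sid reported above:

-- Sketch: walk the live blocking chain and include the reported final blocker (Node2 CKPT, sid 833).
select inst_id, sid, serial#, program, event,
       blocking_instance, blocking_session,
       final_blocking_instance, final_blocking_session,
       seconds_in_wait
  from gv$session
 where blocking_session is not null
    or (inst_id, sid) in ((2, 833))
 order by final_blocking_instance, final_blocking_session;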
In reality it is not that simple: from start to finish we never fully worked out why the CKPT process waited for such a long time.
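One way to dig further after the fact, assuming the ASH/AWR samples for that window were retained, is to pull the CKPT session's samples out of DBA_HIST_ACTIVE_SESS_HISTORY and see which waits it was actually stuck on; the sid/serial# and the time window below are taken from the trace, everything else is my own illustration.

-- Sketch: what Node2's CKPT (sid 833, serial# 1) was waiting on around the hang.
select sample_time, session_state, event, p1, p2, p3, time_waited
  from dba_hist_active_sess_history
 where instance_number = 2
   and session_id      = 833
   and session_serial# = 1
   and sample_time between timestamp '2014-04-22 17:00:00'
                       and timestamp '2014-04-22 17:30:00'
 order by sample_time;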
At this point I could not help suspecting an I/O problem. Going back over Node1's diag trace, I noticed something interesting: