|
This week I spent two days helping a customer successfully recover a 10.2.0.5 RAC (ASM) database of nearly 1.4 TB. The database crashed outright on March 4.
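As you can see, the database first reported redo log write errors and control file errors. The underlying cause was that the disk group had been dismounted. The database alert log shows the following: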

Tue Mar 04 18:09:59 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lgwr_15943.trc:
ORA-00345: redo log write error block 68145 count 5
ORA-00312: online log 6 thread 2: '+DATA/xxxx/onlinelog/o2_t2_redo3.log'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:09:59 CST 2014
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lmon_15892.trc:
ORA-00202: control file: '+DATA/xxxx/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x1f41e online=1 file=31 +DATA/xxxx/datafile/apps_ts_queues.310.692585175 error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5d online=1 file=51 +DATA/xxx/datafile/apps_ts_tx_data.353.692593409 error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_dbw2_15939.trc:
ORA-00202: control file: '+DATA/prod/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5b online=1 file=51 +DATA/prod/datafile/apps_ts_tx_data.353.692593409 error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014

With the database instance down, let's look at the alert log of the ASM instance:

Tue Mar 04 18:10:04 CST 2014
NOTE: SMON starting instance recovery for group 1 (mounted)
Tue Mar 04 18:10:04 CST 2014
WARNING: IO Failed. au:0 diskname:/dev/raw/raw5 rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0 status:2
WARNING: IO Failed. au:0 diskname:/dev/raw/raw5 rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0 status:2
NOTE: F1X0 found on disk 0 fcn 0.160230519
WARNING: IO Failed. au:33 diskname:/dev/raw/raw5 rq:0x60000000002d64f0 buffer:0x400405df000 au_offset(bytes):0 iosz:4096 operation:0 status:2
WARNING: cache failed to read gn 1 fn 3 blk 10752 count 1 from disk 2
ERROR: cache failed to read fn=3 blk=10752 from disk(s): 2
ORA-15081: failed to submit an I/O operation to a disk
NOTE: cache initiating offline of disk 2 group 1
WARNING: process 12863 initiating offline of disk 2.2526420198 (DATA_0002) with mask 0x3 in group 1
NOTE: PST update: grp = 1, dsk = 2, mode = 0x6
Tue Mar 04 18:10:04 CST 2014
ERROR: too many offline disks in PST (grp 1)
Tue Mar 04 18:10:04 CST 2014
ERROR: PST-initiated MANDATORY DISMOUNT of group DATA
Tue Mar 04 18:10:04 CST 2014
WARNING: Disk 2 in group 1 in mode: 0x7,state: 0x2 was taken offline
Tue Mar 04 18:10:05 CST 2014
NOTE: halting all I/Os to diskgroup DATA
NOTE: active pin found: 0x0x40045bb0fd0
Tue Mar 04 18:10:05 CST 2014
Abort recovery for domain 1
Tue Mar 04 18:10:05 CST 2014
NOTE: cache dismounting group 1/0xD916EC16 (DATA)
Tue Mar 04 18:10:06 CST 2014

As you can see, ASM raised ORA-15081, and immediately before that error it reported I/O failures against one of the disks, /dev/raw/raw5.
Attentive readers will notice that the disk was taken offline after the I/O failures, and in the end the disk group could not be mounted.
In our tests, kfed read could not read this disk, and dd against it failed as well. However, dd run directly against the physical disk behind it worked.
The disk group cannot be mounted, and the trace shows clearly that the disk header is corrupted:

WARNING: cache read a corrupted block gn=1 dsk=2 blk=1 from disk 2
OSM metadata block dump:
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.bl |