网络故障造成备库standby logfile的损坏

某生产库exadata,oracle 11.2.0.4,客户说物理备库与主库不能进行同步,如是要客户把生产库与备库的alert_sid.log文件打包传给我。查看了一个主库的alert.log日志文件发现从15号开始网络就断断续续的,信息如下:

Wed Jun 15 14:25:48 2016


***********************************************************************

Fatal NI connect error 12543, connecting to:
 (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.137.8.101)(PORT=1521)))

(CONNECT_DATA=(SERVICE_NAME=syyjm)(CID=(PROGRAM=oracle)(HOST=yb01dbadm01.dnsserver)

(USER=oracle))))

  VERSION INFORMATION:
	TNS for Linux: Version 11.2.0.4.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 15-JUN-2016 14:25:48
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12543
    
TNS-12543: TNS:destination host unreachable
    ns secondary err code: 12560
    nt main err code: 513
    
TNS-00513: Destination host unreachable
    nt secondary err code: 113
    nt OS err code: 0


***********************************************************************

Fatal NI connect error 12543, connecting to:
 (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.137.8.101)(PORT=1521)))

(CONNECT_DATA=(SERVICE_NAME=syyjm)(CID=(PROGRAM=oracle)(HOST=yb01dbadm01.dnsserver)

(USER=oracle))))

  VERSION INFORMATION:
	TNS for Linux: Version 11.2.0.4.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 15-JUN-2016 14:25:51
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12543
    
TNS-12543: TNS:destination host unreachable
    ns secondary err code: 12560
    nt main err code: 513
    
TNS-00513: Destination host unreachable
    nt secondary err code: 113
    nt OS err code: 0
Wed Jun 15 14:26:02 2016
...省略...

对应的备库这个时间点的alert.log日志文件信息如下:

Wed Jun 15 14:24:47 2016


***********************************************************************

Fatal NI connect error 12170.
Wed Jun 15 14:24:47 2016


***********************************************************************

  VERSION INFORMATION:
	TNS for Linux: Version 11.2.0.4.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

Fatal NI connect error 12170.
  Time: 15-JUN-2016 14:24:47

  VERSION INFORMATION:
	TNS for Linux: Version 11.2.0.4.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Tracing not turned on.
  Time: 15-JUN-2016 14:24:47
  Tns error struct:
  Tracing not turned on.
    ns main err code: 12535
  Tns error struct:
    
    ns main err code: 12535
    
TNS-12535: TNS:operation timed out
TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    ns secondary err code: 12560
    nt main err code: 505
    nt main err code: 505
    
    
TNS-00505: Operation timed out
TNS-00505: Operation timed out
    nt secondary err code: 110
    nt secondary err code: 110
    nt OS err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.137.8.73)(PORT=12282))
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.137.8.73)(PORT=12091))
RFS[4]: Possible network disconnect with primary database
RFS[5]: Possible network disconnect with primary database
Wed Jun 15 14:30:55 2016
Killing 1 processes with pids 25908 (idle RFS by thread/sequence) in order to allow the 

in-transit gap to be queued. Requested by OS process 58349
Wed Jun 15 14:30:57 2016
RFS[10]: Assigned to RFS process 58352
RFS[10]: Selected log 8 for thread 1 sequence 40085 dbid 643576469 branch 867103448
Wed Jun 15 14:30:58 2016
Archived Log entry 3997 added for thread 1 sequence 40085 ID 0x265be592 dest 1:
Wed Jun 15 14:30:59 2016
Primary database is in MAXIMUM PERFORMANCE mode
RFS[11]: Assigned to RFS process 58363
RFS[11]: Selected log 7 for thread 1 sequence 40087 dbid 643576469 branch 867103448
RFS[10]: Selected log 8 for thread 1 sequence 40086 dbid 643576469 branch 867103448
Wed Jun 15 14:31:01 2016
Killing 1 processes with pids 25910 (idle RFS by thread/sequence) in order to retry 

receiving a log after reattaching. Requested by OS process 58367
Wed Jun 15 14:31:01 2016
Archived Log entry 3998 added for thread 1 sequence 40086 ID 0x265be592 dest 1:
RFS[12]: Assigned to RFS process 58367
RFS[12]: Selected log 10 for thread 2 sequence 

35718 dbid 643576469 branch 867103448

从信息RFS[4]: Possible network disconnect with primary database可以看到备库不能与主库进行连接。从信息Killing 1 processes with pids 25908 (idle RFS by thread/sequence) in order to allow the in-transit gap to be queued. Requested by OS process 58349,可知因为不能从主库接受归档日志,且超时而被进程58349给kill掉了,而58349进程是另一个RFS进程用于接受主库所传输过来的归档日志。也就是说,因为网络连接出现了问题,RFS进程在不断的kill,restart重复这样的操作

在2016-06-15 14:34:25这个时间点由于网络恢复日志传输恢复正常,但在2016-05-15 16:59:00这个时间点网络又断了,不能进行日志传输

Wed Jun 15 14:34:25 2016
ARC3: Standby redo logfile selected for thread 1 sequence 40085 for destination 

LOG_ARCHIVE_DEST_2
Wed Jun 15 14:34:27 2016
Thread 1 advanced to log sequence 40087 (LGWR switch)
  Current log# 1 seq# 40087 mem# 0: +DATAC1/yyjm/onlinelog/group_1.261.867103449
Wed Jun 15 14:34:27 2016
Archived Log entry 139166 added for thread 1 sequence 40086 ID 0x265be592 dest 1:
Wed Jun 15 14:34:27 2016
LNS: Standby redo logfile selected for thread 1 sequence 40087 for destination 

LOG_ARCHIVE_DEST_2
ARC3: Standby redo logfile selected for thread 1 sequence 40086 for destination 

LOG_ARCHIVE_DEST_2
Wed Jun 15 15:58:19 2016
Thread 1 advanced to log sequence 40088 (LGWR switch)
  Current log# 2 seq# 40088 mem# 0: +DATAC1/yyjm/onlinelog/group_2.262.867103449
Wed Jun 15 15:58:19 2016
LNS: Standby redo logfile selected for thread 1 sequence 40088 for destination 

LOG_ARCHIVE_DEST_2
Wed Jun 15 15:58:20 2016
Archived Log entry 139173 added for thread 1 sequence 40087 ID 0x265be592 dest 1:
Wed Jun 15 16:59:00 2016

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
	TNS for Linux: Version 11.2.0.4.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 15-JUN-2016 16:59:00
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505
    
TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.137.32.4)(PORT=1222))
Wed Jun 15 17:05:01 2016

从备库日志信息来看,备库其实在2016-06-15 16:08:43这个时间点网络就已经断开了

Wed Jun 15 16:08:43 2016
RFS[9]: Possible network disconnect with primary database
Wed Jun 15 16:08:43 2016
RFS[6]: Possible network disconnect with primary database
Wed Jun 15 17:56:36 2016

这个DG环境的网络时断时连的情况从2016-06-15一直到2016-06-19号,从备库的日志信息可以看到如下信息,不能在备库创建归档日志文件,从信息RFS[23]: No standby redo logfiles available for thread 1 与RFS[24]: No standby redo logfiles available for thread 2,以及ORA-00312: online log 9 thread 1: ‘/data/syyjm/datafile/standy03’可知因为备库的standby log的standy03出现了损坏。

Sun Jun 19 12:15:21 2016
RFS[23]: Assigned to RFS process 18969
RFS[23]: No standby redo logfiles available for thread 1 
Creating archive destination file : /arch/syyjm/1_40277_867103448.dbf (12905 blocks)
Sun Jun 19 12:15:56 2016
Archiver process freed from errors. No longer stopped
Sun Jun 19 12:15:56 2016
Primary database is in MAXIMUM PERFORMANCE mode
Re-archiving standby log 11 thread 2 sequence 36153
RFS[24]: Assigned to RFS process 18976
RFS[24]: No standby redo logfiles available for thread 2 
...省略...
Sun Jun 19 12:21:03 2016
Unable to create archive log file '/arch/syyjm/1_40254_867103448.dbf'
ARC3: Error 19504 Creating archive log file to '/arch/syyjm/1_40254_867103448.dbf'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance syyjm - Archival Error
ORA-16038: log 9 sequence# 40254 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03'
Sun Jun 19 12:21:03 2016
...省略...
Sun Jun 19 12:31:03 2016
Suppressing further error logging of LOG_ARCHIVE_DEST_1.
Suppressing further error logging of LOG_ARCHIVE_DEST_1.
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance syyjm - Archival Error
ORA-16038: log 9 sequence# 40254 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03'
Sun Jun 19 12:31:03 2016
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance syyjm - Archival Error
ORA-16014: log 9 sequence# 40254 not archived, no available destinations
ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03'
...省略...
Wed Jun 22 11:22:11 2016
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance syyjm - Archival Error
ORA-16038: log 9 sequence# 40254 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03'
Wed Jun 22 11:22:11 2016
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance syyjm - Archival Error
ORA-16014: log 9 sequence# 40254 not archived, no available destinations
ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03'

从主库alert日志文件也可以看到与不能进行对远程目录进行归档操作

Wed Jun 22 11:05:22 2016
Thread 1 advanced to log sequence 41141 (LGWR switch)
  Current log# 2 seq# 41141 mem# 0: +DATAC1/yyjm/onlinelog/group_2.262.867103449
FAL[server, ARC0]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance yyjm1 - Archival Error. Archiver continuing.
FAL[server, ARC1]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance yyjm1 - Archival Error. Archiver continuing.
Wed Jun 22 11:05:23 2016
Archived Log entry 142098 added for thread 1 sequence 41140 ID 0x265be592 dest 1:

通过主库与备库的alert.log文件信息可以清楚的知道原因是因为DG环境的网络时断时连引起了备库的RFS进程在重复kill,restart操作过程中损坏了standby log文件,解决方法很简单就是重建备库被损不的standby log文件。