Informix restore hangs and then fails with status 5 or 2826; socket read failed: errno = 62 - Timer expired

Article:TECH177248  |  Created: 2011-12-19  |  Updated: 2011-12-20  |  Article URL http://www.symantec.com/docs/TECH177248
Article Type
Technical Solution


Issue



Some Informix restores are successful, others always fail after the media is mounted. 

There is a long delay once the media is mounted and position and the backup image does not seem to be read and transfered to the client.

 


Error



The Job Details show that the restore timed out after 5 minute.

10:59:33 - Info tar (pid=0) Restore started.
11:04:39 - Error bpbrm (pid=21587) socket read failed: errno = 62 - Timer expired
11:04:39 - Info bpbrm (pid=21572) sending media manager msg: STOP RESTORE myclient_1323161458
11:04:40 - Info bpbrm (pid=21572) media manager for backup id myclient_1323161458 exited with status 150: termination requested by administrator
11:04:40 - Info bpbrm (pid=21572) sending media manager msg: TERMINATE
11:04:42 - restored from image bnhsrvr1_1323161458; restore time: 0:05:17
11:04:44 - end Restore; elapsed time 0:05:34
Failed to get status code information (2826)



 

The infxbsa debug log shows a successful and timely restore of the rootdbs.


11:46:27.885 [24429] <4> RestoreFileObjects: INF - restoring File:</bnhods1/rootdbs/0>
...snipped data transfer...
11:48:25.223 [24429] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
11:48:35.227 [24429] <4> closeApi: INF - server EXIT STATUS = 0: the requested operation was successfully completed

Next onbar requests a restore of another dbspace.  Once infxbsa receives the tar header and first buffer from bptm, it disables verbosity (to keep the debug log from growing too rapidly) and waits for onbar to accept the buffer.  The log ends/hangs at this point until the job fails.  Onbar does not take the buffer.

11:48:48.927 [24429] <4> RestoreFileObjects: INF - restoring File:</bnhods1/firstdbs/0>
...snip...
11:49:29.820 [24429] <4> readTarHeader: INF - image to restore = 159 GBytes + 402954240 Bytes
....snip...
11:49:29.821 [24429] <4> VxBSAGetEnv: INF - entering GetEnv - NBBSA_VERBOSE_LEVEL
11:49:29.822 [24429] <4> VxBSAGetEnv: INF - returning - 5
11:49:29.822 [24429] <4> BSAGetData: Set verbosity to 0 for the next iterations of GetData
11:49:29.822 [24429] <4> VxBSASetEnv: INF - entering SetEnv - NBBSA_VERBOSE_LEVEL
11:49:29.822 [24429] <4> VxBSAGetEnv: INF - entering GetEnv - NBBSA_VERBOSE_LEVEL
11:49:29.822 [24429] <4> VxBSAGetEnv: INF - returning - 0

The bpbrm debug log confirms that the media was ready and that Client Read Timeout expired after 5 minutes due to lack of progress indication from the client.

11:49:23.478 [5199] <2> bpbrm read_media_msg: read from media manager: MEDIA READY
11:49:23.480 [5199] <2> bpbrm signal_bpbrm_child: sending SIGUSR1 to bpbrm child 5221
11:49:23.481 [5221] <2> Bpbrm_siginfo_print: 0: delay 1 signo SIGUSR2:17 code 0 pid 5199
11:49:23.482 [5221] <2> Bpbrm_siginfo_print: 1: delay 0 signo SIGUSR1:16 code 0 pid 5199
11:49:23.483 [5221] <2> bpbrm mm_sig: received ready signal from media manager
11:54:24.014 [5199] <2> bpbrm send_parent_msg: KEEP_ALIVE 1
11:54:24.117 [5199] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
11:54:24.119 [5199] <2> bpbrm multiplexed_restore: bpbrm.c.16766: keep-alive acknowleged: 0 0 0x00000000
11:54:33.500 [5221] <16> bpbrm readline: socket read failed: errno = 62 - Timer expired

 


Environment



NetBackup 6.0, 6.5, 7.0 (any version can be affected)
Solaris 10 (any platform can be affected)
The dbspace volumes were on an NFS mount point

 


Cause



When the restore starts, onbar requests that NetBackup mount the media while the Informix oninit process is initializing the dbspace's chunks.  During the initialization process, the Informix restore log shows a 4 minute delay preparing the destination filespace for each chunk.

11:47:58  The chunk path (/usr/informix/rootdbs_ch1.p:0) is renamed to new chunk path (/rinformix/chunks/rootdbs_ch1.p:0)
11:49:29  The chunk path (/usr/informix/firstdbs_ch1.p:0) is renamed to new chunk path (/rinformix/chunks/firstdbs_ch1.p:0)
11:53:41  The chunk path (/usr/informix/firstdbs_ch2.p:0) is renamed to new chunk path (/rinformix/chunks/firstdbs_ch2.p:0)
...snip 74 lines...
17:05:39  The chunk path (/usr/informix/firstdbs_ch77.p:0) is renamed to new chunk path (/rinformix/chunks/firstdbs_ch77.p:0)
17:09:29  The chunk path (/usr/informix/firstdbs_ch78.p:0) is renamed to new chunk path (/rinformix/chunks/firstdbs_ch78.p:0)
17:13:27  The chunk path (/usr/informix/firstdbs_ch79.p:0) is renamed to new chunk path (/rinformix/chunks/firstdbs_ch79.p:0)

This restore requires 80 chunks.  The total time taken for this step was 5 hours 25 minutes.  During that period, Informix does not communicate any progress to Netbackup.  This causes the Netbackup bpbrm process to timeout when the media server client read timeout expires as there doesn't appear to be any forward progress and the storage resources should be released for use by other jobs.

In this case, Informix was restoring the non-root dbspaces to non-local volumes that were NFS mounted.  The network segment to the remote volumes was heavily congested, causing the initialization of the chunks to take hours instead of seconds.
 


Solution



If the write performance to the destinations volumes cannot be improved, increase the Client Read Timeout on the media server to a value long enough retain the NetBackup resources until the chunk initialization is complete.

 




Article URL http://www.symantec.com/docs/TECH177248


Terms of use for this information are found in Legal Notices