NetBackup 7.5 Duplication Failing after 2 Hours - Cisco ASA timeout?
Hi All,
We have an issue whereby our duplication is failing after exactly 2 hours.
We are duplicating across a Virgin Media managed WAN link, and the NBU appliances sit behind Cisco ASA firewalls. We suspect this is a TCP connection timeout issue.
The job details reported on the NBU appliance are as follows:
14/02/2013 13:45:25 - begin Duplicate
14/02/2013 13:45:25 - requesting resource LCM_nbu-xxxx_dedupe_stu
14/02/2013 13:45:25 - granted resource LCM_nbu-xxxx_dedupe_stu
14/02/2013 13:45:25 - started process RUNCMD (9872)
14/02/2013 13:45:25 - ended process 0 (9872)
14/02/2013 13:45:26 - requesting resource nbu-xxxx_dedupe_stu
14/02/2013 13:45:26 - reserving resource @aaabK
14/02/2013 13:45:26 - reserved resource @aaabK
14/02/2013 13:45:26 - granted resource MediaID=@aaabL;DiskVolume=PureDiskVolume;DiskPool=dp_disk_nbu-xxxx;Path=PureDiskVolume;StorageServer=nbu-xxxx;MediaServer=nbu-xxxx
14/02/2013 13:45:26 - granted resource nbu-xxxx_dedupe_stu
14/02/2013 13:45:27 - requesting resource @aaabK
14/02/2013 13:45:28 - Info Duplicate(pid=9872) Initiating optimized duplication from @aaabK to @aaabL
14/02/2013 13:45:28 - granted resource MediaID=@aaabK;DiskVolume=PureDiskVolume;DiskPool=dp_disk_nbu-yyyy;Path=PureDiskVolume;StorageServer=nbu-yyyy;MediaServer=nbu-xxxx
14/02/2013 13:46:09 - Info bpdm(pid=22962) started
14/02/2013 13:46:09 - started process bpdm (22962)
14/02/2013 13:46:11 - Info bpdm(pid=22962) requesting nbjm for media
14/02/2013 13:46:21 - begin writing
14/02/2013 13:46:24 - end writing; write time: 00:00:03
14/02/2013 13:46:26 - begin writing
14/02/2013 13:46:29 - end writing; write time: 00:00:03
14/02/2013 13:46:30 - begin writing
14/02/2013 13:46:34 - end writing; write time: 00:00:04
14/02/2013 13:46:35 - begin writing
14/02/2013 13:46:38 - end writing; write time: 00:00:03
14/02/2013 13:46:40 - begin writing
14/02/2013 13:46:43 - end writing; write time: 00:00:03
14/02/2013 13:46:44 - begin writing
14/02/2013 13:46:47 - end writing; write time: 00:00:03
14/02/2013 13:46:49 - begin writing
14/02/2013 13:46:52 - end writing; write time: 00:00:03
14/02/2013 13:46:53 - begin writing
14/02/2013 13:46:56 - end writing; write time: 00:00:03
14/02/2013 13:46:57 - begin writing
14/02/2013 13:47:00 - end writing; write time: 00:00:03
14/02/2013 13:47:01 - begin writing
14/02/2013 13:47:05 - end writing; write time: 00:00:04
14/02/2013 13:47:06 - begin writing
14/02/2013 13:47:09 - end writing; write time: 00:00:03
14/02/2013 13:47:10 - begin writing
14/02/2013 13:47:13 - end writing; write time: 00:00:03
14/02/2013 13:47:15 - begin writing
14/02/2013 13:47:17 - end writing; write time: 00:00:02
14/02/2013 13:47:19 - begin writing
14/02/2013 13:47:22 - end writing; write time: 00:00:03
14/02/2013 13:47:23 - begin writing
14/02/2013 13:47:26 - end writing; write time: 00:00:03
14/02/2013 13:47:27 - begin writing
14/02/2013 13:47:31 - end writing; write time: 00:00:04
14/02/2013 13:47:32 - begin writing
14/02/2013 13:47:35 - end writing; write time: 00:00:03
14/02/2013 13:47:36 - begin writing
14/02/2013 13:47:39 - end writing; write time: 00:00:03
14/02/2013 13:47:40 - begin writing
14/02/2013 13:47:44 - end writing; write time: 00:00:04
14/02/2013 13:47:45 - begin writing
14/02/2013 13:47:48 - end writing; write time: 00:00:03
14/02/2013 13:47:50 - begin writing
14/02/2013 13:47:53 - end writing; write time: 00:00:03
14/02/2013 13:47:55 - begin writing
14/02/2013 15:02:29 - end writing; write time: 01:14:34
14/02/2013 15:02:58 - Info bpdm(pid=22962) EXITING with status 0
14/02/2013 15:02:59 - Info nbu-xxxx(pid=22962) StorageServer=PureDisk:nbu-yyyy; Report=PDDO Stats for (nbu-yyyy): scanned: 534147084 KB, CR sent: 15703012 KB, CR sent over FC: 0 KB, dedup: 97.1%
14/02/2013 15:45:30 - Error bpduplicate(pid=9872) socket read failed: errno = 10054 - An existing connection was forcibly closed by the remote host.
14/02/2013 15:45:30 - Error bpduplicate(pid=9872) host nbu-xxxx backup id sap-enp5.cbc.int_1360782263 optimized duplication failed, file read failed (13).
14/02/2013 15:45:30 - Error bpduplicate(pid=9872) Duplicate of backupid sap-enp5.cbc.int_1360782263 failed, file read failed (13).
14/02/2013 15:45:30 - Error bpduplicate(pid=9872) Status = no images were successfully processed.
14/02/2013 15:45:30 - end Duplicate; elapsed time: 02:00:05
file read failed(13)
===========================================
A log entry from one of the firewalls around the time of the failure is as follows:
| 6 | Feb 14 2013 | 15:39:26 | 106015 | nbu2.x.y | 39775 | 10.x.x.x | 1556 | Deny TCP (no connection) from nbu2.x.y/39775 to 10.102.5.29/1556 flags ACK on interface xxxxxx |
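The timeout configuration of the two firewalls follows, one block per firewall, each starting at its arp timeout line. They differ only in the ARP timeout: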
arp timeout 3600
timeout xlate 3:00:00
timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02
timeout sunrpc 2:00:00 h323 2:00:00 h225 2:00:00 mgcp 0:05:00
timeout mgcp-pat 0:05:00 sip 0:30:00 sip_media 0:02:00
timeout sip-invite 0:03:00 sip-disconnect 0:02:00
timeout uauth 0:05:00 absolute
arp timeout 14400
timeout xlate 3:00:00
timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02
timeout sunrpc 2:00:00 h323 2:00:00 h225 2:00:00 mgcp 0:05:00
timeout mgcp-pat 0:05:00 sip 0:30:00 sip_media 0:02:00
timeout sip-invite 0:03:00 sip-disconnect 0:02:00
timeout uauth 0:05:00 absolute
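To put numbers on the suspicion, here is a minimal Python sketch that converts the "timeout conn" value above into seconds and compares it with the elapsed time of the failed job. The 2 hour keepalive idle time is an assumption (it is the usual OS default on Windows and Linux, not something read from these servers), and the helper names are made up for illustration.

```python
def hms_to_seconds(value: str) -> int:
    """Convert an h:mm:ss duration (as used in the ASA config) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def conn_idle_timeout(config_line: str) -> int:
    """Return the idle timeout in seconds from a 'timeout conn h:mm:ss ...' line."""
    tokens = config_line.split()
    return hms_to_seconds(tokens[tokens.index("conn") + 1])


# Line copied from the firewall config in this post.
asa_line = "timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02"
firewall_idle = conn_idle_timeout(asa_line)

# Elapsed time taken from the "end Duplicate" line of the job log.
job_elapsed = hms_to_seconds("02:00:05")

# ASSUMPTION: the OS sends its first keepalive probe after 2 hours of silence.
keepalive_idle = 2 * 3600

print(f"firewall drops idle connections after {firewall_idle} s")
print(f"job failed after {job_elapsed} s")
print(f"first keepalive probe (assumed) after {keepalive_idle} s")
if firewall_idle < keepalive_idle:
    print("an idle connection would be dropped before any keepalive is sent")
```

If that assumption holds, a connection that stays silent would be removed from the firewall's table an hour before the first keepalive probe goes out, which would be consistent with the "Deny TCP (no connection)" entry above.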
Comments
Hi Mark,
The crazy thing is we are now experiencing the same issue over a WAN link, except we are behind FortiGate 310B firewalls.
Also, from what I could gather, timeouts only apply to stale connections, not to active connections with data traversing them...
I could be wrong, but we've checked and rechecked our OS timeouts and firewalls, and supposedly they are not causing the problem.
The firewall settings can certainly cause this, even though the link appears to be active. A steady stream of data keeps the data connection alive, but the media servers report the progress of that stream over a separate connection, and it is this connection that gets broken.
So the data is flowing and OK, but when a progress report is expected, none arrives, so the stream is considered to have failed and the job is killed.
The servers themselves can also cause it through their keepalive settings. The appliances already have the keepalive time, interval and probes optimised, so they should be fine.
The Master Server may need its settings changed to match (how varies depending on the O/S). Your 2 hour setting can still affect things, though, because it is not the data flow that is the issue but the progress communications, which can cause it all to fail. Best to put the timeouts right up.
If the firewall settings are IP-specific, make sure that media server to media server and media server to master server traffic is all allowed with a generous timeout.
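As an illustration of the keepalive point above, here is a minimal Python sketch that turns on TCP keepalive for a socket and sets the idle time well below the firewall's 1 hour conn timeout. As far as the advice above implies, NetBackup relies on the OS-level keepalive settings, so this is not how you would configure NBU itself; it only shows the mechanism on a socket you control. The halving rule and the function names are assumptions for illustration.

```python
import socket


def keepalive_idle_for(firewall_idle_s: int) -> int:
    """Pick a keepalive idle time at half the firewall idle timeout (assumed rule)."""
    return firewall_idle_s // 2


def enable_keepalive(sock: socket.socket, idle_s: int, interval_s: int, probes: int) -> None:
    """Enable TCP keepalive and, where the platform allows it, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        # The probe count is not set through this call.
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle_s * 1000, interval_s * 1000))
    elif hasattr(socket, "TCP_KEEPIDLE"):
        # Linux: per-socket overrides of the system-wide keepalive settings.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # Other platforms: only SO_KEEPALIVE is set and the OS default timers apply.


# Firewall idle timeout of 1:00:00 from the config posted above.
idle = keepalive_idle_for(3600)

demo = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(demo, idle_s=idle, interval_s=60, probes=5)
print("keepalive enabled:", demo.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
demo.close()
```

With probes going out every 30 minutes of silence, an otherwise quiet control connection keeps refreshing its entry in the firewall's connection table instead of aging out.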
Hope this helps
Authorised Symantec Consultant