Backups causing server to drop network packets, eventually leading to network failure
First, a little history. I just migrated a NetBackup 7.1.03 master/media server from a Solaris 10 SPARC V440 to Red Hat 6.4 running NetBackup 220.127.116.11 on a Dell PowerEdge R720xd. I have a mixture of Windows, Solaris, and Red Hat Linux clients. I performed the migration around the end of May, and everything ran fine for the first week for all clients except one.
That client, a SunFire 6800 running Solaris 9 with the NetBackup 7.1.03 client, started to experience latency, to the point where it would eventually drop packets and fall off the network. This server runs RMAN as well as filesystem-level backups. I had not seen this issue before the migration from Solaris 10 to Red Hat. Even when the server cannot be reached via ssh or ftp, the backups never fail, but nothing else can connect to it (Oracle, ssh, etc.). If I get on the client console, the client's resources look fine and the server operates normally; the only problem is that it cannot ping out, and pings to it either drop packets or stop altogether.
If I run tcpdump on the master/media server, the last thing I see is the master talking to the client, with no reply from the client. I have checked with our network team: they see an increase in traffic, but not to the point of dropping packets, and there are no errors on the switch or NIC. Keep in mind this all worked with the Solaris 10 master/media server. Below are the things I have tried so far, with no effect:
- Physically swapped the cable on the server, thus eliminating a bad cable
- Physically switched to a new NIC and a different port on the switch, with a new cable
- Moved to a different IP, from the public NIC to a dedicated backup NIC
- Verified with the network team that no packets are being dropped at the switch
- Moved the SAN from active/active to active/passive (this was to fix the trespassing LUNs)
- Tried to reproduce the load by running seven 1 GB scp transfers simultaneously; the server did not even blink
- Tried running only one stream versus multiple streams
- Put in exclusions
I am at a loss at the moment. In my mind I know it's not a NetBackup issue and has to be something with the hardware, but I just can't find it. I am open to suggestions.
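To pin down exactly when the client stops answering relative to the backup jobs, a small probe could be left running on the master/media server during a backup window. The following is only a minimal sketch, not anything NetBackup provides: the function names `probe` and `monitor`, the hostname `client.example.com`, the choice of the ssh port (22), and the 5-second interval are all assumptions made for illustration. It logs a timestamped TCP connect time, or FAIL, for each attempt.

```python
import socket
import time
from datetime import datetime


def probe(host, port=22, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (or a subclass) on refusal,
        # timeout, or name-resolution failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def monitor(host, port=22, interval=5.0, count=None):
    """Probe repeatedly, print one timestamped line per attempt, return results.

    count=None means run until interrupted (Ctrl-C).
    """
    results = []
    attempts = 0
    while count is None or attempts < count:
        latency = probe(host, port)
        stamp = datetime.now().isoformat(timespec="seconds")
        status = "FAIL" if latency is None else f"{latency * 1000:.1f} ms"
        print(f"{stamp} {host}:{port} {status}", flush=True)
        results.append((stamp, latency))
        attempts += 1
        if count is None or attempts < count:
            time.sleep(interval)
    return results


# Example call (hypothetical hostname, runs until interrupted):
# monitor("client.example.com")
```

The timestamps of the first FAIL lines can then be compared against the backup job start times and the tcpdump capture to see how far into a job the client drops off.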