
Backups cause server to drop network packets, eventually leading to network failure

Created: 21 Jun 2013 • Updated: 09 Jul 2013 | 11 comments
This issue has been solved. See solution.

First, a little history. I just migrated a NetBackup 7.1.0.3 master/media server from Solaris 10 on a SPARC V440 to NetBackup 7.1.0.3 on Red Hat 6.4, running on a Dell PowerEdge R720xd. I have a mixture of Windows, Solaris, and Red Hat Linux clients. I performed the migration around the end of May. Everything ran fine for the first week for all clients except one. That client, running the NetBackup 7.1.0.3 client on Solaris 9 on a Sun Fire 6800, started to experience latency to the point where the server would eventually drop packets and fall off the network. This server runs RMAN as well as filesystem-level backups. Prior to the migration from Solaris 10 to Red Hat, I had not experienced this issue.

Even when the server cannot be reached via ssh or ftp, the backups never fail, but nothing can connect to the server (Oracle, ssh, etc.). However, if I get on the client console, the client resources are fine and the server operates normally. The only issue is that it cannot ping out, and nothing can ping the server without dropping packets, or pings stop altogether. If I run tcpdump on the master/media server, the last thing I see is the master talking to the client and the client not responding. I have checked with our network team: they see only an increase in traffic, not to the point of dropping packets, and there are no errors on the switch or NIC. Keep in mind this was all working with the Solaris 10 master/media server. Below are the things I have done to try to resolve the issue, with no effect:

  1. Physically switched the cable on the server, thus eliminating a bad cable
  2. Physically switched to a new NIC and a different port on the switch, with a new cable
  3. Moved to a different IP, from the public NIC to a dedicated backup NIC
  4. Verified with the network team that no packets are being dropped at the switch
  5. Moved the SAN from active/active to active/passive; this was to fix the trespassing LUNs
  6. Tried to reproduce the load by running seven 1 GB scp transfers simultaneously; the server did not even blink
  7. Tried running only one stream versus multiple streams
  8. Put in exclusions

I am at a loss at the moment. In my mind I know it's not a NetBackup issue and has to be something with the hardware, but I just can't find it. I am open to suggestions.

 

 


Nagalla's picture

Hi,

My first question: do the packet drops on ping and the server being inaccessible through Oracle, ssh, etc. happen only at the time of backups?

What is the network speed? Does it have a 10 Gb port or a 1 Gb port?

Did you get a chance to run the AppCritical test (a kind of network test) with the help of Symantec support, where Symantec support analyzes the test results and provides input?

As long as you see packet drops on the network, you need to push back to the network team and server team to get it fixed. There is deeper investigation that needs to be done by the server and network teams, who should come up with recommendations and suggestions for the backup team.

As backup people we don't have much scope at the server and network level, and NetBackup depends on those.

So I would say:

1) Check whether the packet drops happen only at the time of backup.

2) Get Symantec support to run the AppCritical test, take the results, and push back to the server and network teams.

 

Nicolai's picture

Are you using any sort of load balancing or port aggregation such as LACP?

Misconfigured port aggregation software can show symptoms like the ones you mention.

Assumption is the mother of all mess ups.

If this post answered your question, please mark it as a solution.

devans3428's picture

Thanks Nicolai, but no load balancing or port aggregation. Remember, this was working with no issues prior to the migration from Solaris to Red Hat. The one piece of hardware we have not changed or swapped is the fiber cables and fiber cards.

devans3428's picture

Some more history. I was performing some tests this morning with single-stream backups. Although the backups would complete, I noticed that ping times into and out of the client would go from 0.###ms with no backups running to upwards of 25.000ms. The backups were being performed on filesystems that reside on the SAN. When backing up local storage on the client, the ping times stayed at 0.###ms. So I can see how allowing multiple streams would cause the network to come to a halt. My next step is to check the SAN hardware (cables, fiber cards, switch). We did move from a slower SPARC V440 master/media server to a much faster Dell R720xd master/media server. The filesystems are also running VxVM 4.1, which I need to upgrade. However, again, all this was working prior to the migration.

Thanks for the comments and suggestions. I'll keep posting test results.
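
For anyone who wants to put numbers on the kind of ping comparison described above (idle versus during a backup), here is a minimal Python sketch. It assumes you have captured the text output of ping yourself and know how many probes were sent; the function name `summarize_ping` and the sample address are made up for illustration, and the pattern only looks for the `time=... ms` field that common ping implementations print on each reply line.

```python
import re

# Matches the round-trip time on reply lines such as
# "64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.321 ms".
# Assumption: each reply line carries a "time=<value> ms" field.
RTT_PATTERN = re.compile(r"time[=<]([0-9.]+)\s*ms")


def summarize_ping(output, sent):
    """Summarize captured ping output.

    output: raw text printed by ping
    sent:   number of probes that were sent (supplied by the caller)
    """
    rtts = [float(m.group(1)) for m in RTT_PATTERN.finditer(output)]
    received = len(rtts)
    loss_pct = 100.0 * (sent - received) / sent if sent else 0.0
    return {
        "received": received,
        "loss_pct": loss_pct,
        "avg_ms": sum(rtts) / received if received else None,
        "max_ms": max(rtts) if rtts else None,
    }


if __name__ == "__main__":
    # Hypothetical capture: four probes sent, one reply missing.
    sample = (
        "64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.25 ms\n"
        "64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.75 ms\n"
        "64 bytes from 10.0.0.5: icmp_seq=4 ttl=64 time=25.0 ms\n"
    )
    print(summarize_ping(sample, sent=4))
```

Running it once on a capture taken with no backups active and once on a capture taken mid-backup gives a loss percentage and maximum round-trip time that can be handed to the network team.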

Nicolai's picture

You are looking at a problem outside NetBackup.

Have the network admin check that hardware flow control is enabled on the switch and used by the clients. Also check whether a local firewall is blocking ICMP. ICMP is used as part of TCP flow control.


devans3428's picture

OK, I ran AppCritical and it does point to a network issue. One thing we did when we moved the master/media server from Solaris to Red Hat was move it to a different network subnet. I will be checking with the network team.

Nicolai's picture

A subnet move means traffic has to pass through a router. Maybe the router isn't up to the load.


devans3428's picture

Thanks, Nicolai, for posting suggestions and thoughts. Here is a recap and the latest.

The version is 7.1.0.3 on the master/media server as well as the clients. No upgrades have been performed, only the migration.

client_x (affected client) = Solaris 9, SPARC 6800

On or around June 5th, we migrated our master/media server from Solaris 10 (SPARC V440) to RHEL 6 (Dell R720xd). The migration also involved moving the master/media server to a new VLAN. After a successful migration, I was able to back up and restore to various client servers running a mixture of Windows, Solaris 9 and 10, and Red Hat. The clients are spread across two different VLANs: some on the same VLAN as the master/media server, others on a different VLAN. Again, all servers are backing up and restoring successfully except this one server, a SPARC 6800 running Solaris 9, which I will call client_x.

Before the migration, client_x was backing up and restoring successfully via both RMAN and filesystem-level backups. After the migration, we started to experience either delays or complete network failure on client_x when the backups kicked off. I could log on to client_x via the console, and the client appeared to have no resource issues. The only issue was that when I tried to ping the default gateway from client_x, I would either drop packets or get complete network failure. The same went for pinging client_x from a different server: the result was lost packets or complete network failure.

Now, there were no network changes involved on the client. I decided to look at the NIC even though it showed no errors or collisions, and also to check the cables, even though things were working just fine prior to the migration of the master/media server. I physically swapped to a different NIC and cable and even a different port on the switch. I also moved client_x to a different VLAN, and all of this still resulted in either packet loss or complete network failure on client_x.

I have patched Solaris 9 to the latest recommended OS cluster. client_x is running Symantec VxVM, and I have upgraded it to 5.1 SP1 RP3, so it is on a recent version of VxVM.

My network admin states he is not seeing any dropped packets. He does see an elevated amount of traffic, but not significant enough to cause any issues. Also, there are other servers being backed up on the same network with no issues. I have also run the backups during the quietest part of the day, so that the only backup running was the one for client_x, and I get the same results: packet loss or complete network failure.

I ran the AppCritical test provided by NetBackup support, and the results came back showing network congestion. I already knew that, so the question is what is causing the congestion.

I can see a direct correlation between the time the backups start reading and writing and the time when my ping times increase to the point of packet loss or complete network failure. I have bare metal restore checked in the policy, and this causes no issue. It is only when the data starts to be written from the individual streams from client_x that I experience the issue. As soon as I kill the backups, the network returns to normal function.

I have uninstalled and reinstalled the client.

I have another call scheduled with Symantec backup support. It is a case of having to prove that something besides NetBackup is causing this issue. I have tried to slam the network with congestion, to no avail: running 10 simultaneous 1 GB file transfers to client_x via scp/sftp, client_x didn't even blink.

devans3428's picture

OK, I believe I have cracked this mystery. We are backing up to an STU on the master/media server, and the file SIZE_DATA_BUFFERS_DISK contained the value 1048576, which is mentioned at this URL:

http://www.symantec.com/business/support/index?page=content&id=TECH35968

I backed that value down to the minimum, 262144, and there are no dropped packets and no network congestion. My RMAN backups are running fine with 2 streams, as is my full filesystem-level backup with 4 streams. I suppose this will slow down my backups dramatically. I hate to use this low a value for just one client, because the value affects all clients. I have not touched NUMBER_DATA_BUFFERS_DISK, which has a value of 16. My take is that because the new master/media server is capable of much larger, faster transactions, the client is not able to keep up.

My ping times to the client increase slightly when the backups start, but nothing like they did when the SIZE_DATA_BUFFERS_DISK file contained 1048576, and again no packets are lost.

The next step is to increase it to 524288 to see where the breaking threshold is.
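
For stepping through values such as 262144 and 524288, a small helper can make the change repeatable. This is only a sketch under stated assumptions: that the tuning files are plain text files holding a single integer, that they live in `/usr/openv/netbackup/db/config` on a UNIX media server (verify the location for your installation), and that SIZE_* values should be a multiple of 1024. The function names are hypothetical and are not part of any NetBackup tool.

```python
from pathlib import Path

# Assumed location of the tuning files on a UNIX media server;
# confirm this for your own installation before using it.
DEFAULT_CONFIG_DIR = "/usr/openv/netbackup/db/config"


def write_touch_file(name, value, config_dir=DEFAULT_CONFIG_DIR):
    """Write a single integer tuning value, e.g. SIZE_DATA_BUFFERS_DISK."""
    if value <= 0:
        raise ValueError("value must be a positive integer")
    # Assumption: buffer sizes are expected to be a multiple of 1024.
    if name.startswith("SIZE_") and value % 1024 != 0:
        raise ValueError("buffer size should be a multiple of 1024")
    path = Path(config_dir) / name
    path.write_text(f"{value}\n")
    return path


def read_touch_file(name, config_dir=DEFAULT_CONFIG_DIR):
    """Return the integer stored in a tuning file, or None if it is absent."""
    path = Path(config_dir) / name
    if not path.is_file():
        return None
    return int(path.read_text().strip())
```

For example, `write_touch_file("SIZE_DATA_BUFFERS_DISK", 524288)` followed by a test backup and a ping capture would cover the next step described above. Recording `read_touch_file(...)` before each change makes it easy to roll back.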

I will report back later.

Any other theories are welcome.

Thanks

Nicolai's picture

Thanks for the update. From my experience 262144 will do just fine; just increase NUMBER_DATA_BUFFERS to 128 or 256 and you will see much better performance.

From a test I did a long time ago:

Before tuning: 19MB/sec (default values)

Second tuning attempt: 49MB/sec (using 128K block size, 64 buffers)

Final result: 129MB/sec  (using 128K block size and 256 buffers)

It's the NUMBER_DATA_BUFFERS that does the magic :-)
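
To compare the settings mentioned in this thread, it can help to look at how much buffer memory each combination implies. The sketch below assumes the simplest possible model, buffer size times buffer count per stream, and ignores anything else that may scale the total (such as multiplexing or the number of concurrent streams). The function names are made up for illustration.

```python
def pool_bytes(buffer_size, buffer_count):
    """Approximate buffer memory for one stream: size times count."""
    return buffer_size * buffer_count


def to_mib(num_bytes):
    """Convert bytes to MiB."""
    return num_bytes / (1024 * 1024)


# (label, buffer size in bytes, number of buffers), all taken from the thread.
SETTINGS = [
    ("original: 1048576 x 16", 1048576, 16),
    ("lowered size: 262144 x 16", 262144, 16),
    ("lowered size, suggested count: 262144 x 128", 262144, 128),
    ("older test: 128K x 64", 128 * 1024, 64),
    ("older test: 128K x 256", 128 * 1024, 256),
]

if __name__ == "__main__":
    for label, size, count in SETTINGS:
        mib = to_mib(pool_bytes(size, count))
        print(f"{label}: {mib:.0f} MiB per stream")
```

Under this model, a smaller buffer size combined with a larger buffer count can hold as much or more data in flight than the original setting, while each individual transfer stays small.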


devans3428's picture

Thanks Nicolai. I have increased NUMBER_DATA_BUFFERS_DISK and NUMBER_DATA_BUFFERS to 128 for starters. I have moved the backups on the client to a NIC isolated from the public traffic. I still see a minimal increase in ping times on the backup NIC, and from time to time I may miss a packet. However, the network continues to function on both the public and backup NICs. So, in conclusion, the fix was to lower SIZE_DATA_BUFFERS_DISK to the minimum, 262144. I will mark this one resolved after a week of backups, to ensure nothing else creeps up. As a side note, I did increase the TCP sliding window on the Solaris client, but I don't believe increasing the window had any effect.

 

Thanks

SOLUTION