Backup job is failing with error 24

Created: 26 Jan 2012 | 16 comments

Hi,

The backup job has been failing for a long time, and I have not been able to find the root cause. I need your help:

Master server: NBU ver 7.0.1/ OS Windows 2008 R2

Media server: NBU ver 7.0.1/ OS Windows 2003

Client: NBU ver 7.0.1/ OS Windows 2003.

Policy details have been attached.

Failed job details have been attached.

NIC details: Media server: 100 Mb/s full duplex / Client: 1 Gb/s full duplex.

Attached bpbkar log from client and bptm log from media server. Attached logs for two days just in case.

Some more details: The client has C:\, D:\, E:\, F:\, G:\ and I:\ drives. Two policies have been created (not by me). One is for all drives except the E:\ drive, and the other is for the E:\ drive, with its folders split up in the backup selection. When the policy runs, it starts 11 streams (child jobs) and 4 of them fail, almost always on the same folders.


MKT's picture

Looks like a pretty standard disconnect according to the bpbkar logs.

I'd start by disabling TCP chimney/offload:
http://www.symantec.com/docs/TECH60844

 

Amaan's picture

Thanks, I will go through the TN and let you know the results.

nesel's picture

Hi Amaan,

> Does the hostname resolve to the correct IP address on all parties (master server, media server and client)?

> Is the NBU master/media server able to connect to the client? (Host Properties > Clients > (find host))
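
A quick way to answer the first question on each machine is a forward and reverse lookup. The following is a minimal Python sketch, not a NetBackup tool: the function names are hypothetical, and it only compares the short (unqualified) host names, which is an assumption about how your hosts are named.

```python
import socket


def short_name(name):
    """Return the unqualified, lower-case part of a host name."""
    return name.split(".")[0].lower()


def names_match(expected, reported):
    """True when two host names agree once domain suffixes are ignored."""
    return short_name(expected) == short_name(reported)


def check_resolution(hostname):
    """Forward-resolve hostname, reverse-resolve the IP, and compare.

    Returns (ip, reverse_name, consistent). reverse_name is None when
    the reverse lookup fails.
    """
    ip = socket.gethostbyname(hostname)
    try:
        reverse_name = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        reverse_name = None
    consistent = reverse_name is not None and names_match(hostname, reverse_name)
    return ip, reverse_name, consistent
```

Run check_resolution with the client name on the master and media server, and with the server names on the client. A result where consistent is False points at a DNS or hosts file mismatch worth looking into.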

 

Thanks

mph999's picture

 

1.  How many clients have this error 
2.  Did this client previously work
3.  What was changed 
4.  Does it write some data then fail
5.  Does it fail at the very beginning of the job
6.  Does it always fail at the same point
7.  Operating system of client
8.  Operating system of media server
9.  NetBackup version
10. Logs from media server - bptm and bpbrm, from client bpbkar, bpcd
11. From the media server - output of bptestbpcd -host <client name> -verbose
 
In my experience, Status 24 is hardly ever NBU (in fact, I don't think I have ever seen a status 24 failure caused by NetBackup myself)
 
Something below normally fixes it ...  Yes, it is a lot to read, and will probably take a number of hours to go through.

If this is a Windows client, a very common cause is the TCP Chimney settings  - http://www.symantec.com/docs/TECH55653

I have given a number of technotes below (the odd one may be 'internal' only), and have shown a summary of the solutions, as well as the odd extra note.
 
 
http://www.symantec.com/docs/TECH76201
 
Possible solution to Status 24 by increasing TCP receive buffer space 
 
http://www.symantec.com/docs/TECH34183 
This technote, although written for Solaris, shows how TCP tunings can 
cause status 24s. I am sure your system admins will be aware of the 
corresponding setting for the Windows operating system. 

http://www.symantec.com/docs/TECH55653 
This technote is very important. It covers many issues that can 
occur when either TCP Chimney Offload, TCP/IP Offload Engine (TOE) or TCP 
Segmentation Offload (TSO) are enabled. It is recommended to disable 
these, as per the technote. 
 
I also understand that we have previously seen MS Patch KB92222 resolve status 24 issues.
 
 
 
http://www.symantec.com/docs/TECH150369
A write operation to a socket failed. These are possible causes for this issue:
 
A high network load.
Intermittent connectivity.
Packet reordering.
Duplex Mismatch between client and master server NICs.
Small network buffer size
 
 
http://support.microsoft.com/kb/942861 
SOLUTION/WORKAROUND:
Contact the hardware vendor of the NIC for the latest updates for their product as a resolution.
 
This problem occurs when the TCP Chimney Offload feature is enabled on the NetBackup Windows 2003 client.  Disable this feature to work around this problem.
 
To do this, at a command prompt, enter the following:
Netsh int ip set chimney DISABLED
 
 
 
http://www.symantec.com/docs/TECH127930
The above messages almost always indicate a networking issue of some sort. In this case it was due to a faulty switch. There are rare occasions when the above messages are not caused by a networking issue, such as those addressed in http://www.symantec.com/docs/TECH72115. 
 
(TECH72115 is not relevant to you, this was an issue with a SAN client, fixed in 6.5.4)
 
But note, the technote says the issue is 'almost always' network related. This can also include operating system settings.
 
 
http://www.symantec.com/docs/TECH145223
The issue was that the idle timeout setting on the firewall was too low to allow backups and/or restores to complete. With the DMZ media server backing up a DMZ client, the media server sends only occasional metadata updates back to the master server in order to update the image catalog. If that TCP socket connection between the media server and master server is idle for longer than the firewall's idle timeout, the firewall breaks the connection between the media server and master server, and the media server in turn breaks the connection to the client, producing the socket error.
Increasing the idle timeout setting on the firewall to a value larger than the time a typical backup takes to complete should resolve the issue.
Increasing the frequency of the TCP keepalive packets from the server's defaults can also help maintain the socket during idle periods.
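
To illustrate what more frequent TCP keepalive packets mean at the socket level, here is a minimal Python sketch. It is not how NetBackup is configured: NetBackup's own connections follow the operating system's keepalive defaults, which are changed through OS settings as the technote describes. The function name enable_keepalive and the 60 second / 10 second values are hypothetical choices for illustration only.

```python
import socket
import sys


def enable_keepalive(sock, idle_s=60, interval_s=10):
    """Turn on TCP keepalive for sock and shorten its timers where possible.

    idle_s is how long the connection may be idle before the first probe,
    interval_s is the gap between probes. Both values are illustrative.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if sys.platform == "win32":
        # Windows takes (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle_s * 1000, interval_s * 1000))
    else:
        # These per-socket options are not available on every platform.
        if hasattr(socket, "TCP_KEEPIDLE"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        if hasattr(socket, "TCP_KEEPINTVL"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    return sock
```

The point of the sketch is the relationship between the two timers: the idle time before the first probe has to be shorter than the firewall's idle timeout, otherwise the firewall drops the connection before any keepalive packet is sent.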
 
 
Although you may not have a firewall between the client and the media server, this solution is another demonstration that the issue is network related, as opposed to NetBackup.
 
 
http://www.symantec.com/docs/S:TECH130343  (Internal technote)
 
The issue was found to be due to network congestion at the NIC (that is, the network was overloaded).
 
 
 
http://www.symantec.com/docs/TECH135924  (I think this one I sent previously, shows the MS fix for the issue)
 
In this instance, the problem was isolated to this single machine making the point of failure isolated to the problematic new host.
 
If the problem is due to an unidentified corruption / misconfiguration in the new media server's TCP Stack and Winsock environment (as was the case in this example), executing these two commands, followed by a reboot will resolve the problem:
 
netsh int ip reset resetlog.txt   Microsoft Reference:  http://support.microsoft.com/kb/299357 
netsh winsock reset catalog    Microsoft Reference:  http://technet.microsoft.com/en-us/library/cc759700(WS.10).aspx 
 
NOTE: The above two commands will reset the Windows TCP Stack as well as the Windows Winsock environment back to the default values.  This means that if the host is configured with a static IP Address and other customized TCP settings, they will be lost and will need to be re-entered after the reboot.  The default TCP setting is to use DHCP and the host will be using DHCP upon booting up.
 
 
 
There are 2 possible NBU-related issues that could cause this:
 
1.  Client NBU version is higher than the media server
2.  Make sure the communications buffer is not too high (http://www.symantec.com/docs/TECH60570)
 
 
What to do next:
 
 
 
http://www.symantec.com/docs/TECH135924  (mentioned before, MS suggested fix)
http://www.symantec.com/docs/TECH60570  (communications buffer, mentioned above)
http://www.symantec.com/docs/TECH60844
 
 
If these do not resolve the situation, I would recommend you talk with the operating system vendor.  In summary, apart from the client software version and the communication buffer size (set in Host Properties), I can find no other cause that could be NBU.  However, from the very detailed research I have done, I can find many causes that are network or operating system related.
 
Regards,
 
Martin

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Amaan's picture

Thanks Martin,

I will go through each and every one of these TNs. Will try and share the results.

Some of your questions have been answered in the opening post. I will answer the others later.

Amaan's picture

I went through all the above-mentioned TNs. None of them helped. I have opened a case with Symantec and am working on that. Will keep you updated.

Marianne's picture

I just had another look at your opening post:

NIC details: Media server: 100 Mb/s full duplex / Client: 1 Gb/s full duplex.

You need a 1 Gb/s NIC in the media server as a minimum!
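
To put a rough number on that, here is a small Python sketch of the arithmetic. The 500 GB data size and the 70% usable-bandwidth factor are assumptions for illustration, not figures from this thread, and the function name is hypothetical.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.7):
    """Estimate hours needed to move data_gb gigabytes over a link.

    link_mbps is the nominal link speed in megabits per second.
    efficiency is the assumed fraction of that speed usable for payload.
    """
    usable_mb_per_s = link_mbps / 8 * efficiency  # megabytes per second
    seconds = data_gb * 1000 / usable_mb_per_s    # 1 GB taken as 1000 MB
    return seconds / 3600


if __name__ == "__main__":
    for speed in (100, 1000):
        print(f"{speed} Mb/s: {transfer_hours(500, speed):.1f} hours for 500 GB")
```

Under those assumptions, 500 GB needs roughly 15.9 hours over a 100 Mb/s link against about 1.6 hours over 1 Gb/s. All streams that run at the same time share that one media server link, so each of them sees only a fraction of it.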

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

mph999's picture

Amaan,

Hmm, one of the above normally fixes it ... Marianne has a good point, a 100M NIC on a media server is madness.

The case should have been opened with the operating system vendor - not Symantec - Status 24 errors are network / operating system related.  I am unsure why Symantec is expected to solve non-NetBackup issues ...

When troubleshooting a problem, it makes sense to get those on board who have the best information / experience with the problem. In the case of network issues, and in particular Status 24, as mentioned already, this is very unlikely to be NetBackup.  In fact, I don't think I have ever seen a status 24 caused by NetBackup personally ...   Therefore, the most likely cause of the issue will be the network or operating system, hence the suggestion to log the call with those who are 'responsible' for the 'part' most likely causing the problem.

Martin

 

 
Amaan's picture

I agree with you Marianne. When I first saw this issue I thought the same as you, but after posting this I didn't get anything like that from our experts, so I decided that I might be wrong :)

In this case I will try changing the client to 100 Mb/s full duplex and see if the backup completes successfully.

thanks!

Marianne's picture

I see a major issue with a media server that has a 100BaseT NIC. How old is this media server if it has a NIC that old? My last 3 laptops (replaced every 3 years) all had 1 Gb NICs. Which poses the question - how old is the firmware on the NIC? How old is the driver?

Why not motivate for a 1 Gb NIC? Or even better - if Data Protection is important to Business - a new media server.


Amaan's picture

I found that compression is enabled on the same drive (E:\). Could this cause the issue? If compression is enabled, how does NetBackup get the files? Does it get them compressed, or are they uncompressed first and then sent to storage?

Yasuhisa Ishikawa's picture

When Compression is checked in the policy attributes (the client compression feature), compression takes place on the client host. The client process reads the data, compresses it, and sends it to the media server. The media server just receives it and stores it to storage. No additional operation takes place on the media server.
Decompression takes place on the client host when you restore data.

Authorized Symantec Consultant(ASC) Data Protection in Tokyo, Japan

Amaan's picture

Actually compression was enabled on drive itself. In drive properties, not in netbackup.

Yasuhisa Ishikawa's picture

I misunderstood. Yes, you mean the Windows compression feature.

This may help your understanding of the Windows compression feature.
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/ff_file_compress_overview.mspx

 


Marianne's picture

You do not have 'file read errors'. As can be seen in the bpbkar log, you have a 'socket write error':

Exception of type [SocketWriteException]
 > tar_tfi::processException:
An Exception of type [SocketWriteException] has occured at:
  Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.51 $ , Function: TransporterRemote::write[2](), Line: 307
  Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.85 $ , Function: Packer::getBuffer(), Line: 659
  Module: tar_tfi::getBuffer, Function: H:\701\src\cl\clientpc\util\tar_tfi.cpp, Line: 296
  Local Address: [0.0.0.0]:0
  Remote Address: [0.0.0.0]:0
  OS Error: 10054 (An existing connection was forcibly closed by the remote host.)
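
If you want to check whether every failing stream ends with the same OS error, a short script can tally the codes across the logs. This is a minimal Python sketch that assumes the "OS Error: <code> (<text>)" line format shown in the excerpt above. The function name is hypothetical, and real log lines may carry extra prefixes, which the search tolerates.

```python
import re
from collections import Counter

# Matches lines such as:
#   OS Error: 10054 (An existing connection was forcibly closed ...)
OS_ERROR = re.compile(r"OS Error:\s*(\d+)\s*\((.*?)\)\s*$")


def count_os_errors(lines):
    """Count (code, message) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = OS_ERROR.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

Feed it the lines of each bpbkar log file (for example via open(path, errors="replace")) and compare the totals per day. Seeing only 10054 on the failed streams would be consistent with the remote side or something in between dropping the connection.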

 

You will need the bpbrm log on the media server as well.

I have seen similar errors resolved when the NIC in the media server was replaced.


Amarnath Sathishkumar's picture

http://www.symantec.com/business/support/index?page=content&id=TECH43249

Amarnath Sathishkumar

If this comment is helpful, don't forget to give a "Thumbs Up" or mark as "Solution"