Too Many Jobs Active = Status 84's
Has anyone had any experience with this? It just started happening in the past couple of days, and tonight's Fulls should be an adventure. I have about 600 jobs that begin at 6:00 PM. I have a 20-drive library with multiplexing set to 3 per drive. Even when everything is running great, I don't think I've ever actually seen 60 jobs running at once. For the past couple of nights, there have been over 90 Active jobs trying to write to tapes and I get Status 84's out the wazoo. That goes on for maybe an hour and then it all seems to settle down and start running smoothly. Most of the jobs only fail the one time, requeue, and finish fine. Some jobs fail completely and when I restart them, they finish fine. Everything runs great until the 10 o'clock wave of jobs begins and it's Status 84 time again for another 30 or 45 minutes. I have 7 media servers sharing the library and it's as if they have forgotten that the library is being shared and everyone is trying to write at the same time. Each storage unit is set to use only 4 drives max. This is a Windows environment and I'm on 5.1 MP6 and Windows Server 2003. Any ideas?
Thanks,
Randy
Comments
Something to start with .......
This issue has been seen on Windows 2003 media servers that use a value larger than 64k (65536 bytes) in SIZE_DATA_BUFFERS after upgrading to Windows 2003 SP1. The cause is a change to tape.sys in SP1 that limits the block size for tape transfers to 64k. Microsoft is aware of this issue and has published a knowledge base article and hotfix to correct it (see below).
If the patch cannot be applied for any reason, as a workaround the value in SIZE_DATA_BUFFERS can be set to 64k or below, but this may affect backup performance. The SIZE_DATA_BUFFERS touchfile can be found in the netbackup\db\config directory and should contain a byte value that is a multiple of 1024. Below are a few examples.
64k = 65536
128k = 131072
256k = 262144
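For anyone who wants to double-check the byte values before editing the touchfile, here is a minimal Python sketch. It only does the arithmetic (64k is 65536 bytes) and the multiple-of-1024 check described above; the function names are made up for illustration and are not part of NetBackup.

```python
KIB = 1024  # bytes per "k" as used in this thread


def size_data_buffers_value(kib: int) -> int:
    """Return the byte value to put in SIZE_DATA_BUFFERS for a size given in k."""
    if kib <= 0:
        raise ValueError("buffer size must be positive")
    return kib * KIB


def is_valid_buffer_value(value: int) -> bool:
    """True when the value is a positive multiple of 1024."""
    return value > 0 and value % KIB == 0


# Print the same kind of table as in the post.
for kib in (32, 64, 128, 256):
    print(f"{kib}k = {size_data_buffers_value(kib)}")
```

A value such as 56636 fails the check because it is not a multiple of 1024.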
Here is a link to the Microsoft Knowledge Base article:
http://support.microsoft.com/?kbid=907418
Message was edited by: RK
STATUS CODE: 84 "Media Write Failed" error occurs consistently on certain media that are not known to be defective.
http://support.veritas.com/docs/277081
Exact Error Message
Media Write Failed (<84>)
<16> io_write_block: write error on media id N00041, drive index 2, writing header block, 19
Details:
Overview:
Status Code 84 "Media Write Failed" error occurs consistently on certain media that are not known to be defective.
Troubleshooting:
Please look for messages similar to the following. Observe the "19" at the end of the line, following the write error on the media header. This is a message number reported by the OS, that can be translated with the net command. Typing net helpmsg 19 from a command prompt reports "The media is write protected." When this message is seen, write protection is the cause.
Master Log Files: N/A
Media Server Log Files:
BPTM:
<16> io_write_block: write error on media id N00041, drive index 2, writing header block, 19
Client Log Files: N/A
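If there are many such lines to wade through, a small script can pull the trailing OS message number out of each one. This is only a sketch in Python: the regular expression is modeled on the single bptm line quoted here, so other NetBackup versions may format the line differently, and the function names are hypothetical.

```python
import re

# Pattern based on the one bptm line quoted in this technote (an assumption,
# not a documented log format).
WRITE_ERROR = re.compile(
    r"io_write_block: write error on media id (?P<media>[^,\s]+), "
    r"drive index (?P<drive>\d+), (?P<stage>.+), (?P<code>\d+)\s*$"
)

WRITE_PROTECTED = 19  # "net helpmsg 19" reports "The media is write protected."


def parse_write_error(line: str):
    """Return media id, drive index, stage and OS error code, or None if no match."""
    match = WRITE_ERROR.search(line)
    if match is None:
        return None
    return {
        "media_id": match["media"],
        "drive_index": int(match["drive"]),
        "stage": match["stage"],
        "os_error": int(match["code"]),
    }


def is_write_protected(line: str) -> bool:
    """True when the line is a write error whose OS message number is 19."""
    parsed = parse_write_error(line)
    return parsed is not None and parsed["os_error"] == WRITE_PROTECTED


sample = ("<16> io_write_block: write error on media id N00041, "
          "drive index 2, writing header block, 19")
print(parse_write_error(sample))
```

Collecting the media IDs from every matching line gives a list of tapes to pull and inspect.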
Resolution:
Remove the write protection from the tape by adjusting the write protect notch, or by following the recommended procedure from the media manufacturer on how to disable write protection. If media is intentionally meant to be write protected, either remove the media from the robot or freeze the media in NetBackup (tm) so that it is not available for backup attempts. Media can be frozen and unfrozen using the bpmedia command. For more information on this procedure, please see the NetBackup Commands for Windows Guide.
http://support.veritas.com/docs/275076
http://support.veritas.com/docs/275066
In-depth Troubleshooting Guide for Exit Status Code 84 in NetBackup (tm) Server / Enterprise Server 5.0 / 5.1
http://support.veritas.com/docs/273908
Message was edited by: RK
7 media servers X 4 drives = total of 28 tape drives
simply doesn't add up. You're short 8 tape drives. Oh... you already knew that!
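To spell out the arithmetic, here is a quick Python sketch. It assumes one storage unit per media server (which is how the 7 X 4 figure is being counted) and uses the numbers from the original post: 20 physical drives and multiplexing of 3. Whether this oversubscription is what actually causes the Status 84's is not established; the sketch only shows the gap between what is configured and what physically exists.

```python
def max_streams(drives: int, mpx: int) -> int:
    """Upper bound on concurrent multiplexed streams for a number of drives."""
    return drives * mpx


media_servers = 7            # from the original post
drives_per_storage_unit = 4  # "4 drives max" per storage unit
physical_drives = 20         # drives actually in the library
mpx = 3                      # multiplexing per drive

# Assumption: one storage unit per media server.
configured_drives = media_servers * drives_per_storage_unit
shortfall = configured_drives - physical_drives

print(f"configured drives:  {configured_drives}")
print(f"physical drives:    {physical_drives}")
print(f"shortfall:          {shortfall}")
print(f"streams physically possible: {max_streams(physical_drives, mpx)}")
print(f"streams configuration allows: {max_streams(configured_drives, mpx)}")
```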
I have about 600 jobs that begin at 6:00 PM
.....That goes on for maybe an hour
How about starting some of them at 7 PM?
Everything runs great until the 10 o'clock wave of jobs
....again for another 30 or 45 minutes
How about starting some of these at 11 o'clock?
too simple? j/k
Seriously, try adjusting the backup windows so that you don't get slammed all at once. I think there are also some tuning changes that you should probably make if you do subscribe to the big bang theory. I don't remember it offhand, but there is a technote out on it.
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
I used to have over 1200 that started at 6:00. Sad story but we finally hired someone to start helping out so I could find time to fine tune our installation and then last week one of my guys was killed in a motorcycle wreck. Now we're back to having just enough time to put the fire out and move on to the next task.
I am in the process of prioritizing Production Critical boxes and I'm going to kick those off first and then move down the food chain of clients. I'm looking into Rakesh's suggestion because I also saw an error about block size and the tape not accepting a certain size, but I don't recall the exact error. I thought it was strange because I hadn't changed anything regarding block size, but we did just roll out the latest MS patches.
> one of my guys was killed in a motorcycle wreck
Randy, I'm so sorry for you. It is hard losing someone like that.
When I was in the Navy during an overhaul period in the Bremerton, Washington shipyards, I had 2 roommates die in motorcycle accidents. One hit a tree and the other one hit a mailbox. They were both young, in their early to mid twenties. I haven't been on a bike since then.
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
Last year I FINALLY made up my mind that I was going to buy a motorcycle with my tax refund this year. I have now finally made up my mind that I never will.
Here's the error I'm seeing on one of the media servers.
Error bptm(pid=7912) The tape device at index -1 has a maximum block size of 32768 bytes, a buffer size of 65536 cannot be used
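For comparing this message across several media servers, a rough Python sketch like the one below can extract the two sizes. The pattern is based only on the wording of the message quoted here, and the helper names are hypothetical, so treat it as an illustration rather than a supported parser.

```python
import re

# Pattern based on the wording of the bptm message quoted above (an assumption).
BLOCK_SIZE_ERROR = re.compile(
    r"maximum block size of (?P<max>\d+) bytes, "
    r"a buffer size of (?P<buf>\d+) cannot be used"
)


def parse_block_size_error(line: str):
    """Return (max_block_size, requested_buffer_size) in bytes, or None if no match."""
    match = BLOCK_SIZE_ERROR.search(line)
    if match is None:
        return None
    return int(match["max"]), int(match["buf"])


def largest_usable_buffer(max_block_size: int, step: int = 1024) -> int:
    """Largest multiple of `step` that does not exceed the reported maximum."""
    return (max_block_size // step) * step


message = ("bptm(pid=7912) The tape device at index -1 has a maximum block size "
           "of 32768 bytes, a buffer size of 65536 cannot be used")
max_block, requested = parse_block_size_error(message)
print(f"device limit {max_block}, requested {requested}, "
      f"largest usable value {largest_usable_buffer(max_block)}")
```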
STATUS CODE 84: After applying Service Pack 1 in Windows 2003, Status 84 errors occur during tape backup. An additional error appears in the Activity Monitor, noting that the buffer size cannot be used.
http://support.veritas.com/docs/278837
Backup fails with a Status Code 84.
http://support.veritas.com/docs/246554
Message was edited by: Bob Stump
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
Oh great, I just switched to the latest HP driver because it was newer. But it was working fine before the patch. Maybe I need to run the driver install again to make sure?
From looking through my bptm logs, it looks like all of my drives are set to a 32k limit. How did that happen, and where does that get configured? Would that be the driver? I'm using the latest HP driver now; should I contact HP? Or just reinstall the Veritas driver I was using before?
I would reinstall the Veritas driver
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
I have a SAN Media server that fails immediately with a Status 84 and gives me the 32k vs. 64k buffer error. If I point it at one of the media servers, the backup runs fine. I have verified that they are all using the same HP driver. Some work, some don't. At this point I'm getting so close to tonight's backups that I'm afraid to change anything. I'd rather just restart the failed jobs and contact HP or Symantec on Monday unless someone has a quick solution. We're patching a remote site tonight and I'm wondering if I'm going to do the same to the backups there.
Since applying the MS patches last Friday, I've had 586 Status 8x regarding media. The 3 weeks prior to the patch, I had 137. Seems like I found the culprit but I can't do anything about it for now. No reboots allowed until after the weekend. This should be a fun weekend.
Any changes in throughput? Perhaps active jobs are running slower and thus longer?
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
The SAN media server that was failing immediately with Status 84's had a NUMBER_DATA_BUFFERS file set at 128 but didn't have a SIZE_DATA_BUFFERS file. I created the SIZE_DATA_BUFFERS file and set it at 32768 and the backup is running fine. It's running slower than normal, A LOT slower, but that's to be expected I would guess. My enterprise is too big to try to fix this now. I'm going to have to wade through the errors over the weekend and fix this on Monday.
Do I ask HP for a new driver; do I lay this on them and make them fix it? Or will installing the old Veritas driver fix it? Why wouldn't the MS patch affect the Veritas driver? Or would it have affected the Veritas driver if that's what I had loaded at the time?
Sorry about all of the questions but I've had my head stuck so far up NetBackup's tail end that I've forgotten all of my Microsoft basics.
Any reason you are using a buffer size of only 32K instead of 64K?
I would suggest installing the Veritas driver unless the NBU release notes say anything specific about the drive/library type.
Apparently the HP driver is restricting me to 32k. The normal backups were failing with Status 84 with a message about the drive only being able to handle 32k. I added the SIZE_DATA_BUFFERS file, set it to 32k, and the backups ran fine. I don't want to use 32k but something is restricting me to that number. I'm sure rolling back the driver to the previous Veritas release will fix it but that will require a reboot and I'm out of my reboot window. I'll have to hope for the best this weekend and fix it on Monday.
I would assume applying the first patch you recommended would also fix the problem?
Randy,
My current contract expires in 1 month. I am talking with a company called Rackspace about an opportunity in San Antonio. Do you know anything about this company? I thought since you live only a couple hundred miles away that you may have heard of them.
http://www.rackspace.com/index.php
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
Bob,
Good luck on your future job!
Do me a favor though: Tell 'em to get rid of that 'automatic-chat-to-a-live-person thingy' on their homepage ;)
Message was edited by: Manfred Engels
I have to give it to Rakesh on this one. The "19" at the end led me to the culprit. When I first read the response I dismissed it because certainly no one would load so many write protected tapes to cause hundreds of Status 84's. If I could list names I would but I doubt they want that kind of publicity. I pulled 30 tapes just from last night's errors and I'm going back through the logs now to see how many more there are. I have jobs running constantly so it's difficult for me to open the door and just look.
Other than the failure message, will NetBackup tell me if a tape is write protected BEFORE it tries to use it?
There is no way for NetBackup to find out if a tape is write protected until it tries to write to the tape.
Other than the error in the bptm logs, I think you may see a message that the media is write protected and that NetBackup is freezing it in your bperror -media logs.
I can live with that now that I know what the problem is. Thanks again, Rakesh.
Glad I was able to help.
Thanks for the points :)
Believe it or not, I had an operator place the tapes upside down in the library. The inventory went OK. I'm still not sure how the barcode reader could read a tape label that was upside down. But it did! Then when NetBackup tried to use the tape, it couldn't load the tape upside down and so it would freeze it.
My problem at the time was that I could not afford to have so many tapes unavailable. I needed them for backups.
Bob Stump VERITAS - "Ain't it the truth?" Incorrigible punster -- Do not incorrige
I used to leave tapes on one guy's desk with a sticky note letting him know the tapes were ready to be put back in the library. And that's exactly what he did; sticky note and all. I'm telling you, there's a cartoon strip in the making here. NBU Peanuts.