
consistent Media Write error (84)

Created: 19 Feb 2007 • Updated: 22 May 2010 | 19 comments

Hi,

I'm working on tuning the throughput of my backups, and I have run into a problem.

The environment I use is a StorageTek SL8500 with T10000 tape drives. In some documentation for this specific drive that I found on the web, the following parameters were suggested:

NUMBER_DATA_BUFFERS is set to 16
SIZE_DATA_BUFFERS is set to 2097152

NUMBER_DATA_BUFFERS_DISK is set to 16
SIZE_DATA_BUFFERS_DISK is set to 2097152

Windows 2003 server/SP1 with NetBackup 6 MP4 (SAN Media server called APP0109), no multiplexing and no checkmark for the fragment size.

The speed I get is acceptable (around 85,000 KB/s), although double that should be possible as well.

Unfortunately, the backup stops at the exact same point EVERY TIME (after running for about 16.5 minutes, having written 77406208 KB and 125 files):

2/19/2007 3:38:01 PM - started process bpbrm (1756)
2/19/2007 3:38:01 PM - connecting
2/19/2007 3:38:01 PM - connected; connect time: 00:00:00
2/19/2007 3:38:05 PM - mounting S00261
2/19/2007 3:38:44 PM - mounted; mount time: 00:00:39
2/19/2007 3:38:45 PM - positioning S00261 to file 1
2/19/2007 3:38:48 PM - positioned S00261; position time: 00:00:03
2/19/2007 3:38:48 PM - begin writing
2/19/2007 3:54:16 PM - Error bptm(pid=2772) FREEZING media id S00261, too many data blocks written, check tape/driver block size configuration
2/19/2007 3:54:17 PM - Error bpbrm(pid=1756) from client app0109: ERR - bpbkar exiting because backup is aborting
2/19/2007 3:54:20 PM - Error bpbrm(pid=1756) could not send server status message
2/19/2007 3:54:21 PM - end writing; write time: 00:15:33
media write error(84)
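
As a quick sanity check on the throughput figure, the average write rate can be worked out from the "begin writing" and "end writing" timestamps and the KB count quoted above. This is only a sketch of the arithmetic; the variable names are illustrative and the values are copied by hand from this post.

```python
from datetime import datetime

# Values copied from the job details in this post.
FMT = "%m/%d/%Y %I:%M:%S %p"
begin = datetime.strptime("2/19/2007 3:38:48 PM", FMT)  # "begin writing"
end = datetime.strptime("2/19/2007 3:54:21 PM", FMT)    # "end writing"
kb_written = 77406208

seconds = (end - begin).total_seconds()
rate_kb_s = kb_written / seconds

print(f"write time: {seconds:.0f} s")
print(f"average rate: {rate_kb_s:,.0f} KB/s ({rate_kb_s / 1024:.1f} MB/s)")
```

That works out to roughly 83,000 KB/s over the 15:33 write window, which matches the speed mentioned above.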


What I do not understand: if the block size is incorrect, why does the backup run for quite a while before failing?

Anybody know what is going on?

Fred


Stumpr2's picture

> What I do not understand is if the blocksize is incorrect, why it runs for quite a bit...

That's a huge buffer you are trying to fill!

Create the legacy debug log directory /usr/openv/netbackup/logs/bptm (install_path\NetBackup\logs\bptm on a Windows media server) and see what's going on there. Where did you get the data buffer size? Which publication?

You gotta make sure the server's CPU and memory can handle it.
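
To put a rough number on the memory side, here is a minimal sketch of the usual rule of thumb for shared data buffer memory: buffers x buffer size x drives x multiplexing. The formula is stated from memory of the tuning guide rather than quoted from this thread, and the function name and the four-drive figure are hypothetical.

```python
def shared_buffer_bytes(number_data_buffers, size_data_buffers, drives=1, mpx=1):
    """Estimate shared memory used for data buffers (assumed rule of thumb)."""
    return number_data_buffers * size_data_buffers * drives * mpx

MIB = 1024 * 1024

# Settings from the original post: 16 buffers of 2097152 bytes, no multiplexing.
per_drive = shared_buffer_bytes(16, 2097152)
# Hypothetical: four drives writing at the same time on one media server.
four_drives = shared_buffer_bytes(16, 2097152, drives=4)

print(f"one drive:   {per_drive // MIB} MiB")
print(f"four drives: {four_drives // MIB} MiB")
```

With the posted settings that is 32 MiB per active drive, so the total depends mostly on how many drives the media server writes to concurrently.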

VERITAS ain't it the truth?

Fred2010's picture

Hi Bob,

I know most drives only go up to 262144 (256 KB), but this drive should be able to do 120 MB/s uncompressed, hence the gigantic block size.

I found the article discussing this drive here:

http://www.alliancestorage.com/pdf/whitepapers/Mix...

On page 39 you can see their suggestion for netbackup environments...

I'll try to get some extra logging done, good idea :)

Stumpr2's picture

Manfred,

Hey great link !!!!!

It looks like if you want to keep the 2MB buffer that you should probably cut down the number of buffers from 16 to 8 as described on the next page of the document you provided.

Also keep in mind the limitations of the media server as described in the Backup Planning and Performance Tuning Guide for UNIX, Windows, and Linux

IMPORTANT: Because the data buffer size equals the tape I/O size, the value specified in SIZE_DATA_BUFFERS must not exceed the maximum tape I/O size supported by the tape drive or operating system. This is usually 256 or 128 Kilobytes. Check your operating system and hardware documentation for the maximum values. Take into consideration the total system resources and the entire network. The Maximum Transmission Unit (MTU) for the LAN network may also have to be changed. NetBackup expects the value for NET_BUFFER_SZ and SIZE_DATA_BUFFERS to be in bytes, so in order to use 32k,
use 32768 (32 x 1024).

Message was edited by: Bob Stump

VERITAS ain't it the truth?
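
Since the quoted note stresses that the value must be given in bytes, here is a small sketch of the conversion and of writing it to a touch file. The helper name is made up, and the demo writes into a temporary directory rather than the real NetBackup db/config directory, whose location on a given server is an assumption left to the reader.

```python
import tempfile
from pathlib import Path

def write_buffer_setting(config_dir, name, kilobytes):
    """Write a buffer size, converted from KB to bytes, into a touch file."""
    value = kilobytes * 1024
    (Path(config_dir) / name).write_text(f"{value}\n")
    return value

written = {}
# Demo only: a temporary directory stands in for the real config directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for kb in (32, 256, 2048):
        written[kb] = write_buffer_setting(demo_dir, "SIZE_DATA_BUFFERS", kb)
    final_content = (Path(demo_dir) / "SIZE_DATA_BUFFERS").read_text().strip()

for kb, value in written.items():
    print(f"{kb} KB -> {value} bytes")
```

The three sizes discussed in this thread come out as 32768, 262144 and 2097152 bytes.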

Fred2010's picture

Hi Bob,

Yeah, I was quite happy to find that document as well :)

It has some great info on my drives (Also for your STKs if I recall correctly ;) )

The problem seems related to the checkpoint restart.

As soon as it reaches the 15-minute mark, it starts a new fragment and fails a positioning check, for reasons unclear to me:


17:17:36.525 <4> write_backup: begin writing backup id app0109_1171900880, copy 1, fragment 2, to media id S00261 on drive index 2
17:17:36.525 <2> write_data: ndmp_dup_max_frag is set to 0
17:17:36.525 <2> write_data: twin_index: 0 active: 1 dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0 twin_is_disk 0 delay_brm: 0
17:17:36.525 <2> write_data: absolute block position prior to writing backup header(s) is 188984, copy 1
17:17:36.525 <2> write_data: block position check: actual 188984, expected 37800
17:17:36.525 <2> vnet_vnetd_service_socket: vnet_vnetd.c.2034: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
17:17:36.525 <2> vnet_vnetd_service_socket: vnet_vnetd.c.2048: service: bpjobd
17:17:36.666 <2> logconnections: BPJOBD CONNECT FROM 10.5.102.47.1766 TO 10.5.102.176.13724
17:17:36.666 <2> job_authenticate_connection: ignoring VxSS authentication check for now...
17:17:36.666 <2> job_connect: Connected to the host app0100 contype 10 jobid <29772> socket <1304>
17:17:36.666 <2> job_connect: Connected on port 1766
17:17:36.884 <2> job_monitoring_exex: ACK disconnect
17:17:36.884 <2> job_disconnect: Disconnected
17:17:36.900 <2> vnet_vnetd_service_socket: vnet_vnetd.c.2034: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
17:17:36.900 <2> vnet_vnetd_service_socket: vnet_vnetd.c.2048: service: bpdbm
17:17:37.103 <2> logconnections: BPDBM CONNECT FROM 10.5.102.47.1767 TO 10.5.102.176.13724
17:17:37.837 <16> write_data: FREEZING media id S00261, too many data blocks written, check tape/driver block size configuration
17:17:37.837 <2> send_MDS_msg: DEVICE_STATUS 1 1327 app0109 S00261 4000091 STK.T10000A.000.0.0.1.2 2000423 WRITE_ERROR 0 0
17:17:37.837 <2> JobInst::sendIrmMsg: returning
17:17:37.837 <2> log_media_error: successfully wrote to error file - 02/19/07 17:17:37 S00261 2 WRITE_ERROR STK.T10000A.000.0.0.1.2
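
A rough way to read the "block position check" line is to compare the two counts with the amount of data written. The sketch below parses that log line and reuses the KB figure from the first failed job, on the assumption that this run stopped at the same point; the names are illustrative and the result is a hint, not a diagnosis.

```python
import re

line = ("17:17:36.525 <2> write_data: block position check: "
        "actual 188984, expected 37800")

match = re.search(r"block position check: actual (\d+), expected (\d+)", line)
actual, expected = int(match.group(1)), int(match.group(2))
ratio = actual / expected

# From the first failed job: 77406208 KB written with 2097152-byte buffers.
kb_written = 77406208
buffer_kb = 2097152 // 1024
buffers_written = kb_written // buffer_kb

print(f"actual/expected block ratio: {ratio:.2f}")
print(f"2 MB buffers written: {buffers_written} (expected position {expected})")
```

The expected position is close to the number of 2 MB buffers written, while the drive reports about five times as many blocks. That would be consistent with each 2 MB write ending up on tape as several smaller blocks, but the log alone does not prove it.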


Any further ideas? (Running your suggestion of 8 buffers in the background now, by the way)...

Fred

Fred2010's picture

Hmmm...

No go with fewer buffers: it stops at the exact same spot as before...

I am trying without CHECKPOINT RESTART now...

Also this seems interesting: Referring to a thread you posted some time ago:

http://forums.symantec.com/discussions/thread.jspa...

Fred

Stumpr2's picture

wow, that was old.

Just making sure: when I was checking the compatibility matrix, there was a note 5 for the T10000, which reads:

5. new in ACSLS 7.1 with patch PUT0601 or higher equivalent

I'm not sure what the patch is for, but is it installed?

and in another area there is a reference to this technote:
http://seer.support.veritas.com/docs/271889.htm

With NetBackup 6.0 software, a pass-thru path is required for all tape drives; if a pass-thru path does not exist, the drive cannot be used by NetBackup software. The NetBackup Auto-Discovery functionality automatically creates pass-thru paths.
With NetBackup 6.0 software, WORM tape usage is automatically enabled. It can be turned off by creating a touch file named DISABLE_WORM_TAPE in the /usr/openv/netbackup/db/config/ directory. The existence of this file removes the requirement to have pass-thru paths to all tape drives. If the file exists, NetBackup will not check for WORM media and will write to WORM media with the standard tape format. This will cause append operations to WORM media to fail.

Message was edited by: Bob Stump

VERITAS ain't it the truth?

Fred2010's picture

Hi Bob,

I really have no idea if that patch is present on the ACSLS server to be honest.

Sun/STK handle that machine (I only use cmd_proc).

Any way I can check if it is applied?

(Warning: Solaris newbie alert here! ;) )

By the way: without CHECKPOINT RESTART the job keeps on running... Is the position check a bug or a feature?

Message was edited by: Manfred Engels

Stumpr2's picture

Dennis Strom is good with Solaris.

....Dennis??

VERITAS ain't it the truth?

Dennis Strom's picture

pkginfo -l for a full (long-format) listing
pkginfo -p to list only partially installed packages.

Fred2010's picture

Hi Dennis,

Thanks :)

Loads of patches and packages are listed, but only when I drop the grep (Sun obviously decided to name the installed patch differently from the downloadable one...)

I will ask the Sun people what patch is installed...

Stumpr2's picture

Fred,

Why don't you try knocking down the SIZE_DATA_BUFFERS to 262144 (256K) just as a test. That should at least prove the drives, the drivers, and such are good.

Message was edited by: Bob Stump

VERITAS ain't it the truth?

Fred2010's picture

Hi Bob,

Yup, I've tried that as well:

Both
SIZE_DATA_BUFFERS at 262144 (256K)

and

SIZE_DATA_BUFFERS at 2097152 (2048K)

work, as long as I don't enable CHECKPOINT RESTART at 15 minutes... I figure it has to do with that POSITION CHECK failing, but I have no idea what to do about it...

For now I'll disable the CHECKPOINT RESTART feature and let it write in a nice long stripe ;)

Thanks for your help guys!

Dennis Strom's picture

There are some bugs in NetBackup 6 when using multistreaming and checkpoint restart.

AKopel's picture

Hey Dennis...
Do you know if these bugs still exist with MP4? I have been noticing some oddities with Checkpoints myself and was about to open a ticket. Do you have any specifics?

Thanks!
AK

Dennis Strom's picture

No specifics, but I had a policy giving me fits; I disabled checkpoint restart and the problem went away. I think I remember someone else on MP4 experiencing this.

AKopel's picture

Ok...
I haven't had any problems like that.. I am just having cases where Checkpoints do not seem to work at all (they "check" right back to the beginning)

Rakesh Khandelwal's picture

You can't check it from cmd_proc. But if you have access to command prompt/shell (you should be able to get to prompt after running logoff at cmd_proc) you can run

pkginfo -l STKacsls

This will give you output like the following:

# pkginfo -l STKacsls
   PKGINST:  STKacsls
      NAME:  PUT0502S for ACSLS
  CATEGORY:  system
      ARCH:  sparc
   VERSION:  release 7.1.0
   BASEDIR:  /export/home
    VENDOR:  StorageTek
    PSTAMP:  cody20050728155800
  INSTDATE:  Mar 28 2005 10:50
    STATUS:  completely installed
     FILES:  1583 installed pathnames
               50 linked files
              108 directories
              410 executables
               33 setuid/setgid executables
           409472 blocks used (approx)
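
If the listing needs to be checked on more than one server, the fields can be pulled out with a few lines of script. This is a sketch only: it parses a shortened copy of the sample output above, not the output of a live pkginfo call, and the function name and the PUT pattern are illustrative assumptions.

```python
import re

# Shortened copy of the sample output shown in this post.
sample = """\
PKGINST: STKacsls
NAME: PUT0502S for ACSLS
VERSION: release 7.1.0
INSTDATE: Mar 28 2005 10:50
STATUS: completely installed
FILES: 1583 installed pathnames
50 linked files
"""

def parse_pkginfo(text):
    """Collect 'KEY: value' lines; continuation lines without a key are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().isupper():
            fields[key.strip()] = value.strip()
    return fields

info = parse_pkginfo(sample)
put = re.search(r"PUT\d+\w*", info["NAME"])
put_level = put.group(0) if put else None

print(info["VERSION"], put_level)
```

The sample is an example listing, not necessarily the state of the ACSLS server discussed in this thread, so the PUT level it reports should not be read as the answer to the patch question.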

Fred2010's picture

The problem was related to the setting of the buffer size.

The following setting works (Even though the drive can do more, Windows can't apparently...):

NUMBER_DATA_BUFFERS is set to 16
SIZE_DATA_BUFFERS is set to 262144

Thanks for all the help!