Video Screencast Help

Restore job hangs and does not fail...

Created: 16 Jun 2010 • Updated: 26 Jan 2013 | 7 comments
Android's picture
This issue has been solved. See solution.

At least not in the activity monitor.  You can see a status 85 in the bptm log from the media server. This is occurring for more than one tape as you can see below so I'm guessing this is not bad media.  I thought it most likely a SIZE_DATA_BUFFER issue.  I'm still digging but this is what I have so far.  Any help is appreciated.

Master and all media servers are AIX.  This has been attempted on both the media server that ran the backup job and alternate media servers.

The following are from the bptm log on the media server:

20:34:22.805 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2060: service: bpdbm
20:34:22.930 [639034] <2> logconnections: BPDBM CONNECT FROM 10.93.28.127.37774 TO 10.93.39.191.13724
20:34:23.355 [639034] <8> io_read_block: read error on media id B00677, drive index 5 reading header block, A remote host did not respond within the timeout period.
20:34:23.355 [639034] <2> io_ioctl: command (12)MTBSF 4 from (bptm.c.6557) on drive index 5
20:35:24.823 [639034] <2> io_ioctl: command (11)MTFSF 1 from (bptm.c.6559) on drive index 5
20:35:24.834 [639034] <2> io_read_back_header: drive index 5, reading backup header
20:35:24.878 [639034] <2> io_position_for_read: at file number 358, skipping forward
20:35:24.878 [639034] <2> io_ioctl: command (11)MTFSF 3 from (bptm.c.6833) on drive index 5
20:35:45.330 [639034] <2> io_read_back_header: drive index 5, reading backup header
20:35:45.385 [639034] <2> io_read_back_header: drive index 5, reading backup header
20:35:47.803 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2046: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
20:35:47.803 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2060: service: bpdbm
20:35:47.967 [639034] <2> logconnections: BPDBM CONNECT FROM 10.93.28.127.37777 TO 10.93.39.191.13724
20:35:48.398 [639034] <8> io_read_block: read error on media id B00677, drive index 5 reading header block, A remote host did not respond within the timeout period.
20:35:48.398 [639034] <2> io_ioctl: command (6)MTREW 1 from (bptm.c.6570) on drive index 5

20:36:16.971 [639034] <2> io_read_back_header: drive index 5, reading backup header
20:36:17.030 [639034] <2> io_position_for_read: at file number 1, skipping forward
20:36:17.030 [639034] <2> io_ioctl: command (11)MTFSF 360 from (bptm.c.6833) on drive index 5
20:36:58.143 [639034] <2> io_read_back_header: drive index 5, reading backup header
20:36:58.198 [639034] <2> io_read_back_header: drive index 5, reading backup header
20:37:00.802 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2046: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
20:37:00.802 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2060: service: bpdbm
20:37:00.895 [639034] <2> logconnections: BPDBM CONNECT FROM 10.93.28.127.37799 TO 10.93.39.191.13724
20:37:01.336 [639034] <8> io_read_block: read error on media id B00677, drive index 5 reading header block, A remote host did not respond within the timeout period.
20:37:01.379 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2046: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
20:37:01.379 [639034] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2060: service: bpdbm
20:37:01.524 [639034] <2> logconnections: BPDBM CONNECT FROM 10.93.28.127.37800 TO 10.93.39.191.13724
20:37:01.974 [639034] <16> io_position_for_read: cannot read backup header on B00677, tape may be bad
20:37:01.974 [639034] <2> send_MDS_msg: DEVICE_STATUS 1 8022262 nmbpmed21 B00677 4000798 LTO09 2000058 READ_ERROR 0 0
20:37:02.055 [639034] <2> log_media_error: successfully wrote to error file - 06/15/10 20:37:02 B00677 5 READ_ERROR LTO09
20:37:02.055 [639034] <2> check_error_history: just tpunmount: called from bptm line 892, EXIT_Status = 85
20:37:02.057 [639034] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tpunmount)
20:37:02.057 [639034] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_LTO09, from bptm.c.18578
20:37:08.811 [639034] <2> drivename_write: Called with mode 1
20:37:08.811 [639034] <2> drivename_unlock: unlocked
20:37:08.811 [639034] <2> drivename_close: Called for file LTO09
20:37:08.811 [639034] <2> tpunmount: NOP: MEDIA_DONE 0 6996153 0 B00677 4000798 180 {1B216E96-1DD2-11B2-9B67-46D74ACF0000}
20:37:08.811 [639034] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 1073344 (tmmpx.c:1761)

20:37:12.604 [639034] <2> send_brm_msg: ERROR 85
20:37:12.604 [639034] <2> mpx_terminate_exit: Calling tpunmount for media B00677
20:37:12.604 [639034] <2> send_MDS_msg: MEDIA_DONE 0 6996153 0 B00677 4000798 180 {1B216E96-1DD2-11B2-9B67-46D74ACF0000}
20:37:12.605 [639034] <2> packageBptmResourceDoneMsg: msg (MEDIA_DONE 0 6996153 0 B00677 4000798 180 {1B216E96-1DD2-11B2-9B67-46D74ACF0000})
20:37:12.605 [639034] <2> packageBptmResourceDoneMsg: keyword MEDIA_DONE version 0 jobid 6996153 copyNum 0 mediaId B00677 mediaKey 4000798 unloadDelay 180 allocId {1B216E96-1DD2-11B2-9B67-46D74ACF0000}
20:37:12.605 [639034] <2> packageBptmResourceDoneMsg: returns 0
20:37:12.613 [639034] <2> JobInst::sendIrmMsg: returning
20:37:12.613 [639034] <2> mpx_terminate_exit: EXITING with status 85
20:37:12.613 [639034] <2> cleanup: removing shared memory for mpx, shmid = 166723613

21:00:44.035 [704564] <16> io_position_for_read: cannot read backup header on B00287, tape may be bad
21:00:44.035 [704564] <2> send_MDS_msg: DEVICE_STATUS 1 8022265 nmbpmed21 B00287 4000305 LTO09 2000058 READ_ERROR 0 0
21:00:44.113 [704564] <2> log_media_error: successfully wrote to error file - 06/15/10 21:00:44 B00287 5 READ_ERROR LTO09

21:40:07.869 [745480] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2060: service: bpdbm
21:40:07.998 [745480] <2> logconnections: BPDBM CONNECT FROM 10.93.28.127.39427 TO 10.93.39.191.13724
21:40:08.439 [745480] <16> send_brm_msg: unable to write bpbrm message, connection dropped or not connected (errno = There is no process to read data written to a pipe.)
21:40:08.439 [745480] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_LTO04, from mpxrestore.c.2734
21:40:08.439 [745480] <2> mpx_waiting_term: waiting for TERMINATE or another START RESTORE
21:40:08.439 [745480] <2> read_brm_msg: TERMINATE

#bpmedialist -mcontents -m B00606 -L
media_id = B00606, partner_id = *NULL*, version = 1
density = hcart2 (14)
allocated = 06/13/2010 05:30:07 (1276425007)
last_written = N.A (0)
expiration = N.A (0)
last_read = N.A (0)
kbytes = 0, nimages = 0, vimages = 0
retention_level = 4, num_restores = 0
status = 0x0, l_offset = 0, psize = 0, hsize = 1024, ssize = 0
vmpool = 6
res1 = 0, res2 = 0, res3 = 0, res4 = 0

backup_id = xxxxxxxxx_1276425007, version = 2
creation = 06/13/2010 05:30:07 (1276425007)
expiration = 08/14/2010 05:30:07 (1281781807)
retention_level = 4, fragment = 88, file_num = 1
block_size = 262144, status = 0x0
media_id = B00606, size = 0, data_start = 2
client_type = 0, copy_num = 1
sched_type = 0, flags = 0x0
policy = 2_Orafin_unix
sched_label = Full
opt_extra = 0, mpx_headers = 6, res1 = 0

# bpmedialist -mcontents -m B00287 -L
media_id = B00287, partner_id = *NULL*, version = 1
density = hcart2 (14)
allocated = 06/14/2010 20:23:29 (1276565009)
last_written = N.A (0)
expiration = N.A (0)
last_read = N.A (0)
kbytes = 0, nimages = 0, vimages = 0
retention_level = 3, num_restores = 0
status = 0x0, l_offset = 0, psize = 0, hsize = 1024, ssize = 0
vmpool = 6
res1 = 0, res2 = 0, res3 = 0, res4 = 0

backup_id = xxxxxxxxxx1_1276565009, version = 2
creation = 06/14/2010 20:23:29 (1276565009)
expiration = 07/15/2010 20:23:29 (1279243409)
retention_level = 3, fragment = 118, file_num = 1
block_size = 262144, status = 0x0
media_id = B00287, size = 0, data_start = 2
client_type = 13, copy_num = 1
sched_type = 0, flags = 0x4
policy = 2_General_win_1900
sched_label = Full_1_mo
opt_extra = 0, mpx_headers = 6, res1 = 0

Comments 7 CommentsJump to latest comment

Nicolai's picture

I don't think it's a SIZE_DATA_BUFFERS problem (nor shared memory allocation issue) . 262144 is 256K and a good value to use.

Do you have direct tape connect NDMP filers around ?.

From time to time I have seen Netbackup choosing a NDMP device as read host, and do NDMP level 3 remote read. If this NDMP device didn't respond this may answer the "remote host did not ...."

Assumption is the mother of all mess ups.

If this post answered your'e qustion -  Please mark as a soloution.

Android's picture

There re no NDMP hosts.  It appears that the job is now failing with a status of 5.

06/17/2010 13:31:39 - begin Restore
06/17/2010 13:31:43 - number of images required: 1
06/17/2010 13:31:43 - media needed: B00170
06/17/2010 13:33:48 - restoring from image xxxxxxxxxxxx_1274759754
06/17/2010 13:33:51 - connecting
06/17/2010 13:33:54 - Warning bpbrm (pid=209144) expected start message from nmcabsdev1; read: Unrecognized -J string spsrestoreoptions=0
06/17/2010 13:33:54 - connected; connect time: 0:00:00
06/17/2010 13:33:52 - requesting resource B00170
06/17/2010 13:33:53 - granted resource  B00170
06/17/2010 13:33:53 - granted resource  LTO02
06/17/2010 13:33:55 - started process bptm (pid=1720442)
06/17/2010 13:33:55 - mounting B00170
06/17/2010 13:34:26 - mounted B00170; mount time: 0:00:31
06/17/2010 13:34:28 - positioning B00170 to file 355
06/17/2010 13:35:14 - Warning bptm (pid=1720442) read error on media id B00170, drive index 1 reading header block, A remote host did not respond within the timeout period.
06/17/2010 13:36:56 - Warning bptm (pid=1720442) read error on media id B00170, drive index 1 reading header block, A remote host did not respond within the timeout period.
06/17/2010 13:38:14 - Warning bptm (pid=1720442) read error on media id B00170, drive index 1 reading header block, A remote host did not respond within the timeout period.
06/17/2010 13:38:14 - Error bptm (pid=1720442) cannot read backup header on B00170, tape may be bad
06/17/2010 13:38:29 - Error bptm (pid=2293776) The following files/folders were not restored:
06/17/2010 13:38:31 - Error bptm (pid=2293776) UTF - /core
06/17/2010 13:38:31 - restored from image nmcabsdev1_1274759754; restore time: 0:04:43
06/17/2010 13:38:34 - Warning bprd (pid=688352) Restore must be resumed prior to first image expiration on Thu Jun 24 22:55:54 CDT 2010
06/17/2010 13:38:36 - end Restore; elapsed time 0:06:57
the restore failed to recover the requested files (5)

wr's picture

UTF - /core

^ wouldn't think that would be any great loss blush

256k is pretty large block for tape (default 64k). 
have you ever been able to do a restore with this setting?

Will Restore -- where there is a Will there is a way

Arasan's picture

Hi,

I too experience same media read error issue.   Rebooted netbackup setup including tape library.  There is no problem with the backups.  I have tried for multiple clients including client recently backed up.  All has same error as below.

Please let us know if your problem was resolved.

06/17/2010 13:31:39 - begin Restore
06/17/2010 13:31:43 - number of images required: 1
06/17/2010 13:31:43 - media needed: B00170
06/17/2010 13:33:48 - restoring from image test1_1274759754
06/17/2010 13:33:51 - connecting
06/17/2010 13:33:54 - Warning bpbrm (pid=209144) expected start message from test1; read: Unrecognized -J string spsrestoreoptions=0
06/17/2010 13:33:54 - connected; connect time: 0:00:00
06/17/2010 13:33:52 - requesting resource B00170
06/17/2010 13:33:53 - granted resource  B00170
06/17/2010 13:33:53 - granted resource  LTO02
06/17/2010 13:33:55 - started process bptm (pid=1720442)
06/17/2010 13:33:55 - mounting B00170
06/17/2010 13:34:26 - mounted B00170; mount time: 0:00:31
06/17/2010 13:34:28 - positioning B00170 to file 355
06/17/2010 13:35:14 - Warning bptm (pid=1720442) read error on media id B00170, drive index 1 reading header block, A remote host did not respond within the timeout period.
06/17/2010 13:36:56 - Warning bptm (pid=1720442) read error on media id B00170, drive index 1 reading header block, A remote host did not respond within the timeout period.
06/17/2010 13:38:14 - Warning bptm (pid=1720442) read error on media id B00170, drive index 1 reading header block, A remote host did not respond within the timeout period.
06/17/2010 13:38:14 - Error bptm (pid=1720442) cannot read backup header on B00170, tape may be bad
06/17/2010 13:38:29 - Error bptm (pid=2293776) The following files/folders were not restored:
06/17/2010 13:38:31 - Error bptm (pid=2293776) UTF - /core
06/17/2010 13:38:31 - restored from image test1_1274759754; restore time: 0:04:43
06/17/2010 13:38:34 - Warning bprd (pid=688352) Restore must be resumed prior to first image expiration on Thu Jun 24 22:55:54 CDT 2010
06/17/2010 13:38:36 - end Restore; elapsed time 0:06:57
the restore failed to recover the requested files (5)

-Regards,

Arasan

wr's picture

UTF - /core

I would exclude /core from backup 

Is there some compelling reason you need to restore that ?

Will Restore -- where there is a Will there is a way

Nicolai's picture

UNIX systems never had any issues with large SCSI block sizes. You only have to be careful when running Windows.

My primary concern is this message:  A remote host did not respond within the timeout period

Do you have multiple network card in the Netbackup servers ?. Have media servers been added lately ?

Assumption is the mother of all mess ups.

If this post answered your'e qustion -  Please mark as a soloution.

Android's picture

After looking through extensive NBU logs I was finally able to get the system logs from the end user. It was one of the first things I asked for.  Both media servers are showing hardware issues for multiple tape drives and also the HBA's.  I advised them to explore the issue from the hardware side, looking first at the SAN environment since problems were reported on both media servers for multiple drives and they're using SSO.  

Once I have an update I will let you guys know.

SOLUTION