Buffer / Memory Issues

Created: 07 Feb 2013 | 16 comments

Hello All,

Quite a few of my client backups have slowed down drastically, with backup speeds in the range of only 5000 - 8000 kbps.

Upon further checking the performance of the servers, I found that the NetBackup servers (both master and media) have almost 60% of memory in use and 35-40% in standby, leaving virtually no free memory.

Could this be the reason for the degraded backups to the PureDisk units?

Thanks.

Regards,

Adnan

Marianne's picture

Have you verified the 'deduplication server requirements' in the NetBackup Deduplication Guide?

It is extremely important that sufficient CPU and memory resources are allocated to media servers.

Also see 'Media server deduplication sizing' in NBU Deduplication - Additional Info:  http://www.symantec.com/docs/TECH77575

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

Adnan F's picture

Thank you Marianne.

My NetBackup servers (both master and media) meet the recommended CPU and memory. I have about 20 TB of data to back up each day. We do full backups daily. I have 20 GB of RAM on each server.

Adnan F's picture

Some more info: "dbclient" appears to be waiting for quite a while. See the line below from the job activity:

Info bphdb(pid=25373) dbclient waited 506826 times for empty buffer, delayed 2060759 times  
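
For anyone who wants to track these counters across jobs, here is a minimal sketch that pulls the numbers out of a message like the one above. The function name and regular expression are my own (not part of NetBackup), and the pattern assumes the message always has this exact wording. As I understand it, a process that keeps waiting for empty buffers is usually handing over data faster than the next stage can take it, but treat that as a rule of thumb rather than a diagnosis.

```python
import re

# Line copied from the job activity above.
LINE = ("Info bphdb(pid=25373) dbclient waited 506826 times for empty buffer, "
        "delayed 2060759 times")

# Hypothetical pattern: assumes the wording "<process> waited N times for
# <kind> buffer, delayed M times" never varies.
WAIT_RE = re.compile(
    r"(?P<process>\w+) waited (?P<waited>\d+) times for (?P<kind>\w+) buffer, "
    r"delayed (?P<delayed>\d+) times"
)

def parse_buffer_waits(line):
    """Return the wait/delay counters from a buffer-wait message, or None."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    return {
        "process": match.group("process"),
        "kind": match.group("kind"),
        "waited": int(match.group("waited")),
        "delayed": int(match.group("delayed")),
    }

stats = parse_buffer_waits(LINE)
print(stats)
```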

Yasuhisa Ishikawa's picture

I have about 20 TB of data to back up each day. We do full backups daily. I have 20 GB of RAM on each server.

Maybe your server does not meet the requirements. A "20 TB backing store" requires at least 20 GB of RAM; the requirement is based on storage size, not on the amount of daily backups.

XX TB of MSDP storage requires XX GB of RAM dedicated to the MSDP storage server. If other NetBackup roles (such as master server or OpsCenter) or other applications run on the same host (neither is recommended), you need more RAM.
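
As a rough illustration of the rule above (1 GB of RAM per 1 TB of MSDP storage, plus extra for anything else on the host), here is a small sketch. The function names and the `other_roles_gb` allowance are hypothetical; the 4 GB figure in the second call is just a placeholder, not a number from any guide.

```python
def min_ram_gb(storage_tb, other_roles_gb=0.0):
    """Minimum RAM under the 1 GB per 1 TB guideline, plus an allowance
    (assumed, not documented) for other roles on the same host."""
    return storage_tb * 1.0 + other_roles_gb

def meets_guideline(storage_tb, ram_gb, other_roles_gb=0.0):
    """True if the installed RAM covers the guideline."""
    return ram_gb >= min_ram_gb(storage_tb, other_roles_gb)

# Dedicated storage server: 20 TB of storage, 20 GB of RAM.
print(meets_guideline(20, 20))
# Same host also running other roles (placeholder allowance of 4 GB).
print(meets_guideline(20, 20, other_roles_gb=4))
```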

Authorized Symantec Consultant(ASC) Data Protection in Tokyo, Japan

Marianne's picture

Please read through both docs that I have posted above.

There is more to dedupe backups: 

Extract from TECH77575:

The following are the factors that most affect deduplication performance on media servers:
■ The speed at which data streams from the clients.
See “Client data stream speed” on page 14.
■ Data ingest rate on the media server.
See “Data ingest rate on servers” on page 14.
■ CPU of the media servers.
See “Media server CPU and deduplication” on page 15.
■ Write speed to disk on the storage server.
See “Write speed to disk” on page 17.
 
As well as:
Network capacity between the servers
etc....

Adnan F's picture

I've gone through the links provided; however, I believe our environment meets the recommendations.

To rule out problems with the PureDisk unit, I even tried backing up to a Basic Disk (same media server) but still got the same speed.

I have contacted support, who have asked for the bptm and bpbrm logs (on the media server) and the dbclient log (on the client).

Will update if we manage to find a solution. 

Adnan F's picture

Just going through the logs, I found the entries below, though I'm not sure whether they relate to the problem:

From bpbrm log:

23:11:01.963 [9944.10808] <8> file_to_cache_item: [vnet_addrinfo.c:6555] fopen() failed ERRNO=2 FILE=C:\Program Files\Veritas\NetBackup\var\host_cache\0c9\37d516c9+0,1,40a,2,1,0+"Client_Server_Name".txt

From the bptm log, there seems to be a continuous pattern of the entries below:

sts errno: 2060005
error message: sts_copy_extent returned with EBUSY or EAGAIN
returned value: 114032640
10:23:21.250 [8200.10656] <2> 97938:bptm:8200:"Media_Server": [DEBUG][proxy_copy_extent_v9]bytesCopied returned is 114032640
10:23:21.250 [8200.10656] <2> i_sts_copy_extent: offset from:732430336 to:732430336 length=51696369664 written=114032640 stserr=2060005 cpy_flag=2
10:23:21.250 [8200.10656] <2> i_sts_copy_extent: pxyh->b_imgoffset_dst=846462976 , pxyh->b_imgoffset_src=846462976, pxyh->b_written=846462976, pxyh->b_total=846462976
10:23:21.250 [8200.10656] <2> update_job_data: total kbytes = 826624 ts_delta_backup = 246918 rate calculation = 3347 (0 1 1)
10:23:21.250 [8200.10656] <2> update_job_data: delta kbytes = 183808 ts_delta = 62104 rate calculation = 2959
10:23:21.250 [8200.10656] <2> set_job_details: jobData (97938) 
10:23:21.250 [8200.10656] <2> send_structure_data: Index 36 Field m_nKbPerSec Value <3347>
10:23:21.250 [8200.10656] <2> ost_proxy_copy_whole_image: extent loop 8 r=2060005 (51696369664 114032640 846462976) 0 500
10:23:21.250 [8200.10656] <2> DUMPSTATE: image "Client_Name"_1360307923_C1_F1 (ev=0 ec=0) offset=846462976 LINE=2649
10:23:21.250 [8200.10656] <2> ost_proxy_copy_whole_image: image copy is not ready, retry attempt: 0 of 500 
10:23:21.250 [8200.10656] <2> ost_proxy_copy_whole_image: wrote short extent, length = 51696369664, bytesWritten = 114032640
10:23:22.264 [8200.10656] <2> i_sts_copy_extent: start to copy extent
10:23:22.264 [8200.10656] <2> 97938:bptm:8200:"Media_Server": [DEBUG][proxy_copy_extent_v9]prepare to call sts_copy_exent for image "Client_Server_Name"_1360307923
10:23:52.357 [8200.10656] <2> 97938:bptm:8200:"Media_Server": [DEBUG][proxy_copy_extent_v9]STSException with variable is caught in proxy_copy_extent_v9:
sts errno: 2060005
error message: sts_copy_extent returned with EBUSY or EAGAIN
returned value: 98041856
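
The two update_job_data lines in the log above can be checked by hand. The sketch below is my own reading of them, not documented behaviour: it assumes ts_delta is in milliseconds and that the rate is in KB per second, rounded down. Under that assumption the recomputed values match the logged ones (3347 and 2959), which is in line with the slow speeds seen in the job details.

```python
import re

# The two update_job_data lines copied from the bptm log above.
LOG = """\
10:23:21.250 [8200.10656] <2> update_job_data: total kbytes = 826624 ts_delta_backup = 246918 rate calculation = 3347 (0 1 1)
10:23:21.250 [8200.10656] <2> update_job_data: delta kbytes = 183808 ts_delta = 62104 rate calculation = 2959
"""

RATE_RE = re.compile(
    r"update_job_data: (\w+) kbytes = (\d+) ts_delta\w* = (\d+) "
    r"rate calculation = (\d+)"
)

def check_rates(log_text):
    """Recompute each logged rate, assuming ts_delta is in milliseconds."""
    results = []
    for kind, kbytes, millis, logged in RATE_RE.findall(log_text):
        recomputed = int(kbytes) * 1000 // int(millis)  # KB per second
        results.append((kind, int(logged), recomputed))
    return results

rates = check_rates(LOG)
for kind, logged, recomputed in rates:
    print(kind, "logged:", logged, "recomputed:", recomputed)
```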

Mark_Solutions's picture

The "image copy is not ready" message means your system is very busy. It will retry 500 times before failing or marking the disk down.

Adding the:

/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

with a value of 800 in it helps with timeouts and reduces the communications

I also agree with the others regarding the requirements, unless you really back up 20 TB of data to 20 TB of disk and do that every day?

If you back up 20 TB of data to 60 TB of disk, then you should have 60 GB of RAM.

The NetBackup Appliances have 92GB RAM as standard these days even if they only have 36TB of disk (or is it 48?)

I assume that these are Windows Servers using MSDP?

If so have you done any memory tuning?

You should be using these values on the Media Servers:

HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\

 DWORD - PoolUsageMaximum  - Decimal value of 40

 DWORD - PagedPoolSize Hex value of FFFFFFFF (this is 8 x F)

A reboot is required after setting the above values.
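
If it helps to keep the two values in a file for review, the sketch below writes them out in .reg text form. It only builds and prints the text; it does not touch the registry. The helper name is my own, and you should check the output against your own change process before importing anything.

```python
# Key and values as listed above; decimal 40 becomes 00000028 in hex.
KEY = (r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control"
       r"\Session Manager\Memory Management")
VALUES = {
    "PoolUsageMaximum": 40,        # decimal 40
    "PagedPoolSize": 0xFFFFFFFF,   # eight F characters
}

def to_reg_text(key, values):
    """Build .reg file text with one DWORD entry per value."""
    lines = ["Windows Registry Editor Version 5.00", "", "[" + key + "]"]
    for name, value in values.items():
        lines.append('"{}"=dword:{:08x}'.format(name, value))
    return "\r\n".join(lines) + "\r\n"

reg_text = to_reg_text(KEY, VALUES)
print(reg_text)
```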

Hope this helps

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Adnan F's picture

Thanks Mark.

I apologize; I guess I've not been very clear about my environment.

Our backup data is approx. 20 TB. We do full backups for the entire environment (Oracle, Exchange, VMware, etc).

Our backup architecture consists of the following:

1. Master cum Media server (10 TB of PD with 20 GB RAM)

2. Media Server (18TB of PD with 18 GB RAM)

3. SAN Media Server (8TB of PD with 8 GB RAM)

The Master and second Media server are Win 2008 R2 Ent, while the SAN Media server is Win 2008 Std.

Approx. 8 TB (of file server data) goes on the SAN Media Server, which is working perfectly fine. The remaining 12 TB of data is divided between the other two backup servers.

(I've come across a few backup professionals who stated that taking full backups of the entire environment may not be the best idea. I'm currently looking into this and hope to implement incremental backups, especially for the file servers.)

I'll try the various values, but have to be at my DC to make such changes. I'm working remotely right now.

Adnan F's picture

Just to give a brief idea, I have attached a snapshot from Resource Monitor showing the memory utilization.

Memory.png

Mark_Solutions's picture

I assume by PD you mean MSDP?

OK - I see spoold right up the top there, so I am thinking that by doing the full backups all of the time your queue processing is working overtime.

It may be that you need to run it more frequently to keep things streamlined.

It runs at midnight (never seems the right time to me, as backups are usually running then!) and midday.

If you could have it run more often, it would keep the queues a lot smaller and increase the performance of the systems (maybe even use Windows scheduler to kick off crcontrol --processqueue every 4 hours; it will just queue them up and run when the previous one stops).
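
A possible way to set up the 4-hourly run is a scheduled task. The sketch below only builds and prints the schtasks arguments; it does not create anything. The task name is made up, the crcontrol path assumes a default install under C:\Program Files\Veritas\pdde, and running as SYSTEM is my assumption, so adjust all three for your site.

```python
# Assumed default install path; change if pdde lives elsewhere.
CRCONTROL = r"C:\Program Files\Veritas\pdde\crcontrol.exe"

def build_schtasks_args(task_name, every_hours=4):
    """Return the argument list for creating an hourly-interval task."""
    command = '"{}" --processqueue'.format(CRCONTROL)
    return [
        "schtasks", "/Create",
        "/TN", task_name,        # hypothetical task name
        "/TR", command,          # program to run, quoted because of spaces
        "/SC", "HOURLY",
        "/MO", str(every_hours), # repeat interval in hours
        "/RU", "SYSTEM",         # assumption: run under the SYSTEM account
    ]

args = build_schtasks_args("MSDP-ProcessQueue")
print(args)
# To actually create the task on the media server you could pass this list
# to subprocess.run(args, check=True) from an elevated prompt.
```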

Take a look at your crcontrol --queueinfo and --processqueueinfo output to see what figures you have (\program files\veritas\pdde\).

It may be an advantage to start using accelerator backups so that you get full backups every day with better throughput and less data transferred through the system.

If you do plan to do that, then increase the worker threads from the default 64 to 128 (this may help you anyway in such a busy environment).

To do that, edit the contentrouter.cfg file and look for the WorkerThreads setting (needs an NBU service restart).

DATA_BUFFERS can help too - have you tuned those? I tend to use 64 for the number (tape and disk), 262144 for tape size and 1048576 for the size of disk. Note that you need all files in place or disk will use the size set for tape.
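
For reference, a sketch of writing those four values as touch files. It writes into a temporary directory so it is safe to run anywhere. On a real media server the files go in the NetBackup db\config folder, and the file names used here (NUMBER_DATA_BUFFERS and so on) are the ones I know, so do check them against the tuning guide for your version.

```python
import tempfile
from pathlib import Path

# Values suggested above: 64 buffers for tape and disk,
# 262144 bytes for tape and 1048576 bytes for disk.
BUFFER_FILES = {
    "NUMBER_DATA_BUFFERS": 64,
    "SIZE_DATA_BUFFERS": 262144,
    "NUMBER_DATA_BUFFERS_DISK": 64,
    "SIZE_DATA_BUFFERS_DISK": 1048576,
}

def write_buffer_files(config_dir):
    """Create one file per setting, each holding just the number."""
    config_dir = Path(config_dir)
    for name, value in BUFFER_FILES.items():
        (config_dir / name).write_text("{}\n".format(value))

# Demo against a throwaway directory instead of the real db\config folder.
with tempfile.TemporaryDirectory() as demo_dir:
    write_buffer_files(demo_dir)
    written = {p.name: p.read_text().strip() for p in Path(demo_dir).iterdir()}

print(written)
```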

Hope this helps

Adnan F's picture

Thanks again Mark. I looked into the buffer info and will try it as soon as I'm back at work.
 
I did run crcontrol, which gave the following:
 
C:\Program Files\Veritas\pdde>crcontrol --queueinfo
total queue size : 2200691394
creation date of oldest tlog : Fri Feb 08 12:23:34 2013
 
 
C:\Program Files\Veritas\pdde>crcontrol --processqueueinfo
Busy   : no
Pending: no
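
To compare these figures between runs, a small sketch that turns the two outputs above into values. The function names are my own, and the parsing assumes the "name : value" layout shown here does not change between versions. I have not assumed any unit for the queue size.

```python
QUEUEINFO = """\
total queue size : 2200691394
creation date of oldest tlog : Fri Feb 08 12:23:34 2013
"""

PROCESSQUEUEINFO = """\
Busy   : no
Pending: no
"""

def parse_fields(text):
    """Split each 'name : value' line on the first colon."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields

def queue_status(queueinfo_text, processqueueinfo_text):
    """Combine both outputs into one small summary."""
    queue = parse_fields(queueinfo_text)
    proc = parse_fields(processqueueinfo_text)
    return {
        "queue_size": int(queue["total queue size"]),
        "oldest_tlog": queue["creation date of oldest tlog"],
        "busy": proc["Busy"].lower() == "yes",
        "pending": proc["Pending"].lower() == "yes",
    }

status = queue_status(QUEUEINFO, PROCESSQUEUEINFO)
print(status)
```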
 
The queue size seems tremendous.
 
Adnan F's picture

Another observation from the Deduplication Guide 7.5 (pg 160): apparently the shared memory settings (in agent.cfg) should be as follows:

Path: storage_path\etc\puredisk\agent.cfg

SharedMemoryEnabled=1
SharedMemoryBufferSize=262144
SharedMemoryTimeout=3600
 
However my settings are:
 
SharedMemoryEnabled=1
SharedMemoryBufferSize=262144
SharedMemoryTimeout=86400
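
A quick sketch for comparing the two sets of values, in case there are more settings to check later. It only compares the text pasted above; the helper names are my own and it does not read the real agent.cfg.

```python
RECOMMENDED = """\
SharedMemoryEnabled=1
SharedMemoryBufferSize=262144
SharedMemoryTimeout=3600
"""

ACTUAL = """\
SharedMemoryEnabled=1
SharedMemoryBufferSize=262144
SharedMemoryTimeout=86400
"""

def parse_settings(text):
    """Read 'name=value' lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings

def differences(recommended_text, actual_text):
    """Return {name: (recommended, actual)} for settings that differ."""
    recommended = parse_settings(recommended_text)
    actual = parse_settings(actual_text)
    return {
        name: (value, actual.get(name))
        for name, value in recommended.items()
        if actual.get(name) != value
    }

diff = differences(RECOMMENDED, ACTUAL)
print(diff)
```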
 
Could the shared memory timeout value be the reason?

Mark_Solutions's picture

I wouldn't worry about the Shared Memory too much - you can change it if you wish ....

but the queue size is huge and is obviously not clearing down very much with each run, as it is not currently processing and has none queued.

Kick it off twice now; the first will run, the second will queue (crcontrol --processqueue).

When both have finished, check the size and do it another twice. Eventually it should come down in size, after which you just need to keep on top of it.

I guess you back up so much every day it is struggling to keep up, so it will need manual intervention to keep it clear (or Windows scheduler as I mentioned earlier - can't think where you change its standard schedule on Windows NBU).

Hope this helps

Adnan F's picture

Thanks Mark, I'm running "crcontrol --processqueue" right now; however, the total queue size (from the command crcontrol --queueinfo) seems to be increasing instead of decreasing.

Mark_Solutions's picture

Been a bit tied up the last few days but wondering how things were going with this?
