Best Practices: NetBackup PureDisk 6.6.1.x/663a or 50x0 220.127.116.11 appliance Content Router Rerouting Checklist
Article: TECH168356 | Created: 2011-08-29 | Updated: 2012-04-26 | Article URL: http://www.symantec.com/docs/TECH168356
Rerouting distributes information to the content routers in a storage pool. After you add a new content router to a storage pool, you activate the new content router and reroute the data on your content routers. The rerouting process redistributes the data evenly across all activated content routers in a storage pool. During the rerouting process, the content routers still send and receive data. It is normal for the rerouting process to take days, even a week or more to complete.
• Once rerouting is running, you cannot add or remove another content router (CR) until rerouting has finished successfully.
• Backups will be slower during rerouting. The slowdown is most pronounced at the beginning of the rerouting process, and backup speed improves again as rerouting progresses. Additionally, active backups/duplications can slow down the rerouting process.
• Expiration of images will not lead to additional free space during rerouting.
• On an idle system (no backups, no compaction) rerouting should be able to move up to about 1 TB per hour. On a very busy system, however, it can take much longer.
Because of these limitations, it is important to plan the addition of CRs carefully and well in advance.
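As a rough planning aid, the 1 TB per hour figure can be turned into a lower bound on rerouting time. The sketch below assumes that you supply the amount of data to be moved yourself. The function name reroute_min_hours is hypothetical and not part of PureDisk, and a busy system will take longer than the result suggests.

```shell
# Estimate a best-case rerouting duration in hours.
#   $1 = amount of data to move, in TB (your own estimate)
#   $2 = sustained rate in TB per hour (defaults to 1, the idle-system figure)
reroute_min_hours() {
    awk -v tb="$1" -v rate="${2:-1}" 'BEGIN {
        if (rate <= 0) exit 1          # refuse a zero or negative rate
        printf "%.1f\n", tb / rate
    }'
}

# Example: 20 TB on an idle system, then on a system managing only 0.25 TB/hour.
#   reroute_min_hours 20
#   reroute_min_hours 20 0.25
```

Treat the result as a minimum: it ignores backup load, compaction, and the transaction log processing that follows the data movement.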
NetBackup PureDisk 6.6.1.x and 50x0 appliance, and later versions
Please check the much newer document attached to TECH187051 (Best Practices: NetBackup PureDisk: Adding a Content Router and running Rerouting), before continuing with this tech note. The mentioned document contains a lot of information that also applies to both PureDisk 18.104.22.168 and 6.6.3.
The items listed below are recommendations to ensure rerouting jobs run without the need for manual intervention. The rerouting procedure is covered in depth in Chapter 13, 'Storage Pool Management', of the Symantec NetBackup PureDisk™ Administrator's Guide Release 6.6.1.
The following are items that can be checked prior to the commencement of the rerouting process.
For PureDisk 22.214.171.124, either apply the Hotfix Bundle 'NB_PDE_126.96.36.199_EEB20-ET2399563_rollup2', which Symantec released to address multiple issues (http://www.symantec.com/docs/TECH162680), or upgrade to PureDisk 6.6.3a.
Ensure there are no SCSI/HBA errors reported in the messages log of the PDLinux Operating System. This log is located at /var/log/messages. If a review of /var/log/messages indicates that a SCSI/HBA error/event took place in the past and no corrective action was performed, engage Symantec Enterprise Technical Support to assist with ensuring the consistency of the data on the Storage Pool.
Check the Content Router Storage Daemon logs (storaged.log) to ensure they are free of errors. These logs capture debug messages involving the Content Router Queue Processing Workflow; the Rerouting Workflow also logs to these files. The current log file is located at /Storage/log/spoold/storaged.log.
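The two log checks above could be scripted along the following lines. This is a minimal sketch: the helper name count_log_matches is hypothetical, and the search patterns in the usage comments are assumptions about what an error line looks like, not strings documented by Symantec. Adjust them to what your logs actually contain.

```shell
# Count the lines in a log file that match a case-insensitive extended regex.
# Prints 0 when nothing matches or when the file is missing or unreadable.
count_log_matches() {
    pattern=$1
    file=$2
    if [ -r "$file" ]; then
        # grep -c prints the count; "|| true" hides its non-zero exit on 0 matches
        grep -Eic -- "$pattern" "$file" || true
    else
        echo 0
    fi
}

# Usage (patterns are illustrative assumptions):
#   count_log_matches 'scsi|hba|i/o error' /var/log/messages
#   count_log_matches 'error' /Storage/log/spoold/storaged.log
```

A non-zero count is only a prompt to read the matching lines; it does not by itself prove a hardware or consistency problem.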
If there were any environmental issues in the past that may have caused corruption, an additional check can be done to ensure there are no fingerprints stored on the PureDisk Content Router that may have been tagged with a Corrupt status. The following command can query the Content Router database and identify if any corrupt fingerprints are still stored on the existing Content Router. The corrupt fingerprints are redirected to the file /Storage/corruptFP.txt and the output of the command lists the count of stored corrupt fingerprints. If the output of the below command is a positive integer, please engage Symantec Enterprise Technical Support to assist with removing the references to these corrupt fingerprints from the PureDisk Content Router.
# /opt/pdcr/bin/dbutil -L --string | egrep ' ,1,4,' | tee /Storage/corruptFP.txt | wc -l
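Once /Storage/corruptFP.txt has been written, the go/no-go decision described above can be made from the file itself. The sketch below assumes one corrupt fingerprint per line in that file; the function name check_corrupt_fps is hypothetical.

```shell
# Inspect the corrupt-fingerprint file and report whether it is safe to proceed.
#   $1 = path to the file (defaults to /Storage/corruptFP.txt)
# Returns 0 if the file is empty, 1 if it lists fingerprints, 2 if it is missing.
check_corrupt_fps() {
    file=${1:-/Storage/corruptFP.txt}
    if [ ! -f "$file" ]; then
        echo "missing: $file (run the dbutil query first)"
        return 2
    fi
    # tr strips the padding that some wc implementations add
    count=$(wc -l < "$file" | tr -d ' ')
    if [ "$count" -gt 0 ]; then
        echo "$count corrupt fingerprint(s) found; contact support before rerouting"
        return 1
    fi
    echo "no corrupt fingerprints found"
}
```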
Steps to perform prior to initiating storagepool contentrouter rerouting:
1. Stop cron on the PureDisk SPA node to prevent any jobs from running:
# service cron stop
2. Set the NetBackup diskpool/diskvolume pointing to the PureDisk storagepool to be rerouted to a DOWN state:
nbdevconfig -changestate -stype PureDisk -dp pool_name_here -dv PureDiskVolume -state DOWN
nbdevconfig -updatests -storage_server puredisk-hostname-here -media_server mediaserver-hostname-here -stype PureDisk
3. If PureDisk 188.8.131.52, manually process the transaction log queue on each contentrouter:
a. Manually execute 'PDDO Data Removal' policy and any 'Data Removal' policies one time.
b. Manually execute 'CR Garbage Collection' one time.
c. Manually execute the 'CR Queue Processing' policy. After it completes, verify that the queue is empty on each content router via:
# /opt/pdcr/bin/crcontrol --queueinfo ...you should see "total queue size : 0". If not, repeat step 3c.
d. Manually execute 'CR Garbage Collection' one time.
e. Check the queue via: # /opt/pdcr/bin/crcontrol --queueinfo ...you should see "total queue size : 0". If not, repeat step 3c.
4. If PureDisk 6.6.3a / 50x0 appliances running 184.108.40.206, manually process the transaction log queue on each contentrouter:
a. Run the 'CR Queue Processing' policy.
b. Manually execute the 'CR Queue Processing' policy. After it completes, verify that the queue is empty on each content router via:
# /opt/pdcr/bin/crcontrol --queueinfo ...you should see "total queue size : 0". If not, repeat step 4b.
Once the queue is empty on all content routers, proceed with the next steps:
5. If this is a NetBackup PureDisk 50xx appliance based storagepool, disable Patrol Read on each contentrouter's RAID controller (to prevent Patrol Read from running and causing a disk I/O performance slowdown of 30% or more during rerouting):
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpPR -Dsbl -aALL
6. (Optional) If each content router has at least 80% free space per 'df -h' for the volume where /Storage/data exists, disable compaction on each content router to improve rerouting performance:
# /opt/pdcr/bin/crcontrol --compactoff
7. On each NetBackup media server credentialed to write to the deduplication diskpool (and each NetBackup client doing client-side deduplication aka client-direct) disable the pdplugin cache by appending this line to the pd.conf file:
CACHE_DISABLED = 1
Windows pd.conf location: <install_path>\NetBackup\bin\ost-plugins\
UNIX/Linux pd.conf location: /usr/openv/lib/ost-plugins/
8. Initiate Rerouting (either activate the new content router or go to Settings > Topology and click 'Reroute Storagepool' button).
9. After the PureDisk GUI rerouting job has reached step 13 "(53%) Set Mode on GET=yes, PUT=yes, STORAGED=yes, REROUTE=yes", you can set the NetBackup diskpool/diskvolume back to an UP state and allow NetBackup deduplication backups/duplications to run. You can also start cron on the SPA node again at this point.
NOTE: It is completely normal for the Rerouting job in the PureDisk GUI to remain at certain steps for many hours or even days. Rather than relying solely on the PureDisk GUI for progress indicators, use these commands:
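Several of the preparation steps above lend themselves to small helper functions. The following is a sketch under stated assumptions, not a Symantec-supplied script: the function names are hypothetical, the queue check assumes the "total queue size : 0" line appears exactly as quoted in steps 3 and 4, and the free-space check parses 'df -P' (POSIX format) rather than 'df -h' because its columns are easier to parse reliably.

```shell
# Steps 3 and 4: succeed only if stdin contains a "total queue size : 0" line.
queue_is_empty() {
    grep -Eq 'total queue size[[:space:]]*:[[:space:]]*0[[:space:]]*$'
}

# Step 6: read `df -P` output on stdin and succeed if the filesystem on the
# second line has at least $1 percent free (default 80).
has_free_pct() {
    awk -v min="${1:-80}" '
        NR == 2 { sub(/%/, "", $5); ok = (100 - $5 >= min) }
        END     { exit (ok ? 0 : 1) }'
}

# Step 7: append CACHE_DISABLED = 1 to a pd.conf file unless that exact
# setting is already present as an active line. An existing line with a
# different value is left untouched.
disable_pd_cache() {
    conf=$1
    if grep -Eq '^[[:space:]]*CACHE_DISABLED[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$conf" 2>/dev/null; then
        return 0
    fi
    # Make sure the new setting starts on its own line.
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi
    printf 'CACHE_DISABLED = 1\n' >> "$conf"
}

# Usage on the relevant hosts:
#   /opt/pdcr/bin/crcontrol --queueinfo | queue_is_empty || echo "queue not empty yet"
#   df -P /Storage/data | has_free_pct 80 && /opt/pdcr/bin/crcontrol --compactoff
#   disable_pd_cache /usr/openv/lib/ost-plugins/pd.conf
```

Running disable_pd_cache twice leaves a single CACHE_DISABLED = 1 line, which avoids cluttering pd.conf when the checklist is repeated.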
-- From SPA (storagepool authority) node, run these to check (ignore the errors/lack of info for the new node):
- current rerouting status, run:
/opt/pdinstall/lib/libssh -n allnodes -v -c "/opt/pdcr/bin/crcontrol --progressinfo 1"
- observe the updates to storaged.log for reroute activity on each node (again ignore the lack of info for new node):
/opt/pdinstall/lib/libssh -n allnodes -v -c "grep -i erout /Storage/log/spoold/storaged.log | tail"
- current disk space usage, first one from operating system, second one from dsstat:
/opt/pdinstall/lib/libssh -n allnodes -v -c "df -h"
/opt/pdinstall/lib/libssh -n allnodes -v -c "/opt/pdcr/bin/crcontrol --dsstat"
NOTE 2: On each content router except the newly added one, '/opt/pdcr/bin/crcontrol --progressinfo 1' will eventually report:
"No progress information is available at this time."
....and 'grep -i erout /Storage/log/spoold/storaged.log | tail' will report:
"INFO : Reroute: everything has been rerouted."
....however, each content router will then execute a 'reroute --processqueue' command automatically to process the reroute.tlog* transaction logs, which will also take a long time to complete. This is expected, normal behavior.
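The two completion messages quoted above can be checked together. The sketch below is illustrative only: the function name reroute_scan_finished is hypothetical, and it relies solely on the two message strings quoted in this note. It says nothing about whether the subsequent 'reroute --processqueue' run has completed.

```shell
# Succeed when both completion markers are present on a content router:
#   $1 = captured output of `crcontrol --progressinfo 1`
#   $2 = path to storaged.log
reroute_scan_finished() {
    progress=$1
    log=$2
    case $progress in
        *"No progress information is available at this time."*) ;;
        *) return 1 ;;
    esac
    # Look at the most recent reroute messages only, as the manual check does.
    grep -i 'erout' "$log" 2>/dev/null | tail |
        grep -q 'Reroute: everything has been rerouted\.'
}

# Usage on a content router:
#   out=$(/opt/pdcr/bin/crcontrol --progressinfo 1)
#   reroute_scan_finished "$out" /Storage/log/spoold/storaged.log && echo "data movement reported complete"
```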
If rerouting has already been initiated, the storaged.log captures debug messages from this operation. The messages related to Rerouting can be identified in the storaged.log file by using grep/zgrep commands as follows. If the Rerouting workflow encounters problems, the Content Router will automatically retry the operation. If the rerouting job keeps retrying, redirect the output of the following commands into a file and engage Symantec Enterprise Technical Support with the corruptFP.txt file and the output of these commands. The first command provides the messages from the most recent storaged.log file. The latter command captures messages from previous files that were logrotated.
# grep erout /Storage/log/spoold/storaged.log
# ls -tr /Storage/log/spoold/storaged.log.* | xargs zgrep erout
If performing NetBackup backups to a PureDisk Deduplication Option (PDDO) Storage Unit, set the parameter CACHE_DISABLED = 1 in the pd.conf file to avoid potential failures when NetBackup writes backup data to the Storage Unit. If the existing Content Routers in the PureDisk Storage Pool are very close to being full (90% space utilized based on the output of the command /opt/pdcr/bin/crcontrol --dsstat), write only a minimal set of NetBackup backups to the PDDO Storage Unit so that contention between incoming and rerouted data is minimized. The pd.conf file is located at the following path on NetBackup Media Servers:
UNIX: /<installpath typically usr>/openv/lib/ost-plugins/pd.conf
In some instances where the Rerouting task may have been interrupted, the data may have been rerouted successfully without this being reflected in the web GUI. In other words, the status of an added/removed Content Router node may not show the ACTIVE/INACTIVE state respectively under the web GUI > Settings > Topology > CRnodename. If this condition is observed, restarting the Rerouting Workflow from the GUI is sufficient to allow the web GUI components to be updated.
PureDisk 220.127.116.11 Service Request - Troubleshooting rerouting jobs in PureDisk 18.104.22.168