MB garbage collection running but does not appear to be doing anything

Article:TECH187481  |  Created: 2012-04-26  |  Updated: 2012-07-21  |  Article URL http://www.symantec.com/docs/TECH187481
Article Type
Technical Solution


Issue



MB garbage collection (MBGC) starts on node1 (mbe,cr) and runs for days without appearing to do anything.
MBGC does not start on the other mbe,cr nodes because the job on this node never completes.
Assistance is needed to determine why this happens and what to do about it.

The first step is to determine what MB garbage collection is actually doing.
PDDO data removal, cr garbage collection and queue processing are all running successfully.

MB garbage collection has not run successfully for over one month; the last time it was killed, it had been running for 170 hours.
The job gets stuck on node pdnode03 and does not move on to the other nodes.

We can see that the file Storage/tmp/workflow.1993677 is created, but it is not updated after the DEREF line.
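
One way to confirm that the workflow file has really stopped changing is to compare its modification time with the current time. The sketch below assumes GNU stat (as found on Linux); the function name seconds_since_modified is illustrative and is not part of PureDisk.

```shell
# Hypothetical helper: print how many seconds ago a file was last modified.
# Assumes GNU stat, where "stat -c %Y" prints the mtime in seconds since the epoch.
seconds_since_modified() {
    smod_now=$(date +%s)
    smod_mtime=$(stat -c %Y "$1") || return 1
    echo $(( smod_now - smod_mtime ))
}

# Example: seconds_since_modified /Storage/tmp/workflow.1993677
```

A value that keeps growing between checks means nothing has written to the file in the meantime.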

Running the following from Storage/tmp:
# lsof | grep workflow.1993677
shows four open file descriptors on that file, held by two processes (pdagent and php):

pdagent 29915 root 9u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
php 31801 root 1u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
php 31801 root 2u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
php 31801 root 9u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
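
The lsof listing has one line per open file descriptor, not one per process, so counting lines overstates the number of processes. A minimal sketch for summarizing such output, assuming the default lsof column layout (COMMAND, PID, USER, FD, ...); the function name summarize_lsof is illustrative:

```shell
# Hypothetical helper: read lsof output on stdin and report how many distinct
# processes (COMMAND + PID) hold how many open descriptors.
summarize_lsof() {
    awk '{
             fds++
             key = $1 " " $2              # COMMAND and PID identify a process
             if (!(key in seen)) { seen[key] = 1; procs++ }
         }
         END { printf "%d processes, %d open descriptors\n", procs, fds }'
}

# Example: lsof | grep workflow.1993677 | summarize_lsof
```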

top (or an equivalent tool) shows high CPU utilization for postmaster, so the database appears to be doing something.


Environment



PureDisk 6-node VCS cluster
PureDisk 6.6.1.2 EEB20 with the latest rollup2 version installed.
There are 5 active nodes (1 SPA and 4 mbe,cr nodes) and 1 spare node.
Three backup environments (two running 6.5.6 and one running 7.1.0.3) send PDDO backups to this storage pool.

The problem is with node1 (mbe,cr).
MB garbage collection appears to be running but is not doing anything.

Node 1 "pdnode03" (mbe,cr)
db Name       db ID      db Size (bytes)  Approx. size
------------  ---------  ---------------  ------------
crdb          479618193   33,661,723,124  ( 31.3 Gb)
mb                16387   13,476,079,092  ( 12.6 Gb)
postgres          10819        4,001,268  (  3.8 Mb)
------------  ---------  ---------------  ------------
Total                     47,141,803,484  ( 43.9 Gb)
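
The human-readable figures in this listing correspond to binary units (1 Gb here is 1024 x 1024 x 1024 bytes). A small sketch that reproduces the conversion; bytes_to_gib is an illustrative name, not a PureDisk tool:

```shell
# Hypothetical helper: convert a byte count to binary gigabytes, one decimal place.
bytes_to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Example: bytes_to_gib 13476079092
```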


Solution



Rebuild the MBE database on the problem mbe,cr node.

Export the database password first. Run this before any of the psql commands below:
 
# export PGPASSWORD=`/opt/pdag/bin/php /opt/pdconfigure/upgrade/generic/actions/lib/GetEntry.php /etc/puredisk/spa.cfg spadb password`
 
 ----> To dump/import mb db on pdnode03:
 
-- Gracefully terminate any running MBGC jobs.
-- Stop the Linux scheduler 'cron' (service cron stop) on the SPA node and, if PDDO is involved, down the disk pool in NBU.
-- Wait for the remaining jobs to complete.
-- Once no jobs are running, stop all PureDisk services.
 
-- Check the size of the mb db on pdnode03:
 
To see how large the databases are:
 
# du --max-depth=1 -h /Storage/databases/pddb/data/base
 
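- To see which directory (named after the database OID) belongs to which database. This query needs pddb to be running; if the services are already stopped, run it after starting pddb in the next step: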
 
# /opt/pddb/bin/psql -U pddb ca -x -c "select OID,datname from pg_database"
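
The two outputs above can be combined so that each directory size is labeled with its database name. This is a sketch only: it assumes the OID list has been saved in unaligned form ("oid|datname", for example by adding psql's -At options to the query above) and that the du output has been saved to a second file. The function name label_db_sizes and both file names are illustrative.

```shell
# Hypothetical helper: join "oid|datname" lines with du output ("size path").
# $1 = file with oid|datname lines, $2 = file with du output.
label_db_sizes() {
    awk -F'|' '
        NR == FNR { name[$1] = $2; next }      # first file: remember OID -> name
        {
            split($0, f, "[ \t]+")             # du line: size, then path
            n = split(f[2], p, "/")            # last path component is the OID
            if (p[n] in name) print name[p[n]] "\t" f[1]
        }' "$1" "$2"
}

# Example: label_db_sizes /tmp/oids.txt /tmp/du.txt
```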
 
-- start the database portion (pddb):
 
# /etc/init.d/puredisk start pddb
 
Next, dump, rename, recreate and import the mb db on pdnode03:
 
# /opt/pddb/bin/pg_dump -U pddb mb -f /Storage/tmp/mb.sql
   
Note: after the above command completes, run 'tail /Storage/tmp/mb.sql' to confirm that the dump completed without errors. The file should end with lines like these:
 
           --
           -- PostgreSQL database dump complete
           --
 
If the .sql file ends with the lines above, proceed. If not, do not proceed; the mb db dump failure must first be investigated for PostgreSQL issues.
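
The check on the end of the dump file can also be scripted so that the rename step is only attempted when the marker is present. A minimal sketch; the function name dump_is_complete is illustrative:

```shell
# Hypothetical helper: succeed only if the last lines of a pg_dump SQL file
# contain the completion marker that pg_dump writes at the end of a dump.
dump_is_complete() {
    tail -n 5 "$1" | grep -q 'PostgreSQL database dump complete'
}

# Example:
# dump_is_complete /Storage/tmp/mb.sql && echo "dump looks complete" || echo "do NOT proceed"
```

The marker only shows that pg_dump reached the end of its output; any error messages printed while the dump was running still need to be reviewed.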
 
Next step, rename:
 
# /opt/pddb/bin/psql -U pddb template1 -c "alter database mb rename to mb_old"
 
Create new mb db:
 
# /opt/pddb/bin/psql -U pddb template1 -c "create database mb"
 
Import from dump file:
 
# /opt/pddb/bin/psql -U pddb mb -f /Storage/tmp/mb.sql
 
-- Recheck the sizes. The new mb database should be much smaller than the original (now mb_old):
 
# /opt/pddb/bin/psql -U pddb ca -x -c "select OID,datname from pg_database"
# du --max-depth=1 -h /Storage/databases/pddb/data/base
 
Bring up the PureDisk services and enable the following logging in case there are problems with MB garbage collection.
 
On the SPA node:
 - Put pdagent on the server nodes in debug mode (pdagent --debug)
 - Run MBGC
 - After it fails, find the correct pd_jobstep_<jobstepid>.php
 (a "grep MBGarbage /Storage/tmp/pd_jobstep*" will find it)
 - Run the script manually:  /opt/pdag/bin/php /Storage/tmp/pd_jobstep_xyz.php
 (preferably in screen)
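
The grep in the steps above prints every matching line; to get only the names of the candidate jobstep files, grep's -l option can be used. A small sketch, with find_mbgc_jobsteps as an illustrative name and /Storage/tmp as the default directory taken from the steps above:

```shell
# Hypothetical helper: list the pd_jobstep files that mention MBGarbage.
# $1 = directory to search (defaults to /Storage/tmp).
find_mbgc_jobsteps() {
    fmj_dir=${1:-/Storage/tmp}
    grep -l MBGarbage "$fmj_dir"/pd_jobstep* 2>/dev/null
}

# Example: find_mbgc_jobsteps /Storage/tmp
```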
 
If there are still problems, there is most likely a problem with the pdagent/pdwfe.

---- Further troubleshooting:


If running the jobstep script manually works, the issue is with pdagent/pdwfe. Check those logs first.

If the manual run fails as well, but without a useful error:
 - Find the Application::start(); line in the jobstep PHP file
 - Add the following 2 lines just after it:
  Debug::$debug = true;
  Debug::$debug2Screen = true;
 - Rerun and check the debug output for errors (it might be very messy)
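
Rather than editing the jobstep file by hand, the two debug lines can be added to a copy of it. The sketch below writes the modified script to standard output and leaves the original untouched; the function name add_debug_lines and the output file name in the example are illustrative.

```shell
# Hypothetical helper: print a jobstep PHP file with the two debug lines
# inserted right after the Application::start(); line.
add_debug_lines() {
    awk '{ print }
         /Application::start\(\);/ {
             print "Debug::$debug = true;"
             print "Debug::$debug2Screen = true;"
         }' "$1"
}

# Example:
# add_debug_lines /Storage/tmp/pd_jobstep_xyz.php > /Storage/tmp/pd_jobstep_xyz_debug.php
# /opt/pdag/bin/php /Storage/tmp/pd_jobstep_xyz_debug.php
```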




