
Netbackup SLP Backlog Report and Troubleshooting

Created: 29 Feb 2012 • Updated: 16 Jul 2014 | 23 comments

                To troubleshoot an SLP backlog we first need to define what a backlog is: essentially, it is data that has not yet been duplicated to the second (or Nth) destination configured under the SLP.

For example:

                SLP Name: SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1

                                C1: Copy 1 (Backup)

                                1week: 1 Week Retention

                                MedA-disk1-dsu1: Backup Destination into STU MedA-disk1-dsu1

                                C2: Copy 2 (Duplicate)

                                3month: 3 Month Retention

                                tape1-stu1: Duplication Destination into STU tape1-stu1

                The Storage Lifecycle Policy SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1 backs up data to a local disk DSU named MedA-disk1-dsu1 and, when the backup is done, duplicates the data from MedA-disk1-dsu1 to MedA-tape1-stu1. If for any reason the images stored on MedA-disk1-dsu1 do not get duplicated to MedA-tape1-stu1, a backlog starts to build on MedA-disk1-dsu1, and because the SLP's nature is to keep every image on MedA-disk1-dsu1 with an infinite retention until its duplication succeeds, at some point MedA-disk1-dsu1 will fill up and all backups will fail. Only once the duplication is successful does the retention of the images on MedA-disk1-dsu1 change to 1 week, so they can eventually expire.
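
                To see this in action for a single SLP you can ask nbstlutil directly. The sketch below only uses options that appear later in this article and simply counts the image records (the "I" lines) still waiting for their second copy; the SLP name is the sample one above.

                nbstlutil list -lifecycle SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1 -image_incomplete -l | awk '$2=="I"' | wc -l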

                Now imagine a scenario where we have 20 different SLPs with different destinations, either for backups or duplications. Troubleshooting this can be a real challenge, which is why I created an ordered troubleshooting procedure to better find bottlenecks and potential configuration issues.

                During this article we will create a set of KSH functions that end up in a final script. The main idea is to learn to read each function's output and know how to interpret each piece of data; the other key point is to collect all the needed info in order to deliver a better solution.

Steps:

  1. Dump SLP incomplete images
  2. Get local Disk free space (AdvancedDisk for this sample) new
  3. Count total Backlog
  4. Count Images by SLP Status new
  5. Count Backlog held by Media Server
  6. Count total Backups in the last 24 hours
  7. Top Clients Backlog new
  8. Split Images Count by size ranges new
  9. Count total duplications on a daily basis
  10. Count Backlog by SLP.

Dump SLP incomplete images.

               First we need to dump all the images that haven't been duplicated and send that data to a file; the following code dumps all incomplete images into a log file.

               #!/bin/ksh

               SCRIPT_NAME=$(basename $0 .ksh)      # set by the attached script; shown here so the snippet runs standalone

               LOG_DIR=/var/log/$SCRIPT_NAME/logs

               SLP_DUMP=$LOG_DIR/$SCRIPT_NAME.DUMP

               mkdir -p $LOG_DIR                    # make sure the log directory exists

               nbstlutil list -l -image_incomplete > $SLP_DUMP

 

Get local Disk free space (AdvancedDisk for this sample)

                To keep this sample simple we use AdvancedDisk as our first destination. To know where we stand, and in order to keep delivering healthy backups, we must know how much free space is left on all our media servers.

The first step is to dump the disk volume list; with awk we grab only the DSU name and free space, and we count the total available space and the free space per media server.

        nbdevquery -listdv -stype AdvancedDisk -l | awk '

        {

                SER[$2]+=$7  #Array that holds each DSU Name and stores free space.

                TOTAL+=$7   # Variable that counts the total free space between all DSU’s.

        }

Once the awk main body is done we print the headers, the total free space, and each DSU value by going through each cell of the SER[] array.

        END {

                printf ("%-30s%.2f TB\n\n", "Total Free Space", TOTAL/1024)

                printf ("%-30s%s\n", "Media Server", "Free Space")

                for (INC in SER) {printf ("%-30s%.2f TB\n",INC, SER[INC]/1024) }

        }'

 

Output sample:

 

Total Free Space                           4 TB

 

Media Server DSU                        Free Space

MedA-disk1-dsu1                          1.00 TB

MedB-disk1-dsu2                          1.00 TB

MedC-disk1-dsu3                          1.00 TB

MedD-disk1-dsu4                        1.00 TB

 Update:

            After troubleshooting other sites with different technologies such as PureDisk or DataDomain, I realized that limiting this script to AdvancedDisk wasn't that helpful, so I decommissioned this function and created a new one that can detect any technology and provide the free space by DSU.

            We have also found that cron is not that smart and sometimes needs help to find the NetBackup commands, so we introduced the path of each NBU command into a variable so the script can be used cleanly under cron.

            >$LOG_DIR/total.log                 # Wipe out the total.log file used to summarize free space.

            NBDEVQUERY=/usr/openv/netbackup/bin/admincmd/nbdevquery

 

            printf "%-30s%-20s%s\n" "Media Server" "Storage Type" "Free Space"

 

            After we print the header of the function, instead of dumping only the AdvancedDisk info we go through the list of storage server types (STS) and loop over them, gathering the free space for each one. The logic is the same as the old AdvancedDisk function; we only introduced a new column in the output that tells us which technology the DSU is using.

      

            $NBDEVQUERY -liststs -l | awk '{print $3}' | sort -u | while read STYPE

            do

                        $NBDEVQUERY -listdv -stype $STYPE -l | awk '

                        {

                                    SER[$2]+=$7

                                    TOTAL+=$7

                        }

                        END {

                                    print TOTAL >> "'$LOG_DIR'/total.log"

                                    for (INC in SER)

                                                {printf ("%-30s%-20s%.2f TB\n",INC, awkSTYPE, SER[INC]/1024)}

                        }' awkSTYPE=$STYPE   

            done

 

            We also learned a new trick to help a bit with performance and avoid re-running the whole thing just to sum the total free space: as you noticed, we print the TOTAL variable into a total.log file, which ends up holding one total per storage type, and with the following loop we simply go through that file and compute the sum.

  

        awk '

        { TOTAL+=$1 }

        END {

            printf ("\n%-50s%.2f TB\n\n", "Total Free Space", TOTAL/1024)

        }' $LOG_DIR/total.log

 

Output Sample:

 

Media Server                  Storage Type        Free Space

MedA-disk1-dsu1          AdvancedDisk      1.00 TB

MedB-disk1-dsu2          AdvancedDisk      1.00 TB

MedC-disk1-dsu3          PureDisk             1.00 TB

MedD-disk1-dsu4          DataDomain        1.00 TB

 

Count total Backlog

                The next step is to know how much data we are holding, or better said, how much has not been duplicated. This comes from step 1, where we dumped the data into the $SLP_DUMP file.

     First we sum the size of every image fragment that hasn't been duplicated:

        awk '

        $2=="F" {SUM+=$14}

 

     When the sum is done we print the total in TB:

 

        END {

                printf ("%-30s%.2f TB\n\n", "Total Backlog ", SUM/1024/1024/1024/1024 )

        }' $SLP_DUMP

Output sample:

Total Backlog                 120 TB

 

Count Images by SLP Status

            There is a key piece in backlog troubleshooting: the image state. Knowing the status of the images is priceless for making better decisions. In the next piece of code we count the images and sum their sizes by SLP state. We only handle the 6 main states (there are others), but just knowing how many images are NOT_MANAGED or IN_PROCESS should be enough to tell whether we have corrupted images or policies not using SLPs.

            Awk goes through the dumped images and compares the 11th column against the values 0, 1, 2, 3, 9 and 10, which represent the NOT_MANAGED, NOT_STARTED, IN_PROCESS, COMPLETE, NOT_STARTED_INACTIVE and IN_PROCESS_INACTIVE states. Once the right value is located it is translated into a string, and we add 1 to an array cell indexed by that state string; the idea is to go through the array in the awk END block and print all states with a simple loop.

            printf "%-30s%-15s%s\n" "IMAGE STATUS" "IMAGES COUNT" "SIZE"

            awk '

            $2=="I" {

                        IMAGE=$4

                        STATE_COL=$11

                  

                        if (STATE_COL == 0)       STATE = "NOT_MANAGED"

                        else if (STATE_COL == 1)  STATE = "NOT_STARTED"

                        else if (STATE_COL == 2)  STATE = "IN_PROCESS"

                        else if (STATE_COL == 3)  STATE = "COMPLETE"

                        else if (STATE_COL == 9)  STATE = "NOT_STARTED INACTIVE"

                        else if (STATE_COL == 10) STATE = "IN_PROCESS INACTIVE"

                        else STATE = "OTHER"

                  

                        IMG_STATE_LIST[STATE]+=1

            }

      

            To know the fragment size of the image captured in the previous awk block we compare columns 2 and 4: the first to make sure we are on a fragment (F) line, and the second to make sure the awk loop hasn't moved on to a different image. A second array sums the fragment sizes by the same STATE captured in the previous condition.

            $2=="F" && $4==IMAGE {

                        IMG_SUM[STATE]+=$14      

            }

 

            Once we have gone through the dump file we just print the results by walking the arrays, showing the image count and the total storage queued per SLP state.

            END {

                        for (STATE_ELM in IMG_STATE_LIST)

                                    printf ("%-30s%-15d%.2f TB\n",  STATE_ELM, IMG_STATE_LIST[STATE_ELM], IMG_SUM[STATE_ELM]/1024/1024/1024/1024)

            }' $SLP_DUMP | sort

            printf "\n\n"

 

            In our sample we have a total of 20,000 images in backlog, but 15,000 are IN_PROCESS, 2,000 are NOT_MANAGED and 3,000 are NOT_STARTED. Each of these states demands different actions, but as a starting point it is worth finding out why those 2,000 images are in the NOT_MANAGED state; in my experience those are either bad images or backup policies not using SLPs, and if you see many status 800 errors it is very likely a list of bad images.

Output Sample:

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    15000          100.00 TB

NOT_MANAGED                   2000           10.00 TB

NOT_STARTED                   3000           10.00 TB
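
            If the NOT_MANAGED bucket looks suspicious, a quick follow-up (a minimal sketch that assumes the same column layout as the dump above; the not_managed_ids.log name is just an example) is to pull those backup IDs out of the dump so they can be checked one by one:

            awk '$2=="I" && $11==0 {print $4}' $SLP_DUMP > $LOG_DIR/not_managed_ids.log      # backup IDs in NOT_MANAGED state

            wc -l < $LOG_DIR/not_managed_ids.log                                             # how many images need a closer look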

 

Count Backlog by Media Server

                Knowing the total backlog only tells us how bad our duplication SLA is. The next step is to slice the dump a bit and figure out which media server holds most of our data; this helps us make decisions such as assigning or removing SLP alternate readers, or changing the STU concurrent drives values, in order to give more resources to a specific media server.

                First we capture the storage unit name and fragment size from the $SLP_DUMP file and sum each fragment into an array, which lets us split the backlog by media server DSU or STU.

                printf "%-30s%s\n" "Media Server DSU" "Backlog Size"

                awk '{if ($2=="F") {print $9,$14} }' $SLP_DUMP | sort |

                awk '

                {

                        MED_LIST[$1]+=$2

                }

                Once the sum is done we go through the MED_LIST[] array and print, in TB, the total held under each array cell (the media server DSU name).

                END {

                                for(MED in MED_LIST)

                                printf ("%-30s%.2f TB\n", MED, MED_LIST[MED]/1024/1024/1024/1024) |"sort"

                }'

                printf "\n\n"

 

Output sample:

 

Media Server DSU                           Backlog Size

MedA-disk1-dsu1                          15.00 TB

MedB-disk1-dsu2                          15.00 TB

MedC-disk1-dsu3                          30.00 TB

MedD-disk1-dsu4                          60.00 TB

                This sample output tells us that MedD-disk1-dsu4 is holding half of the backlog and is probably our first place to troubleshoot. In my experience the first things to look at are: tape drive health, storage unit groups missing an STU/DSU (forcing the SLP to skip duplicating the data on the excluded STU/DSU), or backup policies over-utilizing the STU/DSU. There are lots of possibilities, but these are the ones I have found most common.
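
                As a quick first check, a couple of hedged commands help here (standard install paths assumed; MedD is the hypothetical suspect from the sample): vmoprcmd -d run on the media server shows drive status, and bperror can surface recent problem-level errors mentioning that host.

                /usr/openv/volmgr/bin/vmoprcmd -d                                                  # run on the suspect media server: drive and request status

                /usr/openv/netbackup/bin/admincmd/bperror -problems -hoursago 24 | grep -i MedD    # recent problem entries mentioning MedD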

Count total Backups in the last 24 hours

                It is impossible to know where we stand on backlog without knowing how much data we are pulling in; this is why we need to count how much data we backed up in the last 24 hours. This tells us whether the free space we have will be enough for another night of backups, and we can also compare it with another function, explained later, that shows how much data we are duplicating per day.

                Because we dump the last 24 hours of images with bpimagelist, we first need to know which NetBackup version we are on, since the output differs between NetBackup 6 and 7. Once we capture the version we know how far from the end of the line the SLP name column sits, and we let awk do the math.

                NBUVER=$(cat /usr/openv/netbackup/version | grep -i version | awk '{print $NF}' | awk -F. '{print $1}')

          

                if (( NBUVER == 7 )); then

                                IMG_SLP_COL=6

                else

                                IMG_SLP_COL=3

                fi

                The process is very similar to our previous functions: we dump the data and with awk arrays sum how much data we backed up in the last 24 hours, both in total and per SLP. The per-SLP figure is important because it lets us capture the SLP with the highest backup load.

                bpimagelist -l -hoursago 24 | grep -i ^IMAGE | awk '

                {

                                SLP_LIST[$(NF-SLPCOL)]+=$19   # SLPCOL variable stores KSH Variable $IMG_SLP_COL value.

                                TOTAL+=$19

                }

                Once the sum process is done under the awk main body we print results in awk END foot going through the SLP_LIST[] array showing each SLP total backup data.

                END {

                                printf ("%-30s%.2f TB\n\n", "Total 24hr Backup", TOTAL/1024/1024/1024)

                                printf ("%50s%27s\n", "Policy Name", "Backup Size")

                                for (SLP in SLP_LIST) {printf ("%50s%20.2f TB\n", SLP, SLP_LIST[SLP]/1024/1024/1024)}

                }' SLPCOL=$IMG_SLP_COL

               printf "\n\n"

 

Output sample:

 

SLP Name                                                                                                         Backup Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             4.00 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-3month-MedB-tape1-stu2                              1.00 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-3month-MedC-tape1-stu3                              2.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-9month-MedD-tape1-stu4                             2.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                 0.50 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-6month-MedB-tape1-stu2                              1.50 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-1year-MedC-tape1-stu3                                1.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-5year-MedD-tape1-stu4                                0.00 TB

                The idea is to get a picture of which SLPs are backing up the most data. For simplicity of the article, the SLP names only include STUs; there are no SUGs in this sample.

Count total duplications on a daily basis

                The next step is to know how much data we are duplicating. This can be tricky because we need to go through the jobs database and figure out how much data we have successfully moved per job.

                First we capture the list of successful duplication jobs and put them on a single line, separated by commas, to later use the list with the bpdbjobs command; this makes the search for successful jobs much faster than going through them one by one.

                printf "%-30s%s\n" "Date" "Duplicated Data"

                bpdbjobs -report | grep -i duplica | grep -i done | awk '{print $1}' | tr '\n' ',' | read JOBSLIST

                echo $JOBSLIST | wc -m | read JOBSCHARS

                ((JOBSNUM=$JOBSCHARS-2))

                echo $JOBSLIST | cut -c1-$JOBSNUM | read FINALJOBSLIST

                With the list ready we pull 2 columns from each job line, the Unix epoch date and the size of the job in KB; these two values are used to translate the date into a human-readable format (Mon-dd-yyyy) and to sum the successful duplications.

                bpdbjobs -jobid $FINALJOBSLIST -most_columns | awk -F, '{print $9,$15}' | while read UDATE SIZEKB

                do

                                RES=$(bpdbm -ctime $UDATE | awk '{print $4"-"$5"-"$NF}')

                                echo $RES $SIZEKB

                done | \

                With the date in human format we start to sum the written fragments into an array split by date, so we can print a history of the last 4-6 days. (The numbers for the oldest days can change as jobs get deleted from the Activity Monitor, which is why it is good to run this script daily and keep some history under the logs folder.)

                awk '

                {

                                DAYLIST[$1]+=$2

                }

With the list done, we just go through each array cell and print the sums in TB.

                END {

                                for (DAYDUP in DAYLIST)

                                                printf ("%-30s%.2f%s\n",DAYDUP, DAYLIST[DAYDUP]/1024/1024/1024, "TB" )

                }' | sort -n

                printf "\n\n"

Output sample:

Date                              Duplicated Data

Feb-23-2012                   0.05TB

Feb-24-2012                   2.10TB

Feb-25-2012                   4.30TB

Feb-26-2012                   5.54TB

Feb-27-2012                   5.58TB

Feb-28-2012                   4.23TB

Feb-29-2012                   0.39TB

               To know whether we are doing well or badly on duplications we need to know how many tape drives are available and what kind they are. For this sample we have 10 LTO4 drives shared across 4 media servers. With that said, we know we are way behind, because each drive should be able to move around 120 MB/sec (in a utopian world), and at the very least we should expect to move around 2-4 TB a day per drive; this means we probably have a bottleneck at the drive or media server level (we will discuss drive and media server performance troubleshooting in a second article; we first build a strong case and later make the right modifications).
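
               As a rough sanity check on those numbers (nothing measured, just the arithmetic behind the 2-4 TB/day expectation), a drive streaming constantly at 120 MB/sec moves close to 10 TB in a day, so 2-4 TB per drive already assumes the drive is busy and streaming only a fraction of the time. Against that, the roughly 5 TB/day shown in the sample duplication history for 10 drives is clearly a bottleneck.

               awk 'BEGIN {
                       perdrive = 120 * 86400 / 1024 / 1024                                 # ~9.9 TB/day at a constant 120 MB/sec
                       printf ("Theoretical per drive : %.1f TB/day\n", perdrive)
                       printf ("10 drives at 25%% duty: %.1f TB/day\n", perdrive * 10 * 0.25)
               }'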

                There is also always the possibility that the reading side (the disk array) of the duplication is the root cause of the bottleneck, but we check everything in the backup world first before we blame the SAN guys.

Top Clients Backlog

            For those cases where one day we have no backlog and suddenly, within a 24 or 48 hour window, we jump to a 20 TB backlog out of nowhere: this is normally because a client decided to dump a 10 TB database into a folder that is not part of the exclude list, which kills our space and increases the backlog. To detect this quickly we created a function that by default gives us the top 10 backlog clients, so we can engage that customer and see what actions can be taken to prevent a bigger impact.

            The function allows us to select the number of clients we want to print; the default is 10 but it can be any desired number.

            TOP=${1:-10}     # number of clients to print; defaults to 10 when no argument is given

 

            As soon as we print the header the logic is the same: we go through the SLP dump file, capture each client name as an array cell, and sum the fragment sizes under that client's cell.

      

            printf "%-30s%s\n" "Client Name" "Backlog Size"

            awk '$2=="F" {print $4,$14} ' $SLP_DUMP | tr '_' ' ' | awk '

            {

                        CLIENT_LIST[$1]+=$3

            }

 

            Once we have screened and captured all the clients we print them by walking the array; to get the clients with the biggest backlog we sort the list by the size value (second column) and tail the $TOP variable to print only the top 10, or whatever number was requested.

 

            END {

                        for (CLIENT in CLIENT_LIST)

                                    printf ("%-30s%.2f GB\n", CLIENT, CLIENT_LIST[CLIENT]/1024/1024/1024)

            }' | sort -nk 2,2 | tail -$TOP

            print "\n\n"

 

Output Sample:

 

            In our quick sample we print the top 5 clients with the biggest backlog, and we easily see that these clients represent at least 45% (55 TB) of our 120 TB backlog; not a bad place to start looking.

 

Client Name                       Backlog Size

Windows1                     2500GB

Unix1                            2500GB

Exchange1                    10000GB

MSSQL1                       15000GB

Oracle1                         25000GB

 

Count Images by Size Ranges

            Tuning the LIFECYCLE_PARAMETERS file can be a challenge if we don't have the right data, and guessing or changing values by trial and error does not go well with a backlog. Because of this, and to better understand what NBSTSERV is doing with the images, we need to know how many images we have and what their sizes are; with this info we can tune values like MIN_GB_SIZE_PER_DUPLICATION_JOB and MAX_GB_SIZE_PER_DUPLICATION_JOB, as we will see in the output sample explanation.

The code is quite simple: we establish the ranges we want to print and capture each image and its fragment sizes into an array. We later scan that array, compare each cell value with the hardcoded ranges, increase the matching range counter by one, and when the loop is done we simply print the image count per range.

 

        awk '

        $2=="I" {

                IMAGE=$4

        }

        $2=="F" && $4==IMAGE {

                IMGSUM[IMAGE]+=$14

        }

 

            Hardcoded range values:

 

        END {

                S100MB=104857600

                S500MB=524288000

                S1GB=1073741824

                S5GB=5368709120

                S10GB=10737418240

                S50GB=53687091200

                S100GB=107374182400

                S250GB=268435456000

                S500GB=536870912000

                S1TB=1073741824000

 

            The loop goes through the array values, compares them with the ranges we want, and increases a counter that we print later.

 

                for (IMGSIZE in IMGSUM) {

                        if (IMGSUM[IMGSIZE] <= S100MB)                                            S100MB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S100MB && IMGSUM[IMGSIZE] <= S500MB)         S500MB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S500MB && IMGSUM[IMGSIZE] <= S1GB)           S1GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S1GB   && IMGSUM[IMGSIZE] <= S5GB)           S5GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S5GB   && IMGSUM[IMGSIZE] <= S10GB)          S10GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S10GB  && IMGSUM[IMGSIZE] <= S50GB)          S50GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S50GB  && IMGSUM[IMGSIZE] <= S100GB)         S100GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S100GB && IMGSUM[IMGSIZE] <= S250GB)         S250GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S250GB && IMGSUM[IMGSIZE] <= S500GB)         S500GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S500GB && IMGSUM[IMGSIZE] <= S1TB)           S1TB_COUNT+=1

                        else                                                                  SM1TB_COUNT+=1

                }

          

                        printf ("Images Size Range      Image Count\n")

                        printf ("< 100MB                %d\n", S100MB_COUNT)

                        printf ("> 100MB < 500MB        %d\n", S500MB_COUNT)

                        printf ("> 500MB < 1GB          %d\n", S1GB_COUNT)

                        printf ("> 1GB   < 5GB          %d\n", S5GB_COUNT)

                        printf ("> 5GB   < 10GB         %d\n", S10GB_COUNT)

                        printf ("> 10GB  < 50GB         %d\n", S50GB_COUNT)

                        printf ("> 50GB  < 100GB        %d\n", S100GB_COUNT)

                        printf ("> 100GB < 250GB        %d\n", S250GB_COUNT)

                        printf ("> 250GB < 500GB        %d\n", S500GB_COUNT)

                        printf ("> 500GB < 1TB          %d\n", S1TB_COUNT)

                        printf ("> 1TB                  %d\n", SM1TB_COUNT)                  

        }' $SLP_DUMP

 

Output Sample:

Image Range            Image Count

< 100MB                     7000

> 100MB < 500MB    1000

> 500MB < 1GB          1500

> 1GB   < 5GB             800

> 5GB   < 10GB           500

> 10GB  < 50GB         500

> 50GB  < 100GB       300

> 100GB < 250GB      200

> 250GB < 500GB      50

> 500GB < 1TB           25

> 1TB                           20

            Two easy catches here are the 7,000 images smaller than 100 MB and the 20 images bigger than 1 TB. For the 7,000 small images I would first check the MIN_GB_SIZE_PER_DUPLICATION_JOB value in the LIFECYCLE_PARAMETERS file; if the value is too small it is very likely we are creating tons of duplication jobs with only 1 or 2 images each, and because they are so small the tape drives mount and dismount media every 10 minutes and never reach their maximum speed, falling into a potential "shoe-shine" effect. Increasing MIN_GB_SIZE_PER_DUPLICATION_JOB helps NBSTSERV process the images better and bundle them into a single job based on SLP, SLP priority, retention, source and destination; when all of these match, NBSTSERV batches multiple images into a single, bigger duplication job.

            For the large images there are more things to check: for example, compare each image with the top clients list output and see whether those clients own any of them. A second check is the SLP state of those images; since we have some NOT_MANAGED images, it could be that some of these big ones are stuck because they are corrupted. Also, tuning the MAX_GB_SIZE_PER_DUPLICATION_JOB value to fit more data onto a single tape could help improve each image's duplication.
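
            For reference, LIFECYCLE_PARAMETERS lives on the master server (typically /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS) and takes one parameter per line. The entries below are only a sketch of the format with illustrative values, not a recommendation; the right numbers depend on your image size distribution, and the supported parameters vary by NetBackup release, so check the SLP tuning documentation for your version first.

            MIN_GB_SIZE_PER_DUPLICATION_JOB 8
            MAX_GB_SIZE_PER_DUPLICATION_JOB 50
            MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 60
            DUPLICATION_SESSION_INTERVAL_MINUTES 5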

 

Count Backlog by SLP

                The last step of the report is to know which SLP holds most of the backlog, or how balanced the load is. With this we can probably modify a couple of SLPs and fix the issue, or assign more resources to the SLPs that carry most of the load.

                 The process is to list each SLP, dump its incomplete images, and do the corresponding math by summing all fragments. In this case we don't need an awk array because we already know which SLP we are working on; we only need to pass the SLP name into the printing part of the report.

                printf "%50s%27s\n" "SLP Name" "Backlog Size"

                nbstl -b | while read SLP

                do

                     nbstlutil list -lifecycle $SLP -image_incomplete | awk '

                     $2=="F" { SUM+=$(NF-2) }

                      END {

                                printf ("%50s%20.2f TB\n", awkSLP, SUM/1024/1024/1024/1024)

                      } ' awkSLP=$SLP

               done | sort

               printf "\n\n"

Output sample:

SLP Name                                                                                                         Backlog Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             10.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-6month-MedA-tape1-stu1                             0.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                5.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-5year-MedA-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-3month-MedB-tape1-stu1                              10.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-6month-MedB-tape1-stu1                              1.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-1year-MedB-tape1-stu1                                3.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-5year-MedB-tape1-stu1                                 1.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-3month-MedC-tape1-stu1                              30.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-6month-MedC-tape1-stu1                              0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-1year-MedC-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-5year-MedC-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1                             30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1                             25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                                5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                                0.00 TB

                Our data shows 4 SLPs with double-digit numbers, but the most interesting ones are:

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1               30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1               25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                  5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                  0.00 TB

                 Because 60 TB of the data is held by media server MedD, matching step 5 (Count Backlog held by Media Server), we now have more granular data: we know which SLPs to attack first, and we can also figure out why MedD is heavily used while MedA and MedB are on vacation. The second check point is MedC, with 30 TB clogged in the 3-month SLP.

                 The possibilities are huge, but with the final report we can catch some obvious issues. This only covers the first phase of SLP troubleshooting, which is knowing where the major problems are.

                 The final script is attached and can be used in Solaris environments; I haven't tried it on any other Unix/Linux platform, but it shouldn't be a problem, although it may need some slight modifications. If you have a different platform and the script fails, please let me know or upload the fix for your OS version.

                 Another note: if you don't have an AdvancedDisk configuration, just comment out the GetAdvDiskFreeSpace function, or adapt/create a new function for whatever you have, such as DataDomain or other 3rd-party vendor configurations.
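
                 If you only need to cover one extra disk type, a minimal replacement function could follow the same pattern as the free-space code above. The function name and the DataDomain storage type string below are just examples; substitute whatever your nbdevquery -liststs output reports.

        GetCustomDiskFreeSpace ()
        {
                printf "%-30s%s\n" "Disk Pool" "Free Space"
                $NBDEVQUERY -listdv -stype DataDomain -l | awk '
                {
                        SER[$2]+=$7          # same columns as the AdvancedDisk version: volume name and free space
                }
                END {
                        for (INC in SER) printf ("%-30s%.2f TB\n", INC, SER[INC]/1024)
                }'
        }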

Script Syntax:

        SYNTAX: BacklogCheck.ksh -a | -sSBbDFMph [-m <email>] [-C <NClients>] | [-c <Ndays>]

                -a:     Print Full Report

                -s:     Print Short Report (NO SLP's)

                -S:     Print, Count and Sum SLP Images State

                -B:     Print Total Backlog in TB

                -b:     Print last 24hr Backup Info split by SLP

                -c:     Delete log files older than N days based on user argument

                -C:     Get Top X Clients backlog where X is the desired top clients list

                -D:     Print Sum of Daily Duplications

                -i:     Print images count by size range

                -F:     Print DSU's Free Space

                -M:     Print Backlog held by Media Server

                -m:     Send Report to a Specified eMail

                -h:     Print this help.

        Sample: ./BacklogCheck.ksh -a -m darth.vader@thedarkside.com
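
        Since the NBU command paths are resolved inside the script, it can be scheduled straight from cron. A hypothetical crontab entry (the install path and mail address are placeholders) that runs the full report every morning could look like this:

        0 6 * * * /opt/scripts/BacklogCheck.ksh -a -m nbuadmin@example.com >/dev/null 2>&1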

Full Report Output:

Total Backlog                               120 TB

 

Media Server DSU                        Free Space

MedA-disk1-dsu1                          1.00 TB

MedB-disk1-dsu2                          1.00 TB

MedC-disk1-dsu3                         1.00 TB

MedD-disk1-dsu4                          1.00 TB

Total Free Space                           4 TB

 

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    15000          100.00 TB

NOT_MANAGED                   2000           10.00 TB

NOT_STARTED                   3000           10.00 TB

 

Media Server DSU                         Backlog Size

MedA-disk1-dsu1                          15.00 TB

MedB-disk1-dsu2                          15.00 TB

MedC-disk1-dsu3                          30.00 TB

MedD-disk1-dsu4                          60.00 TB

 

SLP Name                                                                                                         Backup Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             4.00 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-3month-MedB-tape1-stu2                              1.00 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-3month-MedC-tape1-stu3                              2.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-9month-MedD-tape1-stu4                             2.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                0.50 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-6month-MedB-tape1-stu2                              1.50 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-1year-MedC-tape1-stu3                                 1.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-5year-MedD-tape1-stu4                                 0.00 TB

 

Date                               Duplicated Data

Feb-23-2012                   0.05TB

Feb-24-2012                   2.10TB

Feb-25-2012                   4.30TB

Feb-26-2012                   5.54TB

Feb-27-2012                   5.58TB

Feb-28-2012                   4.23TB

Feb-29-2012                   0.39TB

 

Client Name                       Backlog Size

Windows1                     2500GB

Unix1                            2500GB

Exchange1                    10000GB

MSSQL1                       15000GB

Oracle1                         25000GB

 

Image Range            Image Count

< 100MB                     7000

> 100MB < 500MB    1000

> 500MB < 1GB          1500

> 1GB   < 5GB             800

> 5GB   < 10GB           500

> 10GB  < 50GB         500

> 50GB  < 100GB       300

> 100GB < 250GB      200

> 250GB < 500GB      50

> 500GB < 1TB           25

> 1TB                           20

 

SLP Name                                                                                                         Backlog Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             10.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-6month-MedA-tape1-stu1                             0.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                5.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-5year-MedA-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-3month-MedB-tape1-stu1                              10.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-6month-MedB-tape1-stu1                              1.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-1year-MedB-tape1-stu1                                 3.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-5year-MedB-tape1-stu1                                 1.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-3month-MedC-tape1-stu1                              30.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-6month-MedC-tape1-stu1                              0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-1year-MedC-tape1-stu1                                0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-5year-MedC-tape1-stu1                                0.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1                             30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1                             25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                                5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                                0.00 TB

 

Omar A. Villa

Netbackup Expert

These are my personal views and not those of the company I work for.

 

Comments (23)

Omar Villa:

Please post any comments or bugs under the script, and any improvements to the report output are very welcome.

 

Best Regards.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

mph999:

Looks good, will have  a play when I have time.

Nice to see a script written properly.

Martin

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Nicolai:

Nice Script !

The -lifecycle_only option is required if you have Data Domain or similar boxes. Otherwise, backups on those appliances are counted as backlog.

Running the script on a Linux box causes a cut error when running with the -e option (this may be caused by there being no backlog).

cut: invalid byte, character or field list

A yes from me.

 

Assumption is the mother of all mess ups.

If this post answered your question, please mark as a solution.

Nathan Kippen:

Just adding another reference for folks:

Troubleshooting Auto Image Replication ... a lot of stuff related to SLP

http://www.symantec.com/business/support/index?page=content&id=HOWTO42477

Also here is a link to NB 7.1 Best Practice - Using SLPs and AIR

http://www.symantec.com/business/support/index?page=content&id=TECH153154

 

 

 

Symantec Certified Specialist
(NBU 7.5.0.6)
Don't forget to vote or mark solution!

revaroo:

Excellent job. We need more of these informative posts.

Omar Villa:

Hi Guys,

       I appreciate the comments, and yes revaroo, more just came up. I just updated the article and uploaded a newer version of the script with a lot more functionality; please take a look. The new functions are:

Get local Disk Free Space (Any type of Disk)

Count images by SLP Status

Top Clients Backlog

Split Images Count by Size Ranges

 

All explanations are in the article.

Please let me know what you think and if you find any bug or improvements.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

Maurice Byrd:

I'm working in an all Windows environment.  Does anyone know how to troubleshoot backlog issues on Windows Server 2008?

Omar Villa:

It's pretty much the same concept, but you will need to develop the script for Windows; maybe PowerShell will do the job. Unfortunately I only have this for Unix/Linux and don't have a Windows environment to develop a Windows version, but the output of the report shows what you need to look for on the Windows side.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

huangj11:

Great,good job.

Welcome to my personal site, Quick Backup:
http://www.kbeifen.com/

Joe Despres:

Do you have a version for Linux [RH].....

Thanks.....

 

Joe Despres

Nicolai:

The script is written in Korn Shell - it works on Linux as well.

Assumption is the mother of all mess ups.

If this post answered your question, please mark as a solution.

Vinh La:

Omar,

I was going to ask you for this script :)

 

Thanks man.

 

Vinh.

Omar Villa:

Hi,

     It's been a while since the last update. I have a couple of improvements to the script and a bug fix; please take a look at the new version, 1.8.4.B, and let me know if you have any issues, bugs or comments:

 

Updates:

#        DATE: 06/05/2012 BY: Omar A Villa
#        MODIFICATION: Introduced GetMediaServerDups Function (Ver 1.8.1.B)
#        DATE: 07/15/2013 BY: Omar A Villa
#        MODIFICATION: Added GetLibraryBacklog Function (Ver 1.8.2.B)
#        DATE: 11/14/2013 BY: Omar A Villa
#        MODIFICATION: Fixed bug under ValidateFilesAndFolders function when script is run by first time (Ver 1.8.3.B)
#        DATE: 11/14/2013 BY: Omar A Villa
#        MODIFICATION: Improved SendMail function to identify mail or mailx commands (Ver 1.8.4.B)

 

Open code to see functions in case you want to see the details.

 

Best Regards.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

Omar Villa:

Hi,

       A couple of improvements have been added to the script; please check it out, and thanks to Kevin Good for his input on the GetSLPsBklog function improvement, a very fine piece of code he wrote.

 

#        DATE: 02/01/2014 BY: Kevin Good
#        MODIFICATION: Improved GetSLPsBklog algorithm. Print only SLP's with backlog (Ver 1.8.5.B)
#        DATE: 02/01/2014 BY: Omar A Villa
#        MODIFICATION: Added -k parameter in to main to only list SLP's with backlog (Ver 1.8.6.B)

 

Best Regards.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

backdfup:

Good read Omar!! I will put this to work for me.

 

DC Martin

HoldTheLine:

This is a really, really good script!  Thanks for not only putting in the work to do this but also for sharing.  I am seeing some odd output and am not sure if there is some customization that has to be done to get it working in some environments, for example:

 

Running with the -a switch for the full report I see this:

 

"Date                          Duplicated Data
Dec-31-1969                   4314.97TB"
 

 

@a                11.8195
S5                1.79069
 

the @a looks like a disk device, not sure what the S5 is but that adds up to the total backlog of about 12TB.  Any idea what that is supposed to be telling me? The time stamp of Dec-31-1969 is confusing to me as well.  This is on Linux RH if it matters.

 

Also, under 24 hour backup there is a heading for Policy name but it lists it not as the name of a policy but a number, in this case 0:

Total 24hr Backup             24.48 TB

                                       Policy Name                Backup Size
                                                 0               24.48 TB

 

Omar Villa:

Hi,

          I think I can explain and modify the script:

          1. about the output:

           "Date                          Duplicated Data
            Dec-31-1969                   4314.97TB"

             There is a bug in the script with NBU 7.5; this output is supposed to print the amount of data duplicated in the last 5 days, but for some reason the output is coming up with the oldest date NBU supports. I will take a look and update the script as soon as I can.

 

           2. about:

              @a                11.8195
              S5                1.79069

             Sorry, on this one I forgot to add the headers. These are the first 2 characters of your tape barcodes; the intention is to let you know which source library or VTL holds the backlog so you can focus on that. You are right: @a is your disk, which holds 11.8 TB, and S5 is a library whose tape barcodes start with S5, holding 1.79 TB of the backlog. In the next version I think I will need to print the full disk name for those cases where we have multiple instances.

 

            3. another bug:

              Total 24hr Backup             24.48 TB

                                       Policy Name                Backup Size
                                                 0               24.48 TB

              Let me check on this one, but I'm sure it is the same issue with bpdbjobs in NBU 7.5, which changed a bit; the columns moved and that is messing with the output.

 

As soon as I have everything fixed I will upload the script and post the output; I hope it doesn't take me too long.

 

Thanks a lot for your input on enhancing this script.

Best Regards.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

Omar Villa:

Hi,

    The script has been updated and fixed; please check for the new version. Here are the updates:

 

#        DATE: 03/06/2013 BY: Omar A Villa
#        MODIFICATION: Fixed bug on GetSLPsBklog that was reporting 0's for each SLP backlog size (Ver 1.8.7.B)
#        DATE: 04/02/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetDailyDups function using bperror instead of bpdbjobs fixing output bug (Ver 1.8.8.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetLibraryBacklog function; Introduced header and splited report by Disk or Barcodes first 2 chars (Ver 1.8.9.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetSLPLast24hrBkp function; Removed NBU version search steps (Ver 1.8.10.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Introduced Header to GetSLPsBklog function (Ver 1.8.11.B)

 

Any questions please let me know.

Regards.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

HoldTheLine:

Looking great so far, thanks!

 

Andrew Madsen:

Omar,

While you are at it:

 

slp_l4enbmed03_l4nbpdpa1_rep_l8nbpdpa1_logs_1mon 0.00 TB

slp_l8vnb5220a_l8nbpdpa1_rep_1mon 0.00 TB

slp_l4inb5220a-passthru-l4nbpbpa1_rep_l8nbpdpa1_2wks 0.00 TB

slp_l4enb5220a_l4nb5020a_l8nb5020a_1mon 0.00 TB

slp_l4enbmed02_rep_l8nbpdpa1_l4nb5020a_1mon 0.00 TB

slp_l4inb5220a-passthru-l4nbpbpa1_rep_l8nbpdpa1_1mon 0.00 TB

slp_l4enbmed03_rep_l8nbpdpa1_l4nb5020a_1mon 0.00 TB

slp_l4enbmed01_FS_rep_1mon 0.00 TB

slp_l4vnb5220b_l4nbpdpb1_rep_1mon 0.00 TB

slp_l4vnb5220a_l4nbpdpb1_rep_1mon 0.00 TB

slp_l4vnb5220b_l4nbpdpa1_rep_1mon 0.00 TB

slp_l4vnb5220a_l4nbpdpa1_rep_1mon 0.00 TB

slp_l8inb5220a_l8nbpdpa1_rep_l4nbpdpa1_2wks 0.00 TB

slp_l8inb5220a_l8nbpdpa1_rep_l4nbpdpa1_1mon 0.00 TB

slp_l4enbmed02_rep_l8nbpdpa1_c7nb5220a_dup_l4nb5020a 0.00 TB

slp_l4enbmed02_l4nbpdpa1_rep_l8nbpdpa1_logs_2Weeks 0.00 TB

Those values should be something besides 0.00 TB

The above comments are not to be construed as an official stance of the company I work for; hell half the time they are not even an official stance for me.

Omar Villa:

Hi Andrew,

            I think you might be running an old version of the script; in 1.8.6.B we fixed this, and now it only presents SLPs with backlog. If you are running the newest version, then those SLPs have a very small backlog and the math and printf are eating the value; their backlog is probably something like 0.001 TB. To confirm, you can go to the GetSLPsBklog function and modify this line from:

            printf ("%50s%20.2f TB\n", FoundSLP, SUM[FoundSLP]/1024/1024/1024/1024)

           TO

            printf ("%50s%20.2f TB\n", FoundSLP, SUM[FoundSLP]/1024/1024)

 

This will print in MB instead of TB.

 

Check it out and let us know.

Regards.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

Omar Villa:

Hi Everyone,

            I have some enhancements and fixed a big bug in the backlog count (the fix cuts the reported backlog size by about 50%): the script dump was also counting the Copy 1 data instead of only Copy 2 or higher. If you have more than 3 copies it is very likely you will need to customize the script a bit; please check out the modified functions and let me know if you have any questions.

 

#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Decommed GetMediaServerDups function (Ver 1.8.12.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Introduced GetMediaServerDupsAndSpeeds function (Ver 1.9.0.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Modified GetDailyDups function to print Average Speeds (Ver 1.9.1.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Enhanced GetTotalBklog function to print SLP Copies Backlogs (Ver 1.9.2.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Modified DataDumps function to fix bug that was doubling backlog size (Ver 1.9.3.B)

 

Have a good one.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 

Omar Villa:

A few slight updates:

 

#        MODIFICATION: Modified DataDumps Function to fix bug for cases where there is only 1 dup copy (Ver 1.9.4.B)
#        DATE: 04/23/2014 BY: Omar A Villa
#        MODIFICATION: Enhanced CleanLogs Function adding an array and loop that will go through all logs names (Ver 1.9.5.B)
#        DATE: 04/24/2014 BY: Omar A Villa
#        MODIFICATION: Introduced GetOS function to help with those commands with syntax differences (ver 1.10.0.B)

 

Enjoy.

Omar Villa

Netbackup Expert

Twiter: @omarvillaNBU

 
