
Disk volume is down(2074). Dedupe DB and dedupe data sharing the same file system

Created: 13 Mar 2013 • Updated: 19 Aug 2013 | 8 comments
This issue has been solved. See solution.

Hi,

Randomly, my Red Hat media server (NetBackup 7.5.0.4) goes down with the "Disk volume is down(2074)" error message. It happens on Fridays (but not all of them). Usually it gets fixed by bringing the pool up with

    /usr/openv/netbackup/bin/admincmd/nbdevconfig -changestate -stype PureDisk -dp dbPoolBO -dv PureDiskVolume -state UP

or by rebooting the server.
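For reference, the volume state can be checked beforehand with nbdevquery (a sketch; the exact output format varies between NetBackup versions, so the grep pattern is an assumption to adjust to your output):

    /usr/openv/netbackup/bin/admincmd/nbdevquery -listdv -stype PureDisk -U | grep -i state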

I have read that this problem is related to disk I/O and some timeouts. That makes sense to me, as the problem only happens on Fridays, when the full backups run. To deal with the I/O, I have also reduced the maximum number of concurrent jobs.

The deduplication database is located in the same file system as the deduplication data, so maybe this is one of the causes. Using the iotop tool, I have seen that Postgres processes are reading/writing intensively in the same file system.

I wonder if the Postgres database can be moved to another file system to reduce the I/O on the deduplication data file system. Is it safe? Is it worth it?

Another problem indicator I have seen is the /deduplication/databases/pddb/postgresql.log logfile. There are many lines suggesting a parameter to be tuned:

<2013-03-13 05:31:55 CET>LOG:  checkpoints are occurring too frequently (16 seconds apart)

<2013-03-13 05:31:55 CET>HINT:  Consider increasing the configuration parameter "checkpoint_segments".

So maybe the Postgres database is not behaving as expected. Is it safe to tune this parameter?
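For reference, the current value can be checked in the database configuration file (a sketch; I am assuming postgresql.conf sits next to the logfile above, in the pddb data directory):

    grep checkpoint_segments /deduplication/databases/pddb/postgresql.conf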

These are only some thoughts, and maybe I am pointing in the wrong direction. Please feel free to ask for more logs, etc.

Thanks for your time

Best Regards

8 Comments

Nicolai:

There may be a workaround until the I/O issue is fixed. See solution 8.

http://www.symantec.com/docs/TECH156743


Juasiepo:

We tried solution 8, but it didn't work: jobs didn't fail but seemed frozen.

Thank you for your time

Best regards

Juasiepo:

Hi all,

Coming back to the postgresql.log logfile:

<2013-03-13 05:31:55 CET>LOG:  checkpoints are occurring too frequently (16 seconds apart)

<2013-03-13 05:31:55 CET>HINT:  Consider increasing the configuration parameter "checkpoint_segments".

It seems that having the checkpoint_segments parameter too low is a performance killer and may be causing the Disk volume failure, as it gets blocked waiting for I/O:

http://comments.gmane.org/gmane.comp.db.postgresql.performance/28859

14.4.6. Increase checkpoint_segments

Temporarily increasing the checkpoint_segments configuration variable can also make large data loads faster. This is because loading a large amount of data into PostgreSQL will cause checkpoints to occur more often than the normal checkpoint frequency (specified by the checkpoint_timeout configuration variable). Whenever a checkpoint occurs, all dirty pages must be flushed to disk. By increasing checkpoint_segments temporarily during bulk data loads, the number of checkpoints that are required can be reduced.

Two days ago, the iotop output for the deduplication disk showed the postgres database putting more I/O on the disk than the spoold process.

After increasing the checkpoint_segments parameter from 10 to 32 and reloading the database configuration, the errors in the postgresql.log logfile disappeared, and iotop showed the I/O from the database reduced to almost zero. Backups also seemed to run faster, but of course it is necessary to wait several days to see whether this behaviour is constant.
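For anyone wanting to try the same change, this is roughly what was done (a sketch: the postgresql.conf path is assumed from the log location above, and reloading via SIGHUP is standard Postgres behaviour, not an MSDP-specific procedure):

    # In /deduplication/databases/pddb/postgresql.conf (path assumed):
    checkpoint_segments = 32    # previously 10

    # Tell Postgres to re-read its configuration without a restart,
    # e.g. by sending SIGHUP to the postmaster process:
    kill -HUP <postmaster_pid>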

Has anyone tried to tune the Postgres database? Similar experiences?

Best regards

Juasiepo:

Hi,

The problem seemed to be located in the disk I/O, but the database tuning was not enough. Limiting the I/O streams on the disk pool seems to solve the problem; it was previously set to unlimited.

https://www.symantec.com/business/support/index?page=content&pmv=print&impressions=&viewlocale=&id=HOWTO70491

Limit I/O streams

Select to limit the number of read and write streams (that is, jobs) for each volume in the disk pool. A job may read backup images or write backup images. By default, there is no limit. If you select this property, also configure the number of streams to allow per volume.

When the limit is reached, NetBackup chooses another volume for write operations, if available. If not available, NetBackup queues jobs until a volume is available.

Too many streams may degrade performance because of disk thrashing. Disk thrashing is excessive swapping of data between RAM and a hard disk drive. Fewer streams can improve throughput, which may increase the number of jobs that complete in a specific time period.

30 I/O streams are working fine for my configuration, and no more Disk volume is down(2074) errors have appeared for the moment.
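For the record, the same limit can apparently also be set from the command line; a sketch using nbdevconfig (the -maxiostrm option should set the maximum I/O streams per volume, but verify against your version's command reference; pool name reused from the first post):

    /usr/openv/netbackup/bin/admincmd/nbdevconfig -changedp \
        -stype PureDisk -dp dbPoolBO -maxiostrm 30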

Hope it helps anyone having the same problem.

Juasiepo:

Hi,

After several days of working OK, the same problem appeared again. As per support's indication, I am trying the TECH156490 technote:

Status 213: PureDisk Volume is repeatedly being marked down and up automatically.

Regards

Juasiepo:

Hi,

After upgrading to 7.5.0.6, it seems that the Disk volume is down(2074) error has disappeared.

Even with higher values of Limit I/O streams or Maximum number of concurrent jobs.

But it is still under investigation, as we only upgraded last Saturday.

Regards

SOLUTION
belelder7:

I think your system needed to be upgraded, and it is important that your jobs are not limited, so that this does not disrupt your work. Maybe there are also applications that are loading your system heavily with files that are not needed and should be removed.