Time vs. Data
Yesterday I was talking with a Symantec user about the decisions you have to make when picking how to recover from a failure. Like most companies, they had a whole slew of options, from clustering and high availability to replication, snapshots and tape. Most people we talk with have some idea of how long they can tolerate being down after a failure and how much data they are willing to lose. But these two things (time and data lost) are more related than most people think, and this user understood that especially well, particularly when it comes to applications.
It really is one of the biggest problems applications have when you try to back them up. They have to be stopped or paused, since some data may be in memory or in logs that haven't been fully written to disk. So most backup apps have a "hot backup" mode or quiesce (I can never spell that word) that lets you flush the application out so it can be backed up in a known good state while it keeps running. In a perfect world you could do this all the time, which would mean you could back it up at any moment and have instant recovery, since it would always be in a clean state. It's not a perfect world, however, and these apps take a performance hit when running in backup mode. So people schedule these backups at some regular frequency.
When the app goes down, it is most likely not in a clean state. People call this state crash-consistent. You can still recover with nearly zero data loss, but you may have to replay some logs and do some manual steps outside of the NetBackup UI. In other words, your recovery takes longer. If you're restoring from an application-consistent state, though, it's clean and comes up right away, no mucking around. That's quicker recovery at the price of data loss, because you lose whatever was written after that backup. This is the trade-off people need to make when recovering:
- Pick the last good clean app-consistent backup and live with some data loss
- Pick the crash-consistent time so that (nearly) no data is lost, but it may take a little longer to get back to a running state
This user and I discussed a lot of the situations where you may want to go back and forth between the two options. Say it's high noon and your ordering system goes down. Every second you can't ring up orders is a lot of money lost, so you may favor option #1 (and recover the lost data separately at night or during scheduled downtime). But if that same failure occurred in the middle of the night, you may go for option #2, since getting the data fully back matters more than being offline a little longer.
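To make that back-and-forth concrete, here is a rough sketch in Python of how you might weigh the two options. Everything in it is made up for illustration: the names (RecoveryOption, outage_cost, choose), the recovery times, and the dollar figures are assumptions, not numbers from NetBackup or from this user. It also treats the crash-consistent option as zero data loss for simplicity, where in practice it is "nearly zero."

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    recovery_minutes: float   # time until the app is running again
    data_loss_minutes: float  # minutes of work lost by choosing this option


def outage_cost(option, downtime_cost_per_min, data_loss_cost_per_min):
    """Total cost = cost of being offline + cost of the data you give up."""
    return (option.recovery_minutes * downtime_cost_per_min
            + option.data_loss_minutes * data_loss_cost_per_min)


def choose(options, downtime_cost_per_min, data_loss_cost_per_min):
    """Pick the option with the lowest total cost for the current situation."""
    return min(
        options,
        key=lambda o: outage_cost(o, downtime_cost_per_min, data_loss_cost_per_min),
    )


# Hypothetical numbers: the clean backup is two hours old but comes up fast;
# the crash-consistent point loses nothing but needs log replay and manual steps.
APP_CONSISTENT = RecoveryOption("app-consistent backup", recovery_minutes=10, data_loss_minutes=120)
CRASH_CONSISTENT = RecoveryOption("crash-consistent point", recovery_minutes=90, data_loss_minutes=0)
OPTIONS = [APP_CONSISTENT, CRASH_CONSISTENT]

# High noon: every minute offline is expensive, so option #1 wins.
print(choose(OPTIONS, downtime_cost_per_min=1000, data_loss_cost_per_min=50).name)

# Middle of the night: downtime is cheap, so option #2 wins.
print(choose(OPTIONS, downtime_cost_per_min=10, data_loss_cost_per_min=50).name)
```

The point of the sketch is only that the same two options can flip order depending on what a minute of downtime costs at that moment. Real decisions will have more inputs than two cost rates.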
The point is, think about this the next time you write an SLA or are in a situation where you have to choose how to recover. NetBackup has a lot of options, and we'll openly tell you which ones are closer to #1 or #2, but we try to offer both ends of the spectrum so you can choose what works best for your situation.