Three ‘Simple’ Complexities of Data Protection
Data is the foundation of information; information leads to knowledge, and knowledge is power. There can be little disagreement, then, that data has value. Digital data has become the new world currency, and protecting these valuable assets is a central concern of business continuity management. Data loss, data unavailability and data corruption all have an adverse economic impact on the organization. Not only do we need to ensure that data is usable and available, but we also need to ensure that it is protected from unauthorized use. Protecting digital data doesn’t sound particularly challenging; it typically begins with a simple task: make an extra copy. Making and managing these extra copies, however, remains one of the most common pain points for any organization.
There are three fundamental aspects of data protection that lead to increasing complexities.
- There are lots of data-copy options. These options vary in both cost and functionality; there is no one-size-fits-all solution. To be optimal, a data-copy solution should be proportionate to the value of the data it protects.
- There are lots and lots of digital data to copy and protect. Digital data changes fast and often, and there simply isn’t enough budget to make a backup copy of every piece of data every time it changes. Even though making backup copies is expensive, we can’t afford to lose the important stuff.
- Digital data lacks governance information. Data can’t be protected if we don’t know who owns the data or how the data is being used. Lack of this fundamental information about data increases both operating cost and operational risk. Operating costs increase when irrelevant data is copied unnecessarily or data is copied more often than required; the consequence of not copying mission-critical data because it was not identified, however, could be catastrophic.
Protecting digital data means that we have to make the right choice about the right solution for the right data.
RANGE OF DATA COPY METHODS
One reason that data protection is complex is that several techniques, each with multiple variations, can be used to create copies of data. Backup refers to copies that are created at a specific point in time or in response to a specific event, ensuring that there is a secondary copy of critical data and providing a consistent point from which to recover the data. At that specific time, or because of that specific event, all the data is essentially re-written to another location. Backup is well established, and tape backup has been the historic cornerstone of data protection. Tape backup had two notable qualities: it was inexpensive and it was portable. Backup tapes could be transferred and stored in an alternate location for a nominal cost.
These backup copies can also be versioned, which refers to creating copies in response to an event, such as when a file has changed significantly. Versioning also refers to managing multiple point-in-time copies of data sets throughout their lifecycle. Versioning is used to minimize recovery time by increasing the number of intermediate checkpoints from which the data can be recovered. File versioning products can be thought of as providing an “undo” function at a file level.
Continuous Data Protection (CDP) refers to a class of mechanisms that continuously capture or track data modifications, enabling recovery to any previous point in time. Technologies referred to as “near CDP”, which take frequent snapshots, are included under Versioning above. This category covers “true CDP” solutions, which enable recovery of a data set as it existed at any previous point in time.
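The idea behind “true CDP” can be illustrated with a small sketch. This is a toy model, not how any particular product works: the class name ChangeJournal, its methods and the integer timestamps are all assumptions made for illustration. Every write is journaled with its time, and the data set as of any moment is rebuilt by replaying the journal up to that moment.

```python
class ChangeJournal:
    """Toy CDP journal (hypothetical): every write is kept with its timestamp."""

    def __init__(self):
        # Entries are (timestamp, key, value) tuples, stored in time order.
        self._entries = []

    def record(self, timestamp, key, value):
        # Reject out-of-order writes so replay order equals time order.
        if self._entries and timestamp < self._entries[-1][0]:
            raise ValueError("timestamps must be non-decreasing")
        self._entries.append((timestamp, key, value))

    def state_at(self, timestamp):
        """Rebuild the data set as it existed at `timestamp` by replaying writes."""
        state = {}
        for ts, key, value in self._entries:
            if ts > timestamp:
                break  # later writes did not exist yet at the requested time
            state[key] = value
        return state


journal = ChangeJournal()
journal.record(1, "report.doc", "draft")
journal.record(5, "report.doc", "final")
journal.record(7, "budget.xls", "v1")

print(journal.state_at(6))  # {'report.doc': 'final'}
```

Because nothing is ever discarded, any timestamp is a valid recovery point; the price is a journal that grows with every change, which is one reason such solutions cost more than periodic copies.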
Data Replication refers to a process used to continuously or periodically maintain a secondary copy of data. Data protection is just one of its many uses.
A point-in-time copy can also be classified according to whether it is a Full Copy (also known as a clone or frozen image) or a Changed Block Copy (also known as a delta copy or pointer copy), as well as by the method used to create the copy. Full copies are created by using Split Mirror (copy performed by synchronization; Point in Time (PiT) is split time) or Copy On First Access (PiT is replica initiation; copy performed in the background) techniques. Changed block copies are created using Copy on Write (original data is used to satisfy read requests for both the source copy and unmodified portions of the replica; updates to the replica are made to a save area) or Pointer Remapping (pointers to the source and replica are maintained; updated data is written to a new location and the pointers for that data are remapped to it) techniques. Note that while the term Snapshot is used by many companies as a synonym for a changed block copy, snapshot may also be used to mean any type of point-in-time copy, including full copies; in fact, this is how SNIA defines the term. SNIA uses the term “delta snapshot” to refer to changed block copies.
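A changed block copy is easier to picture with a small model. The sketch below is a simplified illustration of Copy on Write, and the names Volume and CowSnapshot are hypothetical. It assumes the common variant in which the original block is copied to the save area just before the source overwrites it, alongside the behaviour described above in which updates to the replica also go to the save area. Unmodified blocks are always read from the source.

```python
class CowSnapshot:
    """Hypothetical changed block copy: stores only blocks that differ from the source."""

    def __init__(self, source):
        self._source = source
        self._save_area = {}  # block index -> data held for this replica

    def preserve(self, index, original):
        # Keep the point-in-time block unless the replica already holds its own copy.
        self._save_area.setdefault(index, original)

    def write(self, index, data):
        # Updates to the replica go to the save area, never to the source.
        self._save_area[index] = data

    def read(self, index):
        # Unmodified portions of the replica are served from the source data.
        if index in self._save_area:
            return self._save_area[index]
        return self._source.read(index)

    def saved_blocks(self):
        return len(self._save_area)


class Volume:
    """Hypothetical source volume made of fixed-size blocks."""

    def __init__(self, blocks):
        self._blocks = list(blocks)
        self._snapshots = []

    def snapshot(self):
        snap = CowSnapshot(self)
        self._snapshots.append(snap)
        return snap

    def read(self, index):
        return self._blocks[index]

    def write(self, index, data):
        # Copy on write: save the original block before overwriting it.
        for snap in self._snapshots:
            snap.preserve(index, self._blocks[index])
        self._blocks[index] = data


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")

print([vol.read(i) for i in range(3)])   # ['a', 'B', 'c']
print([snap.read(i) for i in range(3)])  # ['a', 'b', 'c']
print(snap.saved_blocks())               # 1
```

Only one of the three blocks was ever copied, which is why changed block copies are cheap to create. The same property makes them dependent on the source: if the source volume is lost, the replica is lost with it, unlike a full copy.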
Replicas may be classified by the distance over which replication is performed (local or remote) as well as where the replication is performed (in the host, network, or storage array).
Synchronous replication is used at local distances where zero RPO is a requirement. Synchronous replication is a technique in which data is committed to storage at both the primary and secondary locations before the write is acknowledged to the host. Asynchronous replication is a technique in which data is committed to storage at only the primary location before the write is acknowledged to the host; data is then forwarded to the secondary location as network capabilities permit. Asynchronous replication may be used to replicate data to “remote” locations, where network latency would adversely impact the performance of synchronous replication.
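The difference between the two techniques comes down to when the write is acknowledged. The following sketch models that ordering only; the classes Site, SynchronousReplicator and AsynchronousReplicator are illustrative assumptions, and real products add networking, ordering guarantees and failure handling that are left out here.

```python
from collections import deque


class Site:
    """Hypothetical storage location that simply remembers committed writes."""

    def __init__(self):
        self.data = {}

    def commit(self, key, value):
        self.data[key] = value


class SynchronousReplicator:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, key, value):
        # Commit at both locations before acknowledging the host.
        self.primary.commit(key, value)
        self.secondary.commit(key, value)
        return "ack"


class AsynchronousReplicator:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()  # writes not yet forwarded to the secondary

    def write(self, key, value):
        # Commit at the primary only, acknowledge, and forward later.
        self.primary.commit(key, value)
        self.pending.append((key, value))
        return "ack"

    def drain(self):
        # Forward queued writes in order, as network capacity permits.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary.commit(key, value)


sync = SynchronousReplicator(Site(), Site())
sync.write("order-1", "paid")
print(sync.secondary.data)   # {'order-1': 'paid'}

lazy = AsynchronousReplicator(Site(), Site())
lazy.write("order-1", "paid")
print(lazy.secondary.data)   # {}
print(len(lazy.pending))     # 1
lazy.drain()
print(lazy.secondary.data)   # {'order-1': 'paid'}
```

Whatever sits in the pending queue when the primary site fails is the data that would be lost, which is why asynchronous replication cannot promise a zero RPO. The synchronous version never has such a queue, but every host write waits for the round trip to the secondary.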
Host-based replication includes logical volume mirroring, file system snapshots, log shipping, and file synchronization capabilities that support one-to-one, one-to-many, and many-to-one configurations. Log shipping is the process of automating the backup of a database and its transaction log files on a production server, and then restoring them onto a standby server. Network-based replication includes storage virtualization solutions that support cloning. Storage array-based replication solutions are the most popular.
THE AMOUNT OF DATA TO PROTECT
There is certainly a great deal of digital data today. This ever-increasing avalanche of digital data overwhelms an organization’s ability to manage and protect it, and adds to the complexity of doing so. In their book Abundance, Peter H. Diamandis and Steven Kotler put the enormous size of digital data into perspective: “If we digitized every word that was written and every image that was made since the beginning of civilization to the year 2003, the total would come to five exabytes. By 2013, we will be producing five exabytes of data every 10 minutes.”
This sheer volume of data creates issues for business continuity, especially relative to traditional backup methods for data protection. Given the very large data volumes, traditional backup will simply not meet the restore times needed to support technology-intensive organizations in the event of a service interruption. The growth of disk-to-disk, disk-to-disk-to-tape, virtual tape, incremental backup and data de-duplication techniques has provided some relief from the shrinking ‘backup window’ problem, i.e. how to complete a point-in-time recovery point while an application is online 24 hours a day, every day. Most backup solutions were focused on the front-end issue of the backup window: how to complete a backup. These solutions do little to address the flip side of the issue: how to restart a business operation quickly in the event of an interruption. The time to recover data starts to become a limiting concern. While keeping a valid copy of the data is vital, accessing the data is critical to sustaining the operations of the organization.
It doesn’t matter whether you can’t access data because there isn’t a valid copy or because a protracted restore is still running; the consequence to the organization is the same: unavailability. All that matters to the business is that the data can’t be accessed. So rather than asking how to finish the backup on time, IT should ask how to restart business operations as quickly as possible; the focus should be on the recovery time, not on the time it takes to back up the data. The critical problem is restoring access to the data as quickly as possible, not just backing up the data. While backing up is a key component, restoring business operations is the primary objective. Missing a backup might be a hardship; missing a recovery would be a disaster.
Given some of the solutions implemented to address the front-end time constraints, a restore that has to reassemble the various copies of data could take more than ten times longer. Why so time intensive? Because it takes time to locate all the various incremental backups, and then it takes additional time to organize and synchronize them to get a consistent point-in-time view.
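The extra restore work can be seen in a minimal sketch. The layout below is an assumption made for illustration: the function name restore and the fields taken_at, changed and deleted do not come from any particular backup product. A restore has to start from the last full copy, put every incremental in the right order and replay each one before a consistent view exists.

```python
def restore(full_backup, incrementals):
    """Rebuild a consistent point-in-time view from a full copy plus incrementals."""
    state = dict(full_backup)  # start from the last full copy

    # Incrementals may be found in any order, so organize them by time first.
    for inc in sorted(incrementals, key=lambda item: item["taken_at"]):
        state.update(inc["changed"])   # apply new and modified files
        for name in inc["deleted"]:    # then drop files removed since the last copy
            state.pop(name, None)
    return state


full = {"a.txt": "v1", "b.txt": "v1"}
incrementals = [
    {"taken_at": 2, "changed": {"a.txt": "v3"}, "deleted": ["b.txt"]},
    {"taken_at": 1, "changed": {"a.txt": "v2", "c.txt": "v1"}, "deleted": []},
]

print(restore(full, incrementals))  # {'a.txt': 'v3', 'c.txt': 'v1'}
```

In this model a.txt is written three times during the restore even though only its last version matters, and the result is wrong if a single incremental is missing or applied out of order. Each incremental shortened the backup window when it was taken, and each one lengthens the restore.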
So backing up everything is no longer practical, and might not be the best option anyway, because it takes too long to restore from full copies of the data. Maintaining an identical duplicate may be a better choice: simply write the data twice, at the same time. No need to restore. The trade-off is that replicating data is much more expensive in terms of additional workload and storage capacity. Portability becomes an obstacle as well, which also has related cost implications.
WHOSE DATA IS IT?
It is estimated that 80% of all potentially usable business information originates in an unstructured form. Unstructured data simply means data that is not held in a structured data model such as a relational database. This includes Microsoft Office documents, photos, music, video files, log files, etc. Companies have thousands upon thousands of digital photos and videos, Word documents, Excel spreadsheets and PowerPoint files floating around, usually in multiple copies. For most organizations the challenge in protecting unstructured data is its lack of identity.
IT is often tasked with data governance objectives: to protect the data, reduce risk, reduce costs, improve efficiency and achieve compliance. The main challenge in realizing these objectives is that information about data ownership and usage patterns is non-existent. Seemingly simple choices in protecting data, such as granting permission to view a confidential file or identifying the files critical for recovery, are nearly impossible without some basic information. To make informed choices about data protection it is necessary to know the data’s heuristics, its owners, and how the data is being used.
Protecting data has never been more important. Data is one of three irreplaceable corporate assets, along with the other two intangible assets of life and time. That said, not all data is created equal, nor is every piece of data irreplaceable. Data, like risk, comes in gradations. And like risk, the cost of copying data needs to be in balance with the benefits that it provides. After all, the only rational reason to spend money is that there is a benefit.