Intelligent Archive Needed

With all of the terminology thrown around whenever the topic of data management is discussed, it is easy to lose sight of the key ingredient of an effective data management practice:  intelligent archive.  Intelligent archive is the centerpiece of any data management strategy, but it finds itself awash in a morass of confusing vendor-speak.

For a variety of business reasons, companies must hold on to certain data for a period of time based on business context, regulatory mandates, or for reasons of historical value.  Intelligent archive is the solution.

Note the modifier, "intelligent."  It is important.  Archives are not assemblages of anonymous data.  They are not simply a copy of backup tapes.  Intelligent archive connotes a system by which specific data is selected and written to appropriate media, then stewarded over the years - migrated between different storage media and refreshed or transformed to work with contemporary software - until the data can be safely deleted.

The argument could be made with high confidence (and lots of empirical statistics) that no other technology strategy has as much potential to deliver real business value – in terms of cost savings, risk reduction and improved productivity – as archive does. And no other strategy can offer as much technical ROI, whether measured from the standpoint of application performance improvement or machine/network/storage resource utilization efficiency, as a well-defined program of active data archiving.

That is one reason why we watched with interest the recent formation of a new industry initiative called the Active Archive Alliance, which was launched by a collection of vendors including Spectra Logic, FileTek, Compellent, and QStar Technologies. Aside from the fact that the organization appears to be a true collaboration of otherwise competing vendors (something rare these days), it stood out for historical reasons.  When last quoted on this topic, the Storage Networking Industry Association (SNIA) was both promoting archive (having launched an Information Lifecycle Management forum) and discouraging firms from pursuing such a strategy with survey findings in 2007 claiming that the tools for archive just weren’t there. The Optical Storage Technology Group was making more noise debating the relative merits of HD DVD and Blu-ray disc technology than it was addressing the potential application of optical media to long-term archive. A tape vendor collaboration orchestrated by Quantum was gobbled up by SNIA because the latter didn’t want the competition for sponsorship dollars. And EMC was running an expensive “shatter the platter” campaign to drive folks from optical storage to its burgeoning content addressable storage (CAS) platform, Centera.

No one was speaking for archive technology specifically. In the vacuum, old concepts became etched in stone. How often did we hear the truism that archive can’t work in distributed computing environments because (1) the requisite network bandwidth isn’t there to transport data between disk and tape/optical storage in an efficient way, (2) storage targets aren’t shared, and (3) users control the data and don’t participate in classification schemes that could be used to write archive policies? Despite the fact that these truisms are no longer necessarily true, they are still widely held beliefs. Lack of coherent messaging around archive has seen the term “archive” confused with everything from “tiered storage architecture” to “hierarchical storage management” to “information lifecycle management” (not the architecture, but rather the marketing concept proffered by EMC in the early Aughties). Clearly, this cloud of confusion and misperception needs to be addressed by someone, and the Active Archive Alliance seems poised to make it part of their mission. To help them get started, we offer the following clarifications.

Tiered Storage is Not Archive

Tiered storage architecture traces back to the earliest days of mainframe computing: data was first staged to memory, a precious and expensive commodity, and was quickly migrated to disk-based subsystems (Direct Access Storage Devices or DASD). Given the small capacities offered by refrigerator-sized DASD, data was migrated off of disk and onto tape as rapidly as possible. Each storage device or set of devices constituted a distinguishable tier of storage, discriminated in part by the I/O performance of the device.

In distributed systems, storage tiers have aligned with the price/performance/capacity of different storage products. Vendors offer “enterprise” arrays that feature high speed/low capacity/high cost disk costing from $80 to $180 per GB. These products are referred to as “tier one” storage in the industry vernacular and they usually feature on-array “value add” software functionality such as RAID, mirror splitting, array-to-array replication, and other functions that enable the vendor to sell a finished array (essentially a box of commodity disk) at a significant mark-up over the cost of the disks and chassis alone.

Tier 2 storage consists of arrays featuring lower performance/high capacity/lower cost disk. Touting a price point of between $40 and $140 per GB, these arrays are designed to provide mass storage capacity for less frequently accessed data. Tier 3 storage usually refers to tape or optical disc technologies, where media costs hover between $0.44 and $1.50 per GB. Access speeds are significantly slower than those of either Tier 1 or Tier 2 devices, so Tier 3 media are primarily used to host historical archive or backup data.
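To put the per-GB figures quoted above in perspective, here is a minimal arithmetic sketch. The price ranges come straight from the text; the 1,000 GB dataset size and the names TIER_PRICE_PER_GB and cost_range are ours, chosen only for illustration.

```python
from typing import Dict, Tuple

# Per-GB price ranges in USD, as quoted in the text above.
TIER_PRICE_PER_GB: Dict[str, Tuple[float, float]] = {
    "tier1": (80.00, 180.00),
    "tier2": (40.00, 140.00),
    "tier3": (0.44, 1.50),
}


def cost_range(tier: str, gigabytes: float) -> Tuple[float, float]:
    """Return the (low, high) cost in USD of hosting `gigabytes` on `tier`."""
    low, high = TIER_PRICE_PER_GB[tier]
    return (low * gigabytes, high * gigabytes)


if __name__ == "__main__":
    dataset_gb = 1000  # hypothetical dataset size, not a figure from the text
    for tier in TIER_PRICE_PER_GB:
        low, high = cost_range(tier, dataset_gb)
        print(f"{tier}: ${low:,.0f} to ${high:,.0f}")
```

Even at the low end of each range, the gap between Tier 1 and Tier 3 for the same data is better than two orders of magnitude, which is why the question of what data sits where matters so much.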

Recently, the tiered storage metaphor has been revisited by leading hardware vendors, both as a backdrop for discussing “hybrid disk systems” (that blend Tiers 1 and 2 or Tiers 2 and 3 in the same box) and as a way to introduce “Tier 0” technologies, as solid state disk (SSD) has come to be termed. Hybrid disk systems combine different drive technologies in the same chassis and implement a typically simplistic form of hierarchical storage management (HSM) software in the array controller, which serves both to increase the costs of the drives in the array (and the profit to the vendor) and to establish some rudimentary mechanism for moving data over set time intervals between the different disk types in the array. (We discuss HSM below in detail.)

Solid state disk (SSD) is increasingly an offering with “enterprise” arrays, especially as low cost Flash memory is leveraged to reduce the price of SSD from thousands of dollars per GB to hundreds of dollars per GB. As in the earlier case of mainframe memory, SSDs are so costly that data generally needs to be pushed out of these components and on to Tier 1 or 2 as quickly as possible, requiring a simple HSM algorithm.

Bottom line: Tiered storage is not a new concept, nor is it in any way the equivalent of archive or intelligent data management. It is simply a storage architecture model: a way of interconnecting various storage devices across which data management processes can operate. Unfortunately, it is misrepresented by some vendors as an alternative to managing data and archiving it according to its business context.

Hierarchical Storage Management is Not Archive

Hierarchical Storage Management (HSM) is, as suggested above, a software-based technology intended primarily to support the efficient allocation of tiered storage capacity. The focus of HSM traditionally has been on capacity allocation efficiency, not capacity utilization efficiency, which is the goal of intelligent data management and archive. The difference is important.

Capacity allocation efficiency refers to the balancing of storage assets so that one set of disks does not become fully populated with data, introducing performance degradation, while another set of disks is only sparsely used to host data. This concept has been extended to refer to the placement of data on spindles that provide necessary performance or services (such as replication) required for that data, and alternatively as a methodology for migrating older data to lower tiers to free up space on more expensive and high performance Tier 1 storage.

In most cases, HSM delivers capacity allocation efficiency without reference to the business context of data itself. The preponderance of HSM algorithms key to three factors: how full is the disk, how old is the data, and when was the data last accessed or modified.

HSM that uses the first criterion, allocated capacity measurement, is sometimes called “watermark-based HSM.” The operational premise is simple: when data amasses to a particular level or watermark, this triggers the migration of certain data to lower tier media. Often, the methodology for selecting which data to move is FIFO (first in, first out). Data that has occupied the media the longest gets moved.
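A minimal sketch of the watermark/FIFO premise follows. It is not any vendor's algorithm. The StoredFile type, the 80 percent trigger, and the 60 percent "drain to" level are our assumptions; the text specifies only that a watermark triggers migration and that the oldest residents move first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredFile:
    name: str
    size_gb: float
    arrival_order: int  # lower number = landed on this tier earlier


def select_for_migration(
    files: List[StoredFile],
    capacity_gb: float,
    high_watermark: float = 0.80,  # assumed trigger level
    low_watermark: float = 0.60,   # assumed level to drain down to
) -> List[StoredFile]:
    """Pick files to demote once usage crosses the high watermark.

    Files are chosen first in, first out until usage falls to the low
    watermark. Nothing about a file's content or business context is
    consulted, which is exactly the limitation discussed in the text.
    """
    used = sum(f.size_gb for f in files)
    if used <= capacity_gb * high_watermark:
        return []  # watermark not reached, nothing moves

    target = capacity_gb * low_watermark
    to_move: List[StoredFile] = []
    for f in sorted(files, key=lambda item: item.arrival_order):
        if used <= target:
            break
        to_move.append(f)
        used -= f.size_gb
    return to_move
```

Note that the only inputs are size and arrival order: a ten-year-old cafeteria menu and a contract under legal hold are treated identically.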

HSM using the second criterion, file time stamps, may operate in concert with watermark/FIFO HSM systems. The trigger for data movement in a pure time stamp HSM process, however, is the age of the file itself. At a given time, for example, all files older than 30 days are automatically migrated to lower tiers.

HSM keyed to the third criterion, date last accessed/date last modified, leverages file metadata that stores last accessed and last modified dates to determine which files to move. For example, any file whose last accessed or last modified date in metadata is older than 60 days is automatically moved to lower tiers of storage.
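The two time-based criteria can be sketched with nothing more than standard file metadata. This is an illustration under stated assumptions, not a product's logic: the function name stale_files is ours, modification time stands in for "age of the file" because creation time is not portably available, and we read "last accessed/last modified" as the more recent of the two stamps. Access times can also be unreliable on file systems mounted with access-time updates reduced or disabled.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional, Union

SECONDS_PER_DAY = 86_400


def stale_files(
    root: Union[str, Path],
    max_age_days: int,
    use_access_time: bool = False,
    now: Optional[float] = None,
) -> Iterator[Path]:
    """Yield files under `root` whose time stamp is older than `max_age_days`.

    use_access_time=False: compare modification time only (assumed proxy
        for the age of the file).
    use_access_time=True: compare the more recent of last-accessed and
        last-modified, so a file must be untouched on both counts.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable entry or broken link: skip it
            if use_access_time:
                stamp = max(info.st_atime, info.st_mtime)
            else:
                stamp = info.st_mtime
            if stamp < cutoff:
                yield path
```

A call such as stale_files("/data", 30) corresponds to the 30-day example above, and stale_files("/data", 60, use_access_time=True) to the 60-day one; the "/data" path is a placeholder. In both cases the selection is made without opening a single file.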

All three varieties of HSM functionality exist in the market today, either as standalone software or as value-add software built on to enterprise array controllers featuring on-array tiering. These are deceptively marketed as intelligent data management solutions, which they are not. Their focus is less on data itself than on capacity allocation in the array or across storage infrastructure. Little if any attention is paid to the contents of the data files that are being moved or their business context, which must ultimately determine how data is hosted and what services are provided to the data during its useful life. Capacity utilization efficiency – placing the right data on the right device at the right time, and exposing it to the right set of services based on business value criteria – is simply not the goal of HSM.

Storage Resource Management is Not Archive

Like HSM, Storage Resource Management (SRM) is not a substitute for intelligent data management. Like HSM, SRM focuses on capacity allocation efficiency. A survey of SRM product whitepapers and marketing materials reveals that storage resource management means different things to different vendors.

Most SRM products, for example, provide tools for managing and monitoring storage device configurations, connectivity, capacity, and performance. Some SRM products provide tools for designing, monitoring and maintaining replication processes intended to protect data, including backup and various forms of local and remote mirroring. Still other SRM products provide tools for implementing processes to automate such routine management tasks as can be automated efficaciously in order to reduce labor cost components in IT administration.

SRM vendors do offer a range of reports that may be useful in exploring data repositories and for identifying candidates both for deletion (in the case of contraband files identifiable by their file extensions) and for migration (identifying last access/last modified metadata). They are very useful in identifying how data is currently laid out on infrastructure and for spotting “hot spots” or other burgeoning conditions that may be impairing access speeds or application performance. Some provide tools to “watch” the on-array algorithms governing value-add features like thin provisioning or de-duplication – often to provide consumers with greater oversight and confidence in these technologies.

Unfortunately, many leading SRM vendors misrepresent the capabilities of their products as data management products. Intentionally or not, they state that SRM delivers the utilization efficiency that is often missing from storage, when the correct verbiage might be “allocation efficiency” or “operational efficiency” of the infrastructure. As the name implies, Storage Resource Management is about managing storage resources, not data.

Information Lifecycle Management is Not Archive

Perhaps the greatest damage ever done to intelligent data management and archive was the marketing around Information Lifecycle Management (ILM) promulgated by leading storage array vendors in the late 1990s and early Aughties. ILM is an old idea, again tracing its origins back to mainframes. IBM correctly asserted that ILM involved, at a minimum, four things.

First, you needed a means to classify data assets (so you would understand what to move around infrastructure). Second, you needed a method for classifying storage assets (so you would know the targets to which classified data could be moved). Third, true ILM required a policy engine that would describe what data to move and under what circumstances or conditions. Finally, you needed a data mover – software that would move the data physically from device to device based on policy.
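The four parts fit together roughly as follows. This is a toy sketch to show the division of labor, not IBM's design: every type, field, and class label here (DataAsset, StorageTarget, Policy, "financial-record", "archive") is hypothetical, and the data mover merely records a new location rather than copying bytes.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class DataAsset:
    path: str
    data_class: str   # part 1: data classification
    location: str     # name of the storage target holding the asset
    age_days: int


@dataclass(frozen=True)
class StorageTarget:
    name: str
    storage_class: str  # part 2: storage classification


@dataclass(frozen=True)
class Policy:
    data_class: str         # which data the rule applies to
    min_age_days: int       # condition under which it applies
    destination_class: str  # class of storage the data should move to


def plan_moves(
    assets: List[DataAsset],
    targets: List[StorageTarget],
    policies: List[Policy],
) -> List[Tuple[DataAsset, StorageTarget]]:
    """Part 3, the policy engine: decide what moves where, and why."""
    by_class = {t.storage_class: t for t in targets}
    moves = []
    for asset in assets:
        for policy in policies:
            if policy.data_class != asset.data_class:
                continue
            if asset.age_days < policy.min_age_days:
                continue
            dest = by_class.get(policy.destination_class)
            if dest is None or dest.name == asset.location:
                continue
            moves.append((asset, dest))
            break  # first matching policy wins
    return moves


def move(asset: DataAsset, dest: StorageTarget) -> DataAsset:
    """Part 4, the data mover: stand-in that only records the new location."""
    return replace(asset, location=dest.name)
```

Strip out the first three pieces and all that remains is move(), which is the point of the criticism that follows.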

When EMC resurrected the ILM concept in the late 1990s, setting off a marketing barrage in which many competitors pitched the same functionality, it ignored the first three parts of what IBM defined as ILM and instead proffered machine processes for data movement: basically, HSM. What resulted came to be criticized as “information Feng Shui management” since it provided no support for data classification, storage classification or policy-driven management – the “heavy lifting” of any ILM process. True ILM is synonymous with intelligent data management, but ILM as represented in most vendor marketing slicks is not true ILM. It is instead analogous to electronic delivery of tax returns: a few years ago, the IRS provided a means to transfer completed returns via email, but offered no tools to preparers for sorting out receipts in shoe boxes based on which were deductible expenses (data classification), for deciding which forms to use for which declarations (storage classification), or for determining which returns needed to be retained in an available state for possible review by examiners (policy rules). By analogy, the IRS didn’t provide a complete “ILM” solution: the heavy lifting of tax preparation remains the burden of the preparer.

Archive is part of a true ILM strategy, but it is not ILM. Archive is one of a set of services that need to be availed to data based on a thoughtful assessment of the business context of the data itself and as part of its lifecycle management.

Going Forward

At the end of the day, a mixture of deliberate and inadvertent confusion has entered into the discussion around archive. The Active Archive Alliance is seeking to redress this confusion and to help consumers address the root cause of high storage cost, out-of-control file proliferation, and regulatory non-compliance: data mismanagement. Getting there will require a systematic and policy-driven mechanism for data classification, data routing across infrastructure, and ultimately a reliable archival repository – preferably an open and standards-based one.

That we haven’t seen any vendor become the industry’s “archive giant” reflects the fact that disinformation around archive is so pervasive, and the educational hurdles to be surmounted prior to making a sale are so daunting, that no single vendor is able or willing to make the necessary investment. Perhaps the Active Archive Alliance can. One thing that might help is the creation of a simple checklist of criteria that the consumer can use to make smart archive purchases. This list should include, at a minimum, the following:

  1. The system should enable the classification of data assets in a granular and business-focused manner, ideally at the time of inception or creation of the data.
  2. The system should enable the creation and centralization of policy-based rules governing data classes.
  3. The system should monitor and maintain a consistent view of storage assets and provide a clear understanding of where data assets are positioned on storage infrastructure at any given time.
  4. The system should be capable of establishing, directly or indirectly, the routing of data assets through infrastructure so that data is exposed to (or excluded from) “storage services” for data protection, reduction, access security, encryption, etc. Ideally, it should also enable the routing of data to hosting platforms that accommodate necessary usage characteristics associated with the data in terms of accessibility and performance.
  5. The system should leverage existing infrastructure management capabilities where appropriate, including pre-defined or domain-based access controls whether organized by user or server. Active Directory in the Microsoft environment is an example.
  6. The system should be file system agnostic to the greatest possible extent, supporting data management universally and regardless of the operating system or file system environment.
  7. The system should not compromise the integrity of data files themselves. Processes used to inventory files and to manage their movement should in no way truncate the file header, making it impossible to re-scan the file complex should the file management system itself become compromised. (Some products strip the metadata header from the file and place only the file payload in a repository. The metadata header information is stored in the data management system database, and the payload data cannot be recovered if the data management system database is corrupted. This type of “stubbing” is to be avoided because it puts managed data at risk.)
  8. The system itself should have security features to prevent unauthorized access to or revision of data management policies. Ideally, it should also be designed for use by different data managers, including Governance, Risk and Compliance (GRC) managers, business department heads, IT management, or other groups who might have legitimate roles in defining and policing data management policy.
  9. The system should be flexible in its support for both centralized archive and decentralized or federated data management.
  10. As a practical matter, the system should be transparent to users and applications. Repeated case studies have shown that approaches that impose a requirement on end users to become involved in the classification of their own data tend not to be sustainable. Moreover, it should be as automated as possible, enabling policies to be applied quickly to an existing corpus of data and extended to new data rapidly using existing policy templates. A number of approaches, shown at left, have been suggested for the implementation of data management. Automatic classification based on “deep blue math” algorithms remains the holy grail, as does the wholesale replacement of the conventional file system with a database or other organizing metaphor. Today, however, data classification based on user role shows the most promise as a methodology for applying data classification in a way that satisfies the transparency and automation goals of effective data management.
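To make the role-based approach mentioned in item 10 concrete, here is a minimal sketch. It assumes that each file's owner can be resolved to a role through an existing directory service (item 5 cites Active Directory as one example); the role names, class labels, and lookup tables below are hypothetical stand-ins for that service.

```python
from typing import Dict

# Hypothetical mapping from a user's business role to a data class.
ROLE_TO_DATA_CLASS: Dict[str, str] = {
    "accounts-payable": "financial-record",
    "hr-manager": "personnel-record",
    "engineer": "design-document",
}

DEFAULT_CLASS = "unclassified"


def classify_by_role(
    owner: str,
    user_roles: Dict[str, str],
    role_map: Dict[str, str] = ROLE_TO_DATA_CLASS,
) -> str:
    """Return a data class for a file based solely on its owner's role.

    `user_roles` stands in for a directory lookup (user name -> role).
    Unknown users or unmapped roles fall back to DEFAULT_CLASS, so the
    end user is never asked to classify anything.
    """
    role = user_roles.get(owner, "")
    return role_map.get(role, DEFAULT_CLASS)
```

The classification is coarse, but it happens at creation time with no user involvement, which is what the transparency and automation goals require.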

 

Sometimes simple tools can make all the difference.  IT-SENSE.org wishes the Active Archive Alliance good luck.
