Nearline 2.0 vs. the Archive

In his most recent SAND blog post, Richard introduced the notion of “Nearline 2.0” and discussed how this concept, and related best practices, can be of vital importance to businesses dealing with the “data tsunami” we’ve been experiencing in recent years.

In this post, I’d like to step back a moment and explore the ways in which the dynamics of Nearline 2.0 differ from traditional methods of data archiving in terms of their approach to keeping data warehouse size under control.

Putting Your Database “on a Diet”

Arthur Ritchie

Faced with massive and continually increasing growth in data volumes, data warehouse administrators have come up with a number of techniques designed to maintain acceptable warehouse performance. These include pre-building aggregates and Key Performance Indicators (KPIs) from large amounts of detailed transaction data, and indexing as many columns as possible in order to speed up query processing. As data warehouses continue to grow, however, the time required to do all the necessary preprocessing of data increases to the point where these tasks can no longer be performed within the available “batch windows”, the periods when the warehouse is not being accessed by users. So, trade-offs need to be made. Doing less preprocessing work reduces the required time, but it also means that queries that depend on aggregates, KPIs or additional indexes may take an inordinately long time to run, and may severely degrade performance for other users as the system attempts to do the processing “on the fly”. This impasse leaves two possible choices: either stop providing the analytic functionality, which makes the system less valuable and its users more frustrated, or “put the database on a diet” by moving some of the data it contains to another location.

Both Nearline 2.0 and archiving solutions can help trim down an over-expanded database. Both allow a substantial reduction of database size through an Information Lifecycle Management (ILM) approach, in which unused or infrequently used detailed transactional data is removed from the online database and stored elsewhere. When the database is smaller, it will perform better and be capable of supporting a wider variety of user needs. Aggregates and KPIs will be built from a much smaller amount of detailed transaction data. Additionally, column indexing will be more practicable, as there will be fewer rows per column to be indexed. The natural side effect is, of course, that much less data remains in the online database to be analyzed and compared.
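To make the batch-window argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the row counts, the throughput figure, the six-hour window, and the simplification that preprocessing time grows linearly with the number of detailed rows kept online. None of these numbers come from a real warehouse or from any SAND product.

```python
from dataclasses import dataclass


@dataclass
class Warehouse:
    detail_rows: int    # detailed transaction rows kept online (hypothetical)
    rows_per_hour: int  # assumed preprocessing throughput (hypothetical)


def preprocessing_hours(w: Warehouse) -> float:
    """Hours to rebuild aggregates, KPIs and indexes, assuming linear scaling."""
    return w.detail_rows / w.rows_per_hour


def fits_batch_window(w: Warehouse, window_hours: float) -> bool:
    """True if all preprocessing completes within the batch window."""
    return preprocessing_hours(w) <= window_hours


def rows_to_move(w: Warehouse, window_hours: float) -> int:
    """Minimum number of rows to move out so preprocessing fits the window."""
    capacity = int(window_hours * w.rows_per_hour)
    return max(0, w.detail_rows - capacity)


# Illustrative figures only.
before = Warehouse(detail_rows=900_000_000, rows_per_hour=100_000_000)
excess = rows_to_move(before, window_hours=6)
after = Warehouse(before.detail_rows - excess, before.rows_per_hour)

print(preprocessing_hours(before), fits_batch_window(before, 6))
print(excess, preprocessing_hours(after), fits_batch_window(after, 6))
```

With these made-up figures, preprocessing needs 9 hours against a 6-hour window, and moving 300 million rows out of the online database brings it back to exactly 6 hours. A real warehouse will not scale this neatly, but the direction of the effect is the same.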

Getting “Lean” Not “Mean”

There are a number of important differences between archiving warehouse data (using products from Open Text, EMC Documentum, and so on) and storing it in Nearline 2.0 (using SAND/DNA). However, since both types of product are used to hold data that has been moved out of the main “online” system, it is unclear to some why one would need to be implemented if the other is in place. To help clarify why one or the other type of system (or both) might be required in a given situation, it is worthwhile to go over the major points of contrast between Nearline 2.0 data and archived data.

Online, Nearline 2.0, and Archive

Archive

Normally, the concept of electronic archiving focuses on the preservation of documents or data in a form that has some sort of certifiable integrity (for example, conformity to legal requirements), is immune to unauthorized access and tampering, and readily supports certain record management operations within a defined process – for example, automatic deletion after a certain period, or retrieval when requested by an auditor. The archive is in fact a kind of operational system for processing documents/data that are no longer in active use.

The notion of archiving has traditionally focused on unstructured data in the form of documents, but similar concepts can be applied to structured data in the warehouse. An archive for SAP BI, for example, would preserve warehouse data that is no longer needed for analytical use but which needs to be kept around because it may be required by auditors, as would be the case if SAP BI data were used as the basis for financial statements. The archive data does not need to be directly accessible to the user community, just locatable and retrievable in case it is required for inspection or verification – not for analysis in the usual sense. In fact, because much of the data that needs to be preserved in the archive is fairly sensitive (for example, detailed financial data), the ability to access it may need to be strictly regulated.

While many vendors of archiving solutions stress the performance benefits of reducing the amount of data in the online database, accessing the archived data is a complicated and relatively slow process: the data must either be located and then restored into the online database, or be accessed directly in a much slower backup database that is not maintained with performance or accessibility in mind. For this reason, it is unrealistic to expect archived data to be usable for analysis or reporting purposes.

Nearline 2.0

In the Information Lifecycle Management approach, the Nearline 2.0 repository holds data that is used less frequently than the “hottest”, most current data, but which still needs to be readily available for analysis, or for constructing new or revised analytic objects that the warehouse can use to evaluate emerging trends.

While the exact proportion of Nearline 2.0 to online data will vary, the amount of “less frequently used” data that needs to be kept available is normally quite large. Moving this out of the main database greatly reduces the pressure on the online database and enables continued performance of standard database operations within available time windows, even in the face of the explosive data growth that many organizations are currently facing.

Thus, the archiving requirements described above do not apply to a Nearline 2.0 product such as SAND/DNA, which is designed to reduce the size of the online warehouse database while keeping the data more or less transparently accessible to end users who may need it for analysis, for rebuilding KPIs, and so on.
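The contrast between the three tiers can be summed up in a small Python sketch. This is only an illustration of the decision logic described above, not how SAND/DNA or any archiving product actually works. The `Partition` fields, the `needed_for_analysis` flag and the 90-day “hot” threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Tier(Enum):
    ONLINE = "online"
    NEARLINE = "nearline"
    ARCHIVE = "archive"


@dataclass
class Partition:
    name: str                  # hypothetical partition label
    last_accessed: date        # last time users or KPI rebuilds touched it
    needed_for_analysis: bool  # False: kept only for audit or verification


def assign_tier(p: Partition, today: date, hot_days: int = 90) -> Tier:
    """Pick a tier for one partition; the 90-day threshold is an assumption."""
    if not p.needed_for_analysis:
        # Kept only in case an auditor asks for it: archive.
        return Tier.ARCHIVE
    idle_days = (today - p.last_accessed).days
    # Recently used data stays online; the rest stays readily
    # available for analysis in the nearline repository.
    return Tier.ONLINE if idle_days <= hot_days else Tier.NEARLINE


today = date(2024, 1, 1)
partitions = [
    Partition("sales_2023_q4", date(2023, 12, 15), True),
    Partition("sales_2021", date(2023, 3, 1), True),
    Partition("ledger_2015", date(2016, 6, 30), False),
]
tiers = {p.name: assign_tier(p, today) for p in partitions}
for name, tier in tiers.items():
    print(name, tier.value)
```

The point of the sketch is that the archive decision depends on whether the data is still needed for analysis at all, while the online versus nearline decision depends only on how often it is used.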

About SAND Technology

SAND Technology provides Data Management Software and Best Practices for storing, accessing, and analyzing large amounts of data on-demand while lowering TCO, leveraging existing infrastructure and improving operational performance.

SAND/DNA solutions include CRM analytics, and specialized applications for government, healthcare, financial services, telecommunications, retail, transportation, and other business sectors. SAND/DNA has achieved "Certified for SAP NetWeaver" status and SAND Nearline Integration Controller has achieved "Powered by SAP NetWeaver" status.

SAND Technology has offices in the United States, Canada, the United Kingdom and Central Europe.