Integrated Problem Diagnosis and Repair in Databases and Storage Area Networks


Amulet and DIADS, and now, Flex

We are developing two systems at Duke, Amulet and DIADS, to simplify the integrated management of (i) database systems and (ii) the storage systems that provide persistent storage for these database systems. This project is generously supported by NSF Award 0917062, startup funds from Duke, and a faculty award from IBM.

Flex: Flex is our recent attempt to bring the features of Amulet and DIADS, and much more, into a single system.

Amulet: Occasional corruption of stored data is an unfortunate byproduct of the complexity of modern systems. Hardware errors, software bugs, and mistakes by human administrators can corrupt important sources of data. A tragic story here involves a social-bookmarking site that experienced an instance of file-system corruption that spread silently to data backups. Only when disaster struck did the administrators realize that data stored by most users was lost irretrievably. The site went out of business shortly after the problem happened. We know of a case where an enterprise DBMS kept crashing when database administrators (DBAs) attempted to bring it online from backups that happened to be corrupted, and the system was unavailable for a week before the problem was fixed manually. More recently, corruption of four files in an Oracle database at JPMorgan Chase caused a severe outage that left customers stranded and blocked about $132 million in financial transactions. Amazon S3, the Apache CouchDB NoSQL database, Berkeley DB, MySQL, PostgreSQL, and many others have caused such corruption of valuable data.

The dominant practice to deal with data corruption today involves administrators writing ad hoc scripts that run data-integrity tests at the application, database, file-system, and storage levels. This manual approach is tedious, error-prone, and provides no understanding of the potential system unavailability and data loss if a corruption were to occur.
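To make the manual approach concrete, the following is a minimal sketch of the kind of ad hoc file-level integrity script described above. It is not part of Amulet; the function names (`sha256_of`, `build_manifest`, `find_corrupted`) and the choice of SHA-256 checksums are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files):
    """Record a known-good checksum for each file (hypothetical helper)."""
    return {str(p): sha256_of(Path(p)) for p in files}


def find_corrupted(manifest):
    """Return, in sorted order, files that are missing or whose checksum changed."""
    corrupted = []
    for name, expected in manifest.items():
        p = Path(name)
        if not p.exists() or sha256_of(p) != expected:
            corrupted.append(name)
    return sorted(corrupted)
```

A script like this only detects a mismatch after the fact. It says nothing about how often to run the test, what it costs, or how much data would be lost if corruption reached the backups, which is the gap described above.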

The Amulet system addresses the problem of verifying the correctness of stored data proactively and continuously. To our knowledge, Amulet is the first system that: (i) gives administrators a declarative language to specify their objectives regarding the detection and repair of data corruption; (ii) contains optimization and execution algorithms to ensure that the administrator's objectives are met robustly and at least cost, e.g., using pay-as-you-go cloud resources; and (iii) provides timely notification when corruption is detected, allowing proactive repair of corruption before it impacts users and applications. We are implementing and evaluating Amulet for a database software stack deployed on an infrastructure-as-a-service cloud provider like Amazon Web Services.

DIADS: Databases are typically used as a subsystem in a larger system that contains Web servers, application servers, and network-attached storage servers. Such complex systems experience some form of change all the time, e.g., an update to a Java module in the application server, a statistics update in the database, or a RAID rebuild in a storage volume. Changes in different subsystems can cause an overall performance degradation whose cause is hard to diagnose. The diagnosis task is all the more daunting because enterprise environments have isolated administration teams and tools for each subsystem.

DIADS is an integrated tool that automates complex administrative tasks like problem diagnosis, what-if analysis, orchestrating disaster recovery, and online tuning when a database is used as a subsystem in a larger system. DIADS contains two technical innovations. The first concerns monitoring data. Problem diagnosis involves reconstructing system behavior at various points in time using historical and current monitoring data collected from the system. However, the amount and quality of monitoring data available from production systems are constrained by the need to keep monitoring overhead low. DIADS uses an abstraction called the Annotated Plan Graph to represent and reason about database behavior in the context of a larger system. Annotated Plan Graphs are generated from lightweight monitoring data.
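As a rough illustration of the idea, the sketch below models a graph whose nodes are query-plan operators and system components, with monitoring samples attached to each node. This is an assumed structure for exposition only; the class and method names (`AnnotatedPlanGraph`, `add_dependency`, `annotate`, `reachable`) are hypothetical and do not reflect the actual DIADS data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # e.g. "operator" or "component" (assumed categories)
    metrics: dict = field(default_factory=dict)  # metric name -> list of samples


@dataclass
class AnnotatedPlanGraph:
    nodes: dict = field(default_factory=dict)       # name -> Node
    depends_on: dict = field(default_factory=dict)  # name -> set of names

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)
        self.depends_on.setdefault(name, set())

    def add_dependency(self, src, dst):
        """Record that `src` relies on `dst` (e.g. a scan relies on a volume)."""
        self.depends_on.setdefault(src, set()).add(dst)

    def annotate(self, name, metric, value):
        """Attach one monitoring sample to a node."""
        self.nodes[name].metrics.setdefault(metric, []).append(value)

    def reachable(self, start):
        """All nodes that `start` depends on, directly or transitively."""
        seen, stack = set(), [start]
        while stack:
            for nxt in self.depends_on.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen
```

With such a structure, a slow plan operator can be traced to the storage components it depends on, and only the metrics attached to those components need to be examined.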

The other innovation in DIADS is a suite of workflows for administrative tasks that combine machine-learning techniques with domain knowledge from system experts. For example, for problem diagnosis, the machine-learning part of the workflow provides core techniques to handle large and noisy streams of monitoring data, while the domain-knowledge part acts as checks-and-balances to guide the diagnosis in the right direction. This unique design enables DIADS to function effectively even in the presence of multiple concurrent problems as well as noisy monitoring data prevalent in production environments. DIADS is being prototyped for research and educational purposes in a datacenter setting with PostgreSQL databases and an enterprise-level storage area network.
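The division of labor in such a workflow can be sketched as follows. This is a simplified stand-in, not the DIADS workflow: a z-score against a baseline takes the place of the machine-learning step, and a set of metrics that an expert considers plausible causes takes the place of the domain-knowledge step. All names, metric labels, and the threshold value are assumptions.

```python
from statistics import mean, stdev


def anomaly_scores(baseline, recent):
    """Score each metric by how far its recent mean lies from the baseline mean,
    measured in baseline standard deviations (a stand-in for the ML step)."""
    scores = {}
    for metric, base in baseline.items():
        sigma = stdev(base)
        if sigma == 0:
            continue  # a constant baseline gives no usable scale
        scores[metric] = abs(mean(recent[metric]) - mean(base)) / sigma
    return scores


def diagnose(baseline, recent, plausible_causes, threshold=3.0):
    """Keep only anomalous metrics that domain knowledge accepts as possible causes."""
    scores = anomaly_scores(baseline, recent)
    flagged = [m for m, s in scores.items() if s >= threshold]
    return sorted(m for m in flagged if m in plausible_causes)
```

The statistical step narrows many noisy metrics down to a few candidates, and the domain-knowledge filter discards candidates that are anomalous but could not explain the observed problem.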

Primary Project Members