Home Page of the Ques Project

Ques


The increasing complexity, scale, and dynamics of networked computing systems make it hard for users and system administrators to understand and control these systems. Recent studies indicate that a significant fraction of user time is wasted because of unexpected system slowdowns, crashes, and application errors. Business-critical systems often have hundreds of components (e.g., applications, databases, servers, routers) whose performance depends on thousands of intricate and time-varying dependencies and parameters. The Ques project aims to arrest and reverse the dangerous spiral towards unwieldy systems, high administrative costs, and frustrated users.

Ques is supported generously by NSF CAREER Award Number 0644106, startup funds from Duke, three faculty awards from IBM, and an equipment grant (jointly with three other Duke faculty members) from IBM.

Project Details

Ques tackles system management through innovative data management. Ques treats a computing system as a rich source of data about system configuration and activity, available typically as continuous, rapid, and time-varying data streams. The system data---e.g., multidimensional time-series of performance and utilization metrics, control and data-flow paths of requests, and error messages---is collected in an efficient and controlled fashion. Ques gives users and administrators the ability to pose a broad range of system-management queries over this data:

Health monitoring, e.g., which applications cause the most I/O?
Change (anomaly) detection, e.g., alert me when resource-usage patterns change significantly.
Diagnosis, e.g., why is it taking 10 seconds to add an item to a shopping cart?
Forecasting, e.g., what will the system throughput be 1 hour from now?
What-if, e.g., how will processor utilization change if the database cache size is increased by 10%?
Recommendation, e.g., what resource allocation to a hurricane-prediction workflow will guarantee its completion within 30 minutes?
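
As a rough illustration of what a change-detection query might compute over a metric stream, the sketch below flags samples that deviate sharply from the running statistics, looking at each sample only once. The function name `detect_changes`, the threshold, the warm-up length, and the sample data are all assumptions made for this example; they are not part of Ques.

```python
import math
from typing import Iterable, Iterator, Tuple


def detect_changes(stream: Iterable[float], threshold: float = 3.0,
                   warmup: int = 5) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for samples that lie more than `threshold`
    standard deviations from the mean of the samples seen before them."""
    count, mean, m2 = 0, 0.0, 0.0  # running count, mean, sum of squared deviations
    for i, x in enumerate(stream):
        # Test the new sample against statistics of earlier samples only.
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(x - mean) > threshold * std:
                yield i, x
        # Welford's one-pass update of mean and variance.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)


# Hypothetical CPU-utilization samples (percent) with one obvious spike.
cpu_util = [40.0, 42.0, 41.0, 39.0, 40.0, 41.0, 95.0, 40.0]
alerts = list(detect_changes(cpu_util))
print(alerts)
```

The running mean and variance need constant memory per metric, which is why this style of computation suits continuous streams; a real detector would likely use windows or decay so that old data does not dominate.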

Ques-Querying addresses challenges in developing simple and intuitive ways to express such queries (e.g., using a visual interface, declarative query language, or keyword search) and in processing the queries automatically and efficiently using execution plans. These plans use statistical (e.g., neural network) and performance (e.g., queuing network) models learned from system data, as well as operators for data transformation (e.g., feature selection) and inference. We have developed algorithms that quickly navigate the huge plan space of models, model parameters, and transformations using techniques like cost estimation (estimating plan accuracy and execution time using statistics) and active learning (executing sample plans for learning purposes).
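
To make the idea of cost-based plan selection concrete, here is a minimal sketch that picks the plan with the highest estimated accuracy among those whose estimated execution time fits a budget. The `Plan` fields, the `choose_plan` function, and all numbers are hypothetical; the actual Ques algorithms and plan representation are described in the papers listed below, not here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    model: str            # e.g., "neural network" or "queuing network"
    transformation: str   # e.g., "feature selection"
    est_accuracy: float   # estimated accuracy in [0, 1] (assumed to come from statistics)
    est_seconds: float    # estimated execution time in seconds


def choose_plan(plans: List[Plan], time_budget: float) -> Optional[Plan]:
    """Return the most accurate plan that fits the time budget, or None."""
    feasible = [p for p in plans if p.est_seconds <= time_budget]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties in favor of the faster plan.
    return max(feasible, key=lambda p: (p.est_accuracy, -p.est_seconds))


# Made-up candidate plans for a single query.
candidates = [
    Plan("neural network", "feature selection", 0.92, 40.0),
    Plan("queuing network", "none", 0.85, 2.0),
    Plan("neural network", "none", 0.90, 120.0),
]
best = choose_plan(candidates, time_budget=60.0)
print(best)
```

Enumerating every candidate is only feasible for tiny plan spaces; the point of the sketch is the objective (accuracy versus execution time), not the search strategy.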

Ques-Control is an ambitious next step beyond Ques-Querying: it aims to enable automated control of complex computing systems under changing conditions, based on policies specified by system administrators. Like Ques-Querying, Ques-Control learns models of system behavior from data collected passively or through active perturbation. Given a set of system policies P, Ques-Control derives a controller (an execution plan based on sensing, actuation, and feedback) that enforces P at all times. Ques-Control poses interesting challenges in policy-interface design, acquiring the right training data to model specific system behavior quickly, robustness to bursty workloads, and proactive system tuning.
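
The sense, actuate, and feedback loop can be sketched with a simple proportional controller that resizes a server pool to hold latency at a target. Everything here is an assumption for illustration: the policy (a latency target), the toy system model `simulated_latency`, the gain, and the workload. It is not the controller that Ques-Control derives.

```python
def control_step(measured: float, target: float, allocation: int,
                 gain: float = 1.0, min_alloc: int = 1, max_alloc: int = 64) -> int:
    """One feedback iteration: scale the allocation by the relative error."""
    error = (measured - target) / target
    proposed = allocation * (1.0 + gain * error)
    # Clamp to the allowed range and round to a whole number of servers.
    return max(min_alloc, min(max_alloc, round(proposed)))


def simulated_latency(load: float, servers: int) -> float:
    """Toy system model (assumed): latency is proportional to load per server."""
    return load / servers


target_latency = 5.0
servers = 2
history = []
# The workload doubles halfway through, standing in for changing conditions.
for load in [40.0, 40.0, 40.0, 80.0, 80.0, 80.0]:
    latency = simulated_latency(load, servers)                       # sense
    history.append((servers, latency))
    servers = control_step(latency, target_latency, servers)         # actuate

print(history)
```

In this toy setting the loop settles one step after each load change because the model is exactly inversely proportional to the allocation; real systems are noisier, which is where learned models and robustness to bursty workloads come in.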

Ques seeks to advance the state of the art in our ability to understand and control computing systems in a number of ways:

No current system-management product supports Ques's broad range of queries or their combinations: health monitoring, anomaly detection, diagnosis, forecasting, what-if, and recommendation. Furthermore, Ques targets a comprehensive long-term solution for system management by automating the generation of plans for executing queries and enforcing policies. This approach requires extensive collection and analysis of system configuration and activity data---e.g., performance metrics, resource utilization, execution and stack traces, error messages, workload, network packets, source code, and help manuals---both passively and through controlled system perturbation.
Busy data centers generate more than 1 Terabyte of log data per day. More fine-grained logging can make this size 100x larger. To query such massive time-varying datasets, Ques is pushing the envelope of data-stream technology where data is modeled as continuous streams and queried using a "you-get-one-look" approach. In addition, Ques supports controlled data collection to balance the inherent cost-accuracy tradeoff.
Ques is removing technical barriers to automated policy-driven control of computing systems. Recent industrial initiatives like IBM's Autonomic Computing and Microsoft's Dynamic Systems Initiative highlight the pressing need for such control.
Current system-management products are usually of little help to desktop users facing unexpected system slowdowns or misbehaving applications. Ques makes system management accessible to the large and diverse class of users and developers who administer their own systems.
Current system-management products have fairly rigid interfaces and require a lot of system expertise to use. Ques is rethinking the system-management interface for a broad spectrum of potential users. As one example, system administrators need effective ways to input domain knowledge, while desktop users prefer keyword queries on a personalized engine for desktop and Internet search.

We are committed to building a fully functional prototype of Ques and deploying it in real-world settings. With each novel component of Ques, we will: (i) perform the research and evaluation using a prototype in a testbed setting, with both synthetic and real applications and data, (ii) demonstrate the prototype at a leading conference, (iii) make the demonstration available publicly on the Internet, (iv) do a real-world deployment and user studies if there is sufficient interest, and (v) release the source code publicly. The effectiveness of Ques will be tested by deploying it to manage workloads on a virtualized, service-oriented, and on-demand computing platform on our departmental research-computing cluster. We have also had encouraging preliminary discussions with the administrators of a university-wide production cluster used heavily for computational-science applications. We have established industrial collaborations with IBM, with the eventual goal of transferring technology from Ques to industrial-strength system-management products.

Project Members

Shivnath Babu, Associate Professor, Duke Computer Science
Botong Huang, Ph.D. Candidate, Duke Computer Science
Harold Lim, Ph.D. Candidate, Duke Computer Science
Jie Li, Ph.D. Candidate, Duke Computer Science

Collaborators

Ashraf Aboulnaga, Associate Professor, Computer Science, University of Waterloo
Jeff Chase, Professor, Duke Computer Science
Brent Miller, Autonomic Computing Group, IBM
Kamesh Munagala, Associate Professor, Duke Computer Science
Sandeep Uttamchandani, IBM Almaden Research Center
Jun Yang, Associate Professor, Duke Computer Science

Alumni

Nick Bodnar, Worked on Ques as an undergraduate student
Nedyalko Borisov, Worked on Ques as a Ph.D. student, First employment at Facebook
Garrett Bressler, Worked on Ques as a high-school student, joined Brown University
Brian Cook, Worked on Ques as an M.S. student, joined IBM
Songyun Duan, Worked on Ques as a Ph.D. student, First employment at IBM Research
Peter Franklin, Worked on Ques as an undergraduate student, First employment at Microsoft
Herodotos Herodotou, Worked on Ques as a Ph.D. student, First employment at Microsoft
Jack Li, Worked on Ques as an undergraduate student
Piyush Shivam, Worked on Ques as a Ph.D. student, First employment at Sun Microsystems
Jonathan Su, Worked on Ques as an undergraduate student
Vamsidhar Thummala, Ph.D. Candidate, Duke Computer Science
Dongdong Zhao, Worked on Ques as an M.S. student

Publications

N. Borisov and S. Babu. Rapid Experimentation for Testing and Tuning a Production Database Deployment
In Proc. of the International Conference on Extending Database Technology (EDBT), March 2013

H. Lim, Y. Han, and S. Babu. How to Fit When No One Size Fits
In Proc. of the Sixth Biennial Conference on Innovative Data Systems Research (CIDR), January 2013

R. Thonangi, S. Babu, and J. Yang. A Practical Concurrent Index for Solid-State Drives
In Proc. of CIKM, October 2012

H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation-based Optimizer for MapReduce Workflows
In Proc. of PVLDB 5(11), August 2012

H. Herodotou, F. Dong, and S. Babu. No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics
In Proc. of the ACM Symposium on Cloud Computing 2011 (ACM SOCC 2011), October 2011

H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
In Proc. of the 2011 Intl. Conference on Very Large Data Bases (VLDB), August 2011

H. Herodotou, N. Borisov, and S. Babu. Query Optimization Techniques for Partitioned Tables
In Proc. of the 2011 ACM Intl. Conf. on Management of Data (SIGMOD), June 2011

H. Herodotou and S. Babu. Xplus: A SQL-Tuning-Aware Query Optimizer
In Proc. of PVLDB Volume 3 (the International Conference on Very Large Databases (VLDB)), September 2010

S. Babu. Towards Automatic Optimization of MapReduce Programs
In Proc. of the ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010), June 2010

H. Lim, S. Babu and J. Chase. Automated Control for Elastic Storage
In Proc. of the Intl. Conference on Autonomic Computing (ICAC 2010), June 2010

S. Duan, V. Thummala, and S. Babu. Tuning Database Configuration Parameters with iTuned
In Proc. of the International Conference on Very Large Databases (VLDB), August 2009

H. Herodotou and S. Babu. Automated SQL Tuning through Trial and (Sometimes) Error
In Proc. of the Second Workshop on Testing Database Systems (DBTest), June 2009

M. Ahmad, A. Aboulnaga, and S. Babu. Query Interactions in Database Workloads
In Proc. of the Second Workshop on Testing Database Systems (DBTest), June 2009

A. Demberel, J. Chase, and S. Babu. Reflective Control for an Elastic Cloud Application: An Automated Experiment Workbench
In Proc. of the First Workshop on Hot Topics in Cloud Computing (HotCloud), in conjunction with USENIX Annual Technical Conference, June 2009

H. Lim, S. Babu, J. Chase, and S. Parekh. Automated Control in Cloud Computing: Challenges and Opportunities
In Proc. of the First Workshop on Automated Control for Datacenters and Clouds, June 2009

S. Babu, N. Borisov, S. Duan, H. Herodotou, and V. Thummala. Automated Experiment-Driven Management of (Database) Systems
In Proc. of the 12th Workshop on Hot Topics in Operating Systems (HotOS), May 2009

S. Duan, S. Babu, and K. Munagala. Fa: A System for Automating Failure Diagnosis (full paper)
In Proc. of 2009 IEEE International Conference on Data Engineering (ICDE), April 2009

S. Babu, N. Borisov, S. Uttamchandani, R. Routray, and A. Singh. DIADS: Addressing the "My-Problem-or-Yours" Syndrome with Integrated SAN and Database Diagnosis
In Proc. of the USENIX Conference on File and Storage Technologies (FAST), February 2009

N. Borisov, S. Uttamchandani, R. Routray, and A. Singh. Why Did My Query Slow Down?
In Proc. of the 2009 Conference on Innovative Data Systems Research (CIDR), January 2009

S. Duan and S. Babu. Empirical Comparison of Techniques for Automated Failure Diagnosis
In Proc. of the Third Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), December 2008

M. Ahmad, A. Aboulnaga, S. Babu, and K. Munagala. Modeling and Exploiting Query Interactions in Database Systems
In Proc. of ACM International Conference on Information and Knowledge Management (CIKM), October 2008

R. Thonangi, V. Thummala, and S. Babu. Finding Good Configurations in High-Dimensional Spaces: Doing More with Less
In Proc. of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), September 2008

S. Babu. Grand Challenge: Experiment-driven Adaptive Systems
Vision paper written for invitation to the third Workshop on Hot Topics in Autonomic Computing (HotAC III), June 2008

P. Shivam, V. Marupadi, J. Chase, and S. Babu. Cutting Corners: Workbench Automation for Server Benchmarking
In Proc. of the 2008 USENIX Annual Technical Conference, June 2008

S. Duan and S. Babu. Guided Problem Diagnosis through Active Learning
In Proc. of the International Conference on Autonomic Computing (ICAC), June 2008

S. Babu, S. Duan, and K. Munagala. Processing Diagnosis Queries: A Principled and Scalable Approach
Poster at the International Conference on Data Engineering (ICDE), April 2008

M. Ahmad, A. Aboulnaga, S. Babu, and K. Munagala. QShuffler: Getting the Query Mix Right
Poster at the International Conference on Data Engineering (ICDE), April 2008

S. Duan and S. Babu. Processing Forecasting Queries
In Proc. of the International Conference on Very Large Databases (VLDB), September 2007

B. Chandramouli, C. Bond, S. Babu, and J. Yang. Query Suspend and Resume
In Proc. of the 2007 ACM Intl. Conf. on Management of Data (SIGMOD), June 2007

A. Yumerefendi, P. Shivam, D. Irwin, P. Gunda, L. Grit, A. Demberel, J. Chase, and S. Babu. Towards an Autonomic Computing Testbed
In Workshop on Hot Topics in Autonomic Computing (HotAC), June 2007

B. Cook, S. Babu, G. Candea, and S. Duan. Towards Self-Healing Multitier Services
In Second Intl. Workshop on Self-Managing Database Systems (SMDB), April 2007

B. Chandramouli, C. Bond, S. Babu, and J. Yang. On Suspending and Resuming Dataflows (poster).
In Proc. of IEEE International Conference on Data Engineering (ICDE), April 2007

P. Shivam, S. Babu, and J. Chase. Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications
In Proc. of the International Conference on Very Large Databases (VLDB), September 2006

P. Shivam, S. Babu, and J. Chase. Active Sampling for Accelerated Learning of Performance Models
In Proc. of the First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), June 2006

P. Shivam, S. Babu, and J. Chase. Learning Application Models for Utility Resource Planning
In Proc. of IEEE International Conference on Autonomic Computing (ICAC), June 2006

S. Babu, P. Bizarro, and D. DeWitt. Proactive Re-optimization
In Proc. of the 2005 ACM Intl. Conf. on Management of Data (SIGMOD 2005), June 2005
The Rio system described in this paper was demonstrated at SIGMOD 2005, June 2005

S. Babu and P. Bizarro. Adaptive Query Processing in the Looking Glass
In Proc. of the Second Biennial Conference on Innovative Data Systems Research (CIDR), January 2005

Demonstrations

H. Lim and S. Babu. Execution and Optimization of Continuous Queries with Cyclops.
Demonstrated at the 2013 ACM Intl. Conf. on Management of Data (SIGMOD 2013), June 2013

H. Herodotou, F. Dong, and S. Babu.
MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish.
Demonstrated at the 2011 International Conference on Very Large Data Bases (VLDB), August 2011

V. Thummala and S. Babu. A Tool for Configuring and Visualizing Database Parameters
Winner of the SIGMOD'10 Best-demo Award Competition!
Demonstrated at the 2010 ACM Intl. Conf. on Management of Data (SIGMOD 2010), June 2010

S. Duan, P. Franklin, V. Thummala, D. Zhao, and S. Babu. Shaman: A Self-Healing Database System
Demonstrated at the 2009 IEEE International Conference on Data Engineering (ICDE), April 2009

S. Duan and S. Babu. Automated Diagnosis of System Failures with Fa
Demonstrated at the 2009 IEEE International Conference on Data Engineering (ICDE), April 2009

P. Shivam, A. Demberel, P. Gunda, D. Irwin, L. Grit, A. Yumerefendi, S. Babu, and J. Chase.
Automated and On-Demand Provisioning of Virtual Machines for Database Applications
Demonstrated at the 2007 ACM Intl. Conf. on Management of Data (SIGMOD 2007), June 2007

S. Duan and S. Babu. Proactive Identification of Performance Problems
Demonstrated at the 2006 ACM Intl. Conf. on Management of Data (SIGMOD 2006), June 2006

S. Babu, P. Bizarro, and D. DeWitt. Proactive Re-optimization with Rio
Demonstrated at the 2005 ACM Intl. Conf. on Management of Data (SIGMOD 2005), June 2005