Ques
Project Summary
The increasing complexity, scale, and dynamics of networked computing
systems make it hard for users and system administrators to understand
and control these systems. Recent studies indicate that a significant
fraction of user time gets wasted because of unexpected system
slowdowns, crashes, and application errors. Business-critical systems
often have hundreds of components---e.g., applications, databases,
servers, routers---whose performance depends on thousands of intricate
and time-varying dependencies and parameters. The Ques project
aims to arrest and reverse the dangerous spiral towards unwieldy
systems, high administrative costs, and frustrated users.
Ques is supported generously by NSF CAREER Award Number 0644106, startup
funds from Duke, three faculty awards from IBM, and an equipment grant
(jointly with three other Duke faculty members) from IBM.
Project Details
Ques tackles system management through innovative data management.
Ques treats a computing system as a rich source of data about system
configuration and activity, available typically as continuous, rapid,
and time-varying data streams. The system data---e.g.,
multidimensional time-series of performance and utilization metrics,
control and data-flow paths of requests, and error messages---is
collected in an efficient and controlled fashion. Ques gives users
and administrators the ability to pose a broad range of
system-management queries over this data:
- Health monitoring, e.g., which applications cause the most I/O?
- Change (anomaly) detection, e.g., alert me when resource-usage
patterns change significantly.
- Diagnosis, e.g., why is it taking 10 seconds to add an item to a
shopping cart?
- Forecasting, e.g., what will the system throughput be 1 hour from
now?
- What-if, e.g., how will processor utilization change if the
database cache size is increased by 10%?
- Recommendation, e.g., what resource allocation to a
hurricane-prediction workflow will guarantee its completion within 30
minutes?
Ques-Querying addresses challenges in developing simple and
intuitive ways to express such queries---e.g., using a visual
interface, declarative query language, or keyword search---and
processing the queries automatically and efficiently using execution
plans. These plans use statistical (e.g., neural network) and
performance (e.g., queuing network) models learned from system data as
well as operators for data transformation (e.g., feature selection)
and inference. We have developed algorithms to navigate the huge plan
space comprising models, model-parameters, and transformations quickly
using techniques like cost estimation---estimating plan accuracy and
execution time using statistics---and active learning---executing
sample plans for learning purposes.
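As a rough illustration of the cost-estimation idea described above, the
following Python sketch enumerates a tiny plan space (model, one model
parameter, and a transformation), attaches an estimated accuracy and
execution time to each plan, and picks the most accurate plan that fits a
time budget. Every name and number here (Plan, MODEL_STATS,
TRANSFORM_STATS, the window parameter, the toy cost formulas) is a
hypothetical assumption made for illustration; it is not Ques's actual
plan space or cost model, and it does not cover the active-learning side.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    model: str      # e.g., "neural_net" or "queuing_net"
    window: int     # hypothetical model parameter: training-window size
    transform: str  # e.g., "none" or "feature_selection"


# Hypothetical statistics; a real system would derive such numbers from
# collected system data rather than hard-code them.
MODEL_STATS = {"neural_net": (0.90, 8.0), "queuing_net": (0.80, 2.0)}      # (accuracy, seconds)
TRANSFORM_STATS = {"none": (0.00, 0.0), "feature_selection": (0.03, 1.0)}  # (accuracy gain, seconds)
WINDOWS = (100, 200)


def estimate(plan: Plan) -> Tuple[float, float]:
    """Return (estimated accuracy, estimated execution time in seconds)."""
    base_acc, base_time = MODEL_STATS[plan.model]
    gain, extra_time = TRANSFORM_STATS[plan.transform]
    # Assumed toy effect: larger windows cost more time and help slightly.
    accuracy = min(1.0, base_acc + gain + 0.0001 * plan.window)
    seconds = base_time + extra_time + 0.01 * plan.window
    return accuracy, seconds


def choose_plan(time_budget: float) -> Optional[Plan]:
    """Pick the most accurate plan whose estimated time fits the budget."""
    best, best_acc = None, -1.0
    for model, window, transform in product(MODEL_STATS, WINDOWS, TRANSFORM_STATS):
        plan = Plan(model, window, transform)
        accuracy, seconds = estimate(plan)
        if seconds <= time_budget and accuracy > best_acc:
            best, best_acc = plan, accuracy
    return best
```

Calling choose_plan with a small budget returns None when no plan fits,
which a caller could treat as a signal to relax the budget or to shrink
the plan space.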
Ques-Control is an ambitious next step to Ques-Querying to
enable automated control of complex computing systems under changing
conditions, based on policies specified by system administrators.
Like Ques-Querying, Ques-Control learns models of system behavior from
data collected passively or through active perturbation. Given a set
of system policies P, Ques-Control derives a controller---an execution
plan based on sensing, actuation, and feedback---to enforce P at all times.
Ques-Control poses interesting challenges in policy-interface design,
acquiring the right training data to model specific system behavior
quickly, robustness to bursty workloads, and proactive system tuning.
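The sensing, actuation, and feedback cycle mentioned above can be
sketched as a simple threshold controller. The Python example below is
only a minimal illustration under stated assumptions: the policy ("keep
latency at or below a target"), the toy system model in
simulated_latency, and the scale-out and scale-in rules are all
hypothetical and are not the controllers that Ques-Control derives.

```python
def simulated_latency(load: float, servers: int) -> float:
    """Toy system model (an assumption): latency grows with load per server."""
    return load / (10.0 * servers)


def control_loop(load_trace, target_latency=1.0, servers=1, max_servers=64):
    """Enforce the policy 'latency <= target_latency' by resizing a server pool.

    Returns a list of (load, servers after actuation, latency sensed
    before actuation) tuples, one per time step.
    """
    history = []
    for load in load_trace:
        latency = simulated_latency(load, servers)  # sensing
        if latency > target_latency and servers < max_servers:
            servers += 1                            # actuation: scale out
        elif latency < 0.5 * target_latency and servers > 1:
            servers -= 1                            # actuation: scale in
        # The updated server count feeds back into the next measurement.
        history.append((load, servers, latency))
    return history
```

For a constant load of 45 units, this loop adds one server per step until
the sensed latency drops to 0.9 with five servers and then holds steady. A
bursty trace would make it oscillate, which is one reason robustness to
bursty workloads is listed as a challenge.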
Ques seeks to advance the state of the art in our ability to
understand and control computing systems in a number of ways:
- No current system-management product supports Ques's broad range of
queries or their combinations: health monitoring, anomaly detection,
diagnosis, forecasting, what-if, and recommendation. Furthermore,
Ques targets a comprehensive long-term solution for system management
by automating the generation of plans for executing queries and
enforcing policies. This approach requires extensive collection and
analysis of system configuration and activity data---e.g., performance
metrics, resource utilization, execution and stack traces, error
messages, workload, network packets, source code, and help
manuals---both passively and through controlled system perturbation.
- Busy data centers generate more than 1 terabyte of log data per
day. More fine-grained logging can make this volume 100x larger. To
query such massive time-varying datasets, Ques is pushing the envelope
of data-stream technology where data is modeled as continuous streams
and queried using a "you-get-one-look" approach. In addition, Ques
supports controlled data collection to balance the inherent
cost-accuracy tradeoff.
- Ques is removing technical barriers to automated policy-driven
control of computing systems. Recent industrial initiatives like
IBM's Autonomic Computing and Microsoft's Dynamic Systems Initiative
highlight the pressing need for such control.
- Current system-management products are usually of little help to
desktop users facing unexpected system slowdowns or misbehaving
applications. Ques makes system management accessible to the large
and diverse class of users and developers who administer their own
systems.
- Current system-management products have fairly rigid interfaces and
require a lot of system expertise to use. Ques is rethinking the
system-management interface for a broad spectrum of potential users.
As one example, system administrators need effective ways to input
domain knowledge, while desktop users prefer keyword queries on a
personalized engine for desktop and Internet search.
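The "you-get-one-look" style of stream processing mentioned in the list
above can be illustrated with a one-pass estimator. The Python sketch
below uses Welford's algorithm to maintain a running mean and variance in
constant memory and flags values that lie more than k standard deviations
from the mean seen so far. The class name, the threshold k, and the
warm-up length are assumptions chosen for illustration; this is a generic
technique, not a description of how Ques detects changes.

```python
import math


class RunningStats:
    """One-pass (Welford) mean and variance; each value is seen once."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self) -> float:
        """Sample standard deviation (0.0 until two values have been seen)."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_changes(stream, k=3.0, warmup=5):
    """Yield (index, value) for values more than k std devs from the mean."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        sd = stats.stddev()
        # Compare against statistics of earlier values only, then absorb x.
        if stats.n >= warmup and sd > 0 and abs(x - stats.mean) > k * sd:
            yield i, x
        stats.update(x)
```

Because the detector stores only three numbers per metric, it never needs
to revisit earlier log records, which is the property that matters when
the raw stream is too large to retain.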
We are committed to building a fully functional prototype of Ques
and deploying it in real-world settings. With each novel component of
Ques, we will: (i) perform the research and evaluation using a
prototype in a testbed setting, with both synthetic and real
applications and data, (ii) demonstrate the prototype at a leading
conference, (iii) make the demonstration available publicly on the
Internet, (iv) do a real-world deployment and user studies if there is
sufficient interest, and (v) release the source code publicly. The
effectiveness of Ques will be tested by deploying it to manage
workloads on a virtualized, service-oriented, and on-demand computing
platform on our departmental research-computing cluster. We have also
had encouraging preliminary discussions with the administrators of a
university-wide production cluster used heavily for
computational-science applications. We have established industrial
collaborations (IBM) with the eventual goal of transferring technology
from Ques to industrial-strength system-management products.
Project Members
- Shivnath Babu, Associate Professor, Duke Computer Science
- Botong Huang, Ph.D. Candidate, Duke Computer Science
- Harold Lim, Ph.D. Candidate, Duke Computer Science
- Jie Li, Ph.D. Candidate, Duke Computer Science
Collaborators
- Ashraf Aboulnaga, Associate Professor, Computer Science, University of Waterloo
- Jeff Chase, Professor, Duke Computer Science
- Brent Miller, Autonomic Computing Group, IBM
- Kamesh Munagala, Associate Professor, Duke Computer Science
- Sandeep Uttamchandani, IBM Almaden Research Center
- Jun Yang, Associate Professor, Duke Computer Science
Alumni
- Nick Bodnar, worked on Ques as an undergraduate student
- Nedyalko Borisov, worked on Ques as a Ph.D. student, first employment at Facebook
- Garrett Bressler, worked on Ques as a high-school student, joined Brown University
- Brian Cook, worked on Ques as an M.S. student, joined IBM
- Songyun Duan, worked on Ques as a Ph.D. student, first employment at IBM Research
- Peter Franklin, worked on Ques as an undergraduate student, first employment at Microsoft
- Herodotos Herodotou, worked on Ques as a Ph.D. student, first employment at Microsoft
- Jack Li, worked on Ques as an undergraduate student
- Piyush Shivam, worked on Ques as a Ph.D. student, first employment at Sun Microsystems
- Jonathan Su, worked on Ques as an undergraduate student
- Vamsidhar Thummala, Ph.D. Candidate, Duke Computer Science
- Dongdong Zhao, worked on Ques as an M.S. student
Publications
- N. Borisov and S. Babu.
Rapid Experimentation for Testing and Tuning a Production Database Deployment
In Proc. of the International Conference on Extending Database Technology (EDBT), March 2013
- H. Lim, Y. Han, and S. Babu.
How to Fit When No One Size Fits
In Proc. of the Sixth Biennial Conference on Innovative Data Systems Research (CIDR), January 2013
- R. Thonangi, S. Babu, and J. Yang.
A Practical Concurrent Index for Solid-State Drives
In Proc. of CIKM, October 2012
- H. Lim, H. Herodotou, and S. Babu.
Stubby: A Transformation-based Optimizer for MapReduce Workflows
In Proc. of PVLDB 5(11), August 2012
- H. Herodotou, F. Dong, and S. Babu.
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics
In Proc. of the ACM Symposium on Cloud Computing 2011 (ACM SOCC 2011), October 2011
- H. Herodotou and S. Babu.
Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
In Proc. of the 2011 Intl. Conference on Very Large Data Bases (VLDB), August 2011
- H. Herodotou, N. Borisov, and S. Babu.
Query Optimization Techniques for Partitioned Tables
In Proc. of the 2011 ACM Intl. Conf. on Management of Data (SIGMOD), June 2011
- H. Herodotou and S. Babu.
Xplus: A SQL-Tuning-Aware Query Optimizer
In Proc. of PVLDB Volume 3 (the International Conference on Very Large Databases (VLDB)), September 2010
- S. Babu.
Towards Automatic Optimization of MapReduce Programs
In Proc. of the ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010), June 2010
- H. Lim, S. Babu and J. Chase.
Automated Control for Elastic Storage
In Proc. of the Intl. Conference on Autonomic Computing (ICAC 2010), June 2010
- S. Duan, V. Thummala, and S. Babu.
Tuning Database Configuration Parameters with iTuned
In Proc. of the International Conference on Very Large Databases (VLDB), August 2009
- H. Herodotou and S. Babu.
Automated SQL Tuning through Trial and (Sometimes) Error
In Proc. of the Second Workshop on Testing Database Systems (DBTest), June 2009
- M. Ahmad, A. Aboulnaga, and S. Babu.
Query Interactions in Database Workloads
In Proc. of the Second Workshop on Testing Database Systems (DBTest), June 2009
- A. Demberel, J. Chase, and S. Babu.
Reflective Control for an Elastic Cloud Application: An Automated Experiment Workbench
In Proc. of the First Workshop on Hot Topics in Cloud Computing (HotCloud), in conjunction with USENIX Annual Technical Conference, June 2009
- H. Lim, S. Babu, J. Chase, and S. Parekh.
Automated Control in Cloud Computing: Challenges and Opportunities
In Proc. of the First Workshop on Automated Control for Datacenters and Clouds, June 2009
- S. Babu, N. Borisov, S. Duan, H. Herodotou, and V. Thummala.
Automated Experiment-Driven Management of (Database) Systems
In Proc. of the 12th Workshop on Hot Topics in Operating Systems (HotOS), May 2009
- S. Duan, S. Babu, and K. Munagala.
Fa: A System for Automating Failure Diagnosis (full paper)
In Proc. of 2009 IEEE International Conference on Data Engineering (ICDE), April 2009
- S. Babu, N. Borisov, S. Uttamchandani, R. Routray, and A. Singh.
DIADS: Addressing the "My-Problem-or-Yours" Syndrome with Integrated SAN and Database Diagnosis
In Proc. of the USENIX Conference on File and Storage Technologies (FAST), February 2009
- N. Borisov, S. Uttamchandani, R. Routray, and A. Singh.
Why Did My Query Slow Down?
In Proc. of the 2009 Conference on Innovative Data Systems Research (CIDR), January 2009
- S. Duan and S. Babu.
Empirical Comparison of Techniques for Automated Failure Diagnosis
In Proc. of the Third Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), December 2008
- M. Ahmad, A. Aboulnaga, S. Babu, and K. Munagala.
Modeling and Exploiting Query Interactions in Database Systems
In Proc. of ACM International Conference on Information and Knowledge Management (CIKM), October 2008
- R. Thonangi, V. Thummala, and S. Babu.
Finding Good Configurations in High-Dimensional Spaces: Doing More with Less
In Proc. of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), September 2008
- S. Babu.
Grand Challenge: Experiment-driven Adaptive Systems
Vision paper written for invitation to the third Workshop on Hot Topics in Autonomic Computing (HotAC III), June 2008
- P. Shivam, V. Marupadi, J. Chase, and S. Babu.
Cutting Corners: Workbench Automation for Server Benchmarking
In Proc. of the 2008 USENIX Annual Technical Conference, June 2008
- S. Duan and S. Babu.
Guided Problem Diagnosis through Active Learning
In Proc. of the International Conference on Autonomic Computing (ICAC), June 2008
- S. Babu, S. Duan, and K. Munagala.
Processing Diagnosis Queries: A Principled and Scalable Approach
Poster at the International Conference on Data Engineering (ICDE), April 2008.
- M. Ahmad, A. Aboulnaga, S. Babu, and K. Munagala.
QShuffler: Getting the Query Mix Right
Poster at the International Conference on Data Engineering (ICDE), April 2008.
- S. Duan and S. Babu.
Processing Forecasting Queries
In Proc. of the International Conference on Very Large Databases (VLDB), September 2007
- B. Chandramouli, C. Bond, S. Babu, and J. Yang.
Query Suspend and Resume
In Proc. of the 2007 ACM Intl. Conf. on Management of Data (SIGMOD), June 2007
- A. Yumerefendi, P. Shivam, D. Irwin, P. Gunda, L. Grit, A. Demberel, J. Chase, and S. Babu.
Towards an Autonomic Computing Testbed
In Workshop on Hot Topics in Autonomic Computing (HotAC), June 2007
- B. Cook, S. Babu, G. Candea, and S. Duan.
Towards Self-Healing Multitier Services
In Second Intl. Workshop on Self-Managing Database Systems (SMDB), April 2007
- B. Chandramouli, C. Bond, S. Babu, and J. Yang.
On Suspending and Resuming Dataflows (poster)
In Proc. of IEEE International Conference on Data Engineering (ICDE), April 2007
- P. Shivam, S. Babu, and J. Chase.
Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications
In Proc. of the International Conference on Very Large Databases (VLDB), September 2006
- P. Shivam, S. Babu, and J. Chase.
Active Sampling for Accelerated Learning of Performance Models
In Proc. of the First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), June 2006
- P. Shivam, S. Babu, and J. Chase.
Learning Application Models for Utility Resource Planning
In Proc. of IEEE International Conference on Autonomic Computing (ICAC), June 2006
- S. Babu, P. Bizarro, and D. DeWitt.
Proactive Re-optimization
In Proc. of the 2005 ACM Intl. Conf. on Management of Data (SIGMOD 2005), June 2005
The Rio system described in this paper was demonstrated at SIGMOD 2005, June 2005
- S. Babu and P. Bizarro.
Adaptive Query Processing in the Looking Glass
In Proc. of the Second Biennial Conference on Innovative Data Systems Research (CIDR), January 2005
Demonstrations
- H. Lim and S. Babu.
Execution and Optimization of Continuous Queries with Cyclops
Demonstrated at the 2013 ACM Intl. Conf. on Management of Data (SIGMOD 2013), June 2013
- H. Herodotou, F. Dong, and S. Babu.
MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish
Demonstrated at the 2011 International Conference on Very Large Data Bases (VLDB), August 2011
- V. Thummala and S. Babu.
A Tool for Configuring and Visualizing Database Parameters
Winner of the SIGMOD'10 Best-demo Award Competition!
Demonstrated at the 2010 ACM Intl. Conf. on Management of Data (SIGMOD 2010), June 2010
- S. Duan, P. Franklin, V. Thummala, D. Zhao, and S. Babu.
Shaman: A Self-Healing Database System
Demonstrated at the 2009 IEEE International Conference on Data Engineering (ICDE), April 2009
- S. Duan and S. Babu.
Automated Diagnosis of System Failures with Fa
Demonstrated at the 2009 IEEE International Conference on Data Engineering (ICDE), April 2009
- P. Shivam, A. Demberel, P. Gunda, D. Irwin, L. Grit, A. Yumerefendi, S. Babu, and J. Chase.
Automated and On-Demand Provisioning of Virtual Machines for Database Applications
Demonstrated at the 2007 ACM Intl. Conf. on Management of Data (SIGMOD 2007), June 2007
- S. Duan and S. Babu.
Proactive Identification of Performance Problems
Demonstrated at the 2006 ACM Intl. Conf. on Management of Data (SIGMOD 2006), June 2006
- S. Babu, P. Bizarro, and D. DeWitt.
Proactive Re-optimization with Rio
Demonstrated at the 2005 ACM Intl. Conf. on Management of Data (SIGMOD 2005), June 2005