Scribed Notes (Scriber: Emma Buneci)
Paper Title: Automated Statistics Collection in DB2 UDB
Authors: A. Aboulnaga, P. Haas, S. Lightstone, G. Lohman, V. Markl, I. Popivanov, V. Raman
Presenter: Ranjith Vasireddy

=== Discussion on availability
- What is the impact of the optimizer on the availability of the system?
- The paper claims that collection happens during the maintenance window; 2160 seconds = 36 minutes of downtime per year ==> not a very available system.
- Experimental criticism: a simple system with 4 tables and a controlled workload may not be enough to show that five 9s can be achieved.

=== Discussion on trusting a complex model
- The customers in this paper are system administrators, and the problem is that you cannot take a Markov chain and explain it to a system administrator.
- It is a question of trusting a complex model.

=== Motivation of the work
- Keep and update statistics about tables.
- Why is what they are doing non-trivial?
  1) Keeping row counts of tables up to date is easy.
  2) The hard part is keeping quantiles and other information about the distribution of the data up to date (see the first sketch at the end of these notes).

=== Discussion on query optimization
- Given the right statistics, does the query optimizer choose the right plan?
- Answer: It is a very large space and collecting statistics is hard; there are also other models in the optimization engine that could be incorrect, so this is still an open problem.

=== Discussion on current DBA practices
- How do current DBAs choose the frequency and length of statistics collection?
- Answer: They learn rules of thumb over time; this does not scale.

=== Discussion on technical issues
- One counter is not enough to say for sure whether the actual statistics have changed or not; that is why they have a change analyzer.
- Why are the histograms always monotonically increasing?
  - Because the predicates are monotonically increasing, the histograms for R and S are monotonically increasing: the predicate A<=20 includes every row with A<=10 (see the second sketch at the end of these notes).
- Scheduling statistics collection: run QFA in the morning. Between each pair of runs you can compute how much the statistics have changed; dividing the change by the elapsed time (e.g., 24 hours) gives the rate of change (see the third sketch at the end of these notes).
- Experiments (Figures 10 and 11): why does query 10 do better in Figure 11 than in Figure 10? Their algorithm has simply learned from the previous run that it needs to optimize.
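
=== Illustrative sketches
- First sketch: the "counts are easy, distributions are hard" point from the motivation. The Python below is only an illustration of the contrast, not DB2's implementation: an exact row count needs one increment per insert, while a quantile has to be re-estimated from the data (here via a reservoir sample), which is why a single counter cannot by itself say whether the distribution statistics are stale.

```python
import random

class TableStats:
    """Toy contrast (not DB2's code): exact row count vs. sampled quantile."""

    def __init__(self, sample_size=1000):
        self.row_count = 0        # exact: one increment per inserted row
        self.sample_size = sample_size
        self.sample = []          # reservoir sample of one column's values
        self.seen = 0

    def on_insert(self, value):
        self.row_count += 1       # the easy statistic
        self.seen += 1
        # Reservoir sampling keeps a uniform sample so quantiles can be
        # re-estimated, but the estimate silently drifts as the data
        # distribution changes -- hence the need for separate change detection.
        if len(self.sample) < self.sample_size:
            self.sample.append(value)
        else:
            j = random.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = value

    def quantile(self, q):
        """Approximate q-quantile (0 <= q <= 1) of the sampled column."""
        if not self.sample:
            return None
        s = sorted(self.sample)
        return s[min(int(q * len(s)), len(s) - 1)]

stats = TableStats()
for v in range(10_000):
    stats.on_insert(v)
print(stats.row_count, stats.quantile(0.5))   # 10000, roughly 5000
```

  Keeping row_count current is one increment per row change; deciding when quantile(0.5) no longer reflects the table is exactly the change-detection problem the change analyzer addresses.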
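
- Second sketch: why the histogram values in the figures are monotonically increasing. The numbers below are made up; the point is only that the cardinality of a range predicate A <= b is a cumulative sum over disjoint buckets, so it can only grow as the bound b grows.

```python
# Rows of R falling in each disjoint range of column A (made-up counts).
bucket_counts = {(0, 10): 100, (10, 20): 40, (20, 30): 25}

# The cardinality of A <= b is the running (cumulative) sum of the buckets,
# so it is monotone in b: A <= 20 includes every row counted for A <= 10.
cumulative = []
total = 0
for (_, upper), count in sorted(bucket_counts.items()):
    total += count
    cumulative.append((upper, total))

print(cumulative)   # [(10, 100), (20, 140), (30, 165)]
```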
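
- Third sketch: the rate-of-change arithmetic from the scheduling discussion. This is an illustration with invented numbers, not code from the paper: the change in a statistic between two collection runs, divided by the elapsed time, gives a per-hour rate of change.

```python
from datetime import datetime

def rate_of_change(old_value, new_value, old_time, new_time):
    """Per-hour rate of change of a statistic between two collection runs."""
    elapsed_hours = (new_time - old_time).total_seconds() / 3600.0
    if elapsed_hours <= 0:
        raise ValueError("runs must be ordered in time")
    return (new_value - old_value) / elapsed_hours

# Two runs 24 hours apart; the row-count estimate moved from 1.0M to 1.2M
# (illustrative numbers only), giving (200,000 rows / 24 h) ~= 8,333 rows/hour.
r = rate_of_change(1_000_000, 1_200_000,
                   datetime(2004, 3, 1, 6, 0),
                   datetime(2004, 3, 2, 6, 0))
print(f"{r:.0f} rows/hour")   # -> 8333 rows/hour
```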