IBM: LEO Query Optimizer

1. Explaining Figure 2 in the paper: for a join operation, any of the candidate plans (nested-loop join, hash join, sort-merge join) may be more efficient than the others, depending on the properties of the tables and of the query results. If most tuples satisfy a predicate, a full table scan can be more efficient than an index scan, because the full scan uses sequential disk access while the index scan requires random access (a toy cost comparison is sketched at the end of these notes). For efficiency, tuples that are similar under the query conditions should be placed together on disk to save disk accesses; the paper "Weaving Relations for Cache Performance" addresses this issue.

2. Statistics: e.g., histograms (equal-width or equal-height; sketched below), quantiles; indexes, index ANDing, materialization. Cost function: cost(stats, plan, resources). The optimizer evaluates the cost of the candidate plans based on the statistics (actual or estimated) and the assigned resources, then chooses the least expensive plan for execution. However, the statistics may change while the query is processed, and the optimizer's assumptions (uniformity, inclusion, independence) may not hold, which makes the cost estimates inaccurate, so the selected execution plan may be suboptimal. Statistics are not updated instantly. Assumptions made by the optimizer that fail in some cases:
- Uniformity: the data may be skewed instead.
- Inclusion: every join value on one side is assumed to find a match on the other side.
- Independence: this assumption is routinely used for selectivity (cardinality) estimation, but it fails in the following example (also sketched below):

    SELECT Maker, Model FROM CarDatabase WHERE Maker = 'Honda' AND Model = 'Accord'

Suppose the selectivity of Maker = 'Honda' is 1/20 and the selectivity of Model = 'Accord' is 1/100. Because every Accord is made by Honda, the actual selectivity of the conjunction is 1/100, but the optimizer chooses the execution plan with the independence-based estimate 1/20 * 1/100 = 1/2000. As the query is processed the statistics are updated, the actual selectivity is found to be far from the estimate, and query processing is reoptimized.

3. Optimizer: inter-query and intra-query optimization. Issues:
(1) Stability and convergence.
(2) Detecting a problem, e.g., thrashing: in Join(a, b, c, d) the join order may keep changing as the query is processed, because each plan is selected with incomplete (and therefore inaccurate) knowledge.
(3) Deciding whether to respond to a problem.
(4) Dealing with bias.

ORACLE ADDM:

4. DB-time: a common currency that measures the effect each component has on the whole response time. The DB-time graph has two dimensions: a resource dimension and a phase dimension.

Relationship and differences between the two papers: the IBM paper uses a feedback mechanism to improve query-processing efficiency and can work in an online fashion, while the Oracle paper diagnoses the whole transaction-processing system and works in an offline fashion.
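To make the full-scan vs. index-scan tradeoff from point 1 concrete, here is a minimal cost sketch. The page counts and per-I/O costs are invented for illustration and are not from the paper:

```python
# Toy cost model for the full-scan vs. index-scan tradeoff in point 1.
# All constants (pages, tuples per page, I/O costs) are made up.

def full_scan_cost(num_pages, seq_io_cost=1.0):
    # Sequential access: read every page once.
    return num_pages * seq_io_cost

def index_scan_cost(num_tuples, selectivity, rand_io_cost=10.0):
    # Random access: in the worst case, one page fetch per qualifying tuple.
    return num_tuples * selectivity * rand_io_cost

num_pages, tuples_per_page = 1_000, 100
num_tuples = num_pages * tuples_per_page

for sel in (0.0001, 0.001, 0.01, 0.1):
    fs = full_scan_cost(num_pages)
    ix = index_scan_cost(num_tuples, sel)
    winner = "index scan" if ix < fs else "full scan"
    print(f"selectivity={sel:6.4f}  full={fs:8.0f}  index={ix:8.0f}  -> {winner}")
```

With these toy constants the index scan only wins for very selective predicates; once a sizable fraction of tuples qualifies, the sequential full scan is cheaper, exactly the point made in the notes.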
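For the equal-width vs. equal-height histograms mentioned in point 2, a minimal sketch over a skewed column; the data and bucket count are made up, and real systems store richer per-bucket statistics:

```python
# Minimal sketch of the two histogram flavors from point 2, built over a
# skewed column to show why equal-height handles skew better.

def equal_width(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return counts  # same value range per bucket, varying counts

def equal_height(values, num_buckets):
    s = sorted(values)
    size = len(s) // num_buckets
    # Bucket boundaries chosen so each bucket holds ~the same number of rows.
    return [s[min((i + 1) * size, len(s) - 1)] for i in range(num_buckets)]

# Skewed data: most values are small, a few are large.
data = [1] * 80 + [2] * 10 + list(range(3, 13))
print("equal-width counts:     ", equal_width(data, 4))    # [91, 3, 3, 3]
print("equal-height boundaries:", equal_height(data, 4))   # [1, 1, 1, 12]
```

Under skew the equal-width counts pile up in one bucket, while the equal-height boundaries adapt to the distribution; this is why equal-height (equi-depth) histograms give better selectivity estimates when the uniformity assumption fails.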
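Finally, the Honda/Accord example from point 2 worked end to end, together with the kind of estimated-vs-actual adjustment LEO feeds back. The toy table and the adjustment bookkeeping are illustrative, not LEO's actual implementation:

```python
# Independence-assumption failure from point 2, plus a LEO-style
# adjustment factor. Table contents are invented to match the notes.

cars = [("Honda", "Accord")] * 10 + [("Honda", "Civic")] * 40 + \
       [("Toyota", "Camry")] * 950  # 1000 rows total

n = len(cars)
sel_maker = sum(1 for m, _ in cars if m == "Honda") / n     # 50/1000 = 1/20
sel_model = sum(1 for _, md in cars if md == "Accord") / n  # 10/1000 = 1/100

estimated = sel_maker * sel_model  # 1/2000, under independence
actual = sum(1 for m, md in cars if m == "Honda" and md == "Accord") / n  # 1/100

print(f"estimated selectivity: {estimated}")  # 0.0005
print(f"actual selectivity:    {actual}")     # 0.01

# LEO-style feedback: remember the ratio of actual to estimated cardinality
# and apply it the next time this predicate combination is estimated.
adjustment = actual / estimated               # 20x under-estimate
print(f"adjustment factor: {adjustment:.0f}x")
print(f"corrected estimate: {estimated * adjustment}")
```

The 20x gap between the estimate and the actual selectivity is exactly the signal that triggers the correction (and, in the intra-query setting of point 3, the reoptimization) discussed in the notes.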