POP/FED: Progressive Query Optimization for Federated
Queries in DB2
Holger Kache ...
remote data sources. After the query is optimized and the important parameter and also the one subject to th...
the re-use of intermediate results; an effect that we very
much want to utilize for federated queries to reduce
communicat...
5. The remote statement represented at SHIP(42) is 5. CONCLUSION
SELECT A1."N_NATIONKEY" FR...
of 4

POP/FED: Progressive Query Optimization for Federated Queries ...

Published on: Mar 4, 2016
Source: www.slideshare.net


Transcripts - POP/FED: Progressive Query Optimization for Federated Queries ...

  • 1. POP/FED: Progressive Query Optimization for Federated Queries in DB2 Holger Kache Wook-Shin Han, Volker Markl, Stephan Ewen Vijayshankar Raman IBM Silicon Valley Lab IBM Almaden Research Center IBM Boeblingen Lab 555 Bailey Avenue 650 Harry Road Schönaicher Str. 220 San Jose, CA, USA San Jose, CA, USA 71032 Böblingen, Germany kache@us.ibm.com {wookshin,marklv,ravijay} ewens@de.ibm.com @us.ibm.com ABSTRACT object in the remote data source. While in a non-federated Federated queries are regular relational queries accessing data on database, the query execution plan defines an access one or more remote relational or non-relational data sources, strategy for the local relational objects that reside in the possibly combining them with tables stored in the federated database where the plan was compiled, a federated query DBMS server. Their execution is typically divided between the execution also includes an access strategy for the objects in federated server and the remote data sources. Outdated and the remote data source. Based on the complete cost model, incomplete statistics have a bigger impact on federated DBMS the query optimizer will choose the optimal global query than on regular DBMS, as maintenance of federated statistics is unequally more complicated and expensive than the maintenance access plan with regards to the total query execution time. It of the local statistics; consequently bad performance commonly will consider different join strategies for the join of the occurs for federated queries due to the selection of a suboptimal remote and local data, different join orders, different points query plan. To solve this problem we propose a progressive to transfer data between data sources for multiple joins, optimization technique for federated queries called POP/FED by trade off between local and remote joins, and even push extending the state of the art for progressive reoptimization for down of predicates to a remote data source as opposed to local source queries, POP [4]. POP/FED uses (a) an local processing. opportunistic, but risk controlled reoptimization technique for federated DBMS, (b) a technique for multiple reoptimizations The federated query optimizer, however, does not influence during federated query processing with a strategy to discover the remote query access plan. It sends a SQL string redundant and eliminate partial results, and (c) a mechanism to representing the remote sub-statement to the remote data eagerly procure statistics in a federated environment. In this source only and the plan decision for the sub-statement is demonstration we showcase POP/FED implemented in a entirely left to the remote data source. The remote data prototype version of WebSphere Information Integrator for DB2 source, in turns, compiles the sub-statement and generates a using the TPC-H benchmark database and its workload. For selected queries of the workload we show unique features local access strategy for the objects that it owns. It may or including multi-round reoptimizations using both a new graphical may not employ a query optimizer to do that, depending on reoptimization progress monitor POPMonitor and the DB2 the nature and capabilities of the data source. The federated graphical plan explain tool. server retrieves the data returned by the remote data source to join it with data from another remote data sources or 1. INTRODUCTION local data stored at the federated server itself. At this point In a federated database, data is integrated from different in time it executes the federated query plan. remote data sources, without the requirement to replicate or otherwise copy the data to the federated database instance. 2. DB2 FEDERATED QUERIES Federated databases use references instead that point to the In DB2, federated queries are prepared and processed like objects living in the remote data source. They are called regular relational queries with a few extra steps specifically nicknames, or index specifications and point to a physical introduced for federated queries. In the query prepare phase, the query is first parsed and rewritten using semantic Permission to copy without fee all or part of this material is granted rules. During that phase, the query compiler determines provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and which quantifiers, columns, and predicates can be sent to its date appear, and notice is given that copying is by permission of the which remote data source and marks them with a push- Very Large Data Base Endowment. To copy otherwise, or to republish, to down flag. In the next phase, the query is optimized with post on servers or to redistribute to lists, requires a fee and/or special respect to that push-down flag using a cost-based optimizer. permission from the publisher, ACM. VLDB ‘06, September 12–15, 2006, Seoul, Korea. The plan costs are computed differently for the local parts Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09 of the plan and the parts that are to be processed by the 1175
  • 2. remote data sources. After the query is optimized and the important parameter and also the one subject to the gravest access strategy is clear, the query compiler generates the estimation error. It computes the validity range around it, SQL string for every sub-statement to be sent to a remote an interval that describes for which cardinality range the data source. During query execution, DB2 uses the common current plan is truly the optimal one. It then places CHECK APIs available for the remote data source to submit the operators at strategic points, which in turn validate during SQL statements and fetches the results back into local data plan execution that the actual cardinality, obtained from the structures. The results are retrieved and the federated server runtime monitor, is within the validity range. If this is not starts processing the local plan operators using the results the case, all intermediate results from fully materialized from the remote sources. In the final fetch phase, it delivers points are retained and the optimizer is called again. The the query results back to the requesting application. actual cardinalities from the aborted query execution are Cost based optimization of federated queries transparently made available to the optimizer so that it is able to develop extends optimization across data sources, by introducing a better plan, which is not subject to the estimation error communication cost, but otherwise treating remote tables that caused the reoptimization. Note that this makes POP similar to local tables and by introducing a source- or suitable for any source of cardinality estimation error, be it server property that describes where the processing of the bad statistics, wrong assumptions, or parameter markers. current plan operator happens. A special operator (SHIP) The retained intermediate results are treated as materialized describes the point in the QEP where intermediate results views, also called materialized query tables (MQT) or are communicated between a remote data source and the automatic summary tables in DB2 [6]. The optimizer has federated DBMS. The statistics that are used to estimate the cost based choice to match them back into the plan, cardinalities for remote base tables are in most cases enabling the query to basically continue from the point it obtained from the remote data source, since the gathering of was aborted for reoptimization, avoiding the re-execution of statistics on remote data is very expensive for the federated previously executed parts. Figure 2 shows an example of DBMS. The varieties of relational DBMSs, which can be a this process. remote source, employ different optimizers and utilize P P P different forms of statistics. Often, the federated server can Add CHECK Reoptimize NLJN NLJN MGJN only exploit very basic statistics about the number of rows in a table. The federated DBMS’s optimizer is hence not able to model data distribution and correlation in detail, as SHIP SHIP CHECK SHIP CHECK SCAN this would require distribution and multivariate statistics. 11(R) 12(S) 12(S) SORT SORT TEMP The worst cases are federated queries that access non- 11(R) relational remote data sources or remote DBMSs that do SHIP SHIP 11(R) 12(S) not employ a cost based optimizer. In those cases, there are no statistics on the remote data available at all and the Reuse optimizer is forced to derive its cardinality estimates from default values. Figure 1. POP reoptimizing a sub-optimal nested loop join for 2 federated sources. 3. POP/FED OVERVIEW The left side shows the initial plan for the example used 3.1 Concept above. 11(R) represents the sub-statement to read from the Progressive Optimization (POP) is a technique that breaks owner table and 11(S) the statement to read from the car with the fixed sequence of query prepare, execute, and table. During optimization, POP computes the validity fetch. It is a compromise between static optimization and ranges around the edges of the plan and places CHECK continuous dynamic optimization which allows us to operators at places that are suitable or performance critical. optimize a query again during query runtime. Especially for The CHECK operator, in this case with artificial federated queries, where the query execution phase is materialization, takes the validity range of its child edge as exceptionally complex and includes the remote query parameter. During runtime, it identifies whether the actual compilations, remote execution, and remote fetch phases, cardinality is within validity ranges, and triggers followed by the actual local execution, it makes sense to reoptimization if not. The optimizer uses knowledge about allow an additional query compile cycle. That compilation the actual cardinality to develop the new plan; the can now be based on actual cardinality values as discovered intermediate result is matched into the plan as a temporary during query runtime, rather than cardinality estimates. table (right side). [4] introduces different flavors of check operators for eager checking (tuple pipelines) and lazy During the initial query compilation, POP determines checking (full materialization points in a QEP). The current criterions for estimated parameters that are required to hold prototype supports only the lazy variant, which is also the if the plan is to be the optimal one. The current prototype preferable one for federated queries, as it solely supports uses only the estimated cardinality, which is the most 1176
  • 3. the re-use of intermediate results; an effect that we very much want to utilize for federated queries to reduce communication cost. Furthermore, so far no research has proposed a good way to determine the validity range for eager checkpoints, which has to consider the cost inherent to partial re-execution. 3.2 Multiple Reoptimizations For federated queries, the number of reoptimizations is commonly as high as the number of uncorrelated SHIP operators in the federated query plan, possibly higher if correlation on join predicates that span several SHIPs Figure 2. Screenshot for POPMonitor. occurs. One potential problem associated with multiple to the federated TPC-H queries demonstrates its ability to rounds of reoptimization is the stockpiling of partial results, improve the quality of the federated query plans. POP does as each iteration introduces new temporary tables. POP is not interfere with the remote plan compilation and it is not forced to reuse partial results but rather performs the apparent that only the local portion of the plan can be decision to reuse them on a cost base. Through this optimized. Therefore, we will demonstrate queries that have mechanism, it occurs that POP ignores partial results but complex local plans (e.g. complex joins) and put a high reconsiders them after another round of reoptimization or load on the federated server. decides to fall back to another partial result; this happens especially when new knowledge that was added in the The story-line of the demonstration is the following. course of another reoptimization compensated for 1. We start with a partially loaded TPC-H Oracle database. correlation on join predicates. It is consequently dangerous Only the PART, SUPPLIER, CUSTOMER, NATION, and and regressive to throw away partial results as soon as POP REGION tables are loaded and statistics are updated for the does not consider them during a reoptimization. tables and indexes. The missing LINEITEM and ORDERS tables are loaded into a DB2 database, which is enabled as a POP/FED provides a technique for multiple reoptimizations federated database. with a strategy to discover redundant and eliminate partial results. Dropping redundant partial results here ensures that 2. Federated connection to the remote Oracle database is set up as documented in [3], and nicknames to the Oracle tables are the DBMS processes the query, and also other concurrently created in the local schema ‘ORA’. Since we created the running queries, with the maximum possible temporary nicknames after we populated the data into the Oracle TPC- storage space. However, when we drop a redundant partial H database, we will automatically pick up the correct results, we keep statistics as a virtual statistical view. federated statistics for the underlying Oracle base tables. Figure 2 shows a screenshot for POPMonitor, which is a 3. To demonstrate a number of capabilities of POP for new graphical progress monitor tool for the demo. In each federated queries, we run TPC-H query 8 using the Oracle round of reoptimization, POPMonitor 1) invokes the db2 and DB2 tables. graphical explain tool showing the reoptimized plan, and 2) 4. The initial plan compiled by the federated query compiler displays partial results maintained as virtual materialized uses a local hash join for the results of the remote views and statistical views. With POPMonitor, one is able NATION/REGION join and the remote CUSTOMER table. to understand how POP/FED reoptimizes plans and That is the cheapest option only if CHECK(40) returns at maintains partial results throughout multiple rounds of least 25 rows. reoptimization. As shown in this figure, POPMonitor also shows QGM information for a given virtual materialized view using a pop-up window. 4. DEMONSTRATION We demonstrate POP for federated queries with the TPC-H workload as defined by the Transaction Processing Performance Council (TPC). The TPC-H tables are distributed amongst an Oracle database and a DB2 database. The federated queries are executed using the IBM Information Integrator product, which implements the DB2 federated query processing functionality. POP/FED applied 1177
  • 4. 5. The remote statement represented at SHIP(42) is 5. CONCLUSION SELECT A1."N_NATIONKEY" FROM POP/FED, an extension of POP for a single local source "TPCH"."REGION" A0, "TPCH"."NATION" A queries, is a powerful technique for progressively WHERE (A0."R_NAME" = 'AMERICA') AND reoptimizing federated queries. The problem of incorrect or (A1."N_REGIONKEY" = A0."R_REGIONKEY") incomplete statistics is far greater for federated queries than and returns 5 rows. At CHECK(40) POP/FED interrupts for non-federated queries. The federated query compiler has query execution and creates a virtual materialized view to make assumptions about the complexity and costs of consisting of 2 quantifiers (NATION, REGION) and 2 remote statements without actual knowledge about the predicates (R_NAME, N_REGEIONKEY). remote query plans. Thus, the federated query plan faces 6. Reoptimization is triggered and a POP/FED proposes a new the danger of being sub-optimal. POP/FED reoptimizes plan that pushes down the 3 table join between Oracle queries for an arbitrary number of times and avoids wasting NATION, REGION, and CUSTOMER tables. storage space by analyzing partial results for redundancy and cleaning up after each reoptimization. For federated queries that were optimized with little knowledge, early materialization reorders the subplans in a way that data access, in the federated case access to the remote results, is done prior to the actual plan execution. It provides knowledge about actual cardinalities earlier and reduces number of reoptimizations. Through a more evenly provided knowledge, the optimizer runs less risk of getting into a plan bias. The materialized view created in 6 is not used because the quantifiers and predicates don’t match for the 3 table join. 6. REFERENCES 7. During execution of the new plan, a second reoptimization is [1] Aboulnaga, A., Haas, P., Lightstone, S., Lohman, G., Markl, triggered by another CHECK point in the plan. POP/FED V., Popivanov, I., and Raman, V. Automated Statistics materializes the new results for the 3 table join into another Collection in DB2 Stinger, Proc. VLDB 2004. virtual materialized view [2] Ewen, S., Kache, H., Markl, V., and Raman, V., Progressive CREATE VMV 11bdf1d8 QTB(5) OPR(27) Query Optimization for Federated Queries. Proc. EDBT nQuns(3) nPreds(3) nQncs(1) Quns(9 10 12 2006 (accepted). ) Prds(6 7 8 ) Qncs(43 ) [3] IBM DB2 Information Integrator Federated Systems Guide Version 8.2, IBM Corp 2004. 8. The third plan proposed by POP/FED reuses the results [4] Mark, V., Raman, V., Simmen, G., Lohman, G., Priahesh, materialized in step 7. The quantifiers and predicate lists H., and Cilimdzic, M. Robust Query Processing through match and the cost of computing the result suggests the use Progressive Optimization. Proc. ACM SIGMOD 2004. of the materialized virtual view. [5] Stillger, M., Lohman, G., Markl, V., and Kandil, M. LEO – DB2’s Learning Optimizer, Proc. VLDB 2001. [6] Zaharioudakis, M., Cochrane, R., Lapis, G., Pirahesh, H., and Urata, M. Answering Complex SQL Queries Using Automatic Summary Tables, Proc. ACM SIGMOD 2000. 1178

Related Documents