DYNAMIC DATABASE INTEGRATION IN A JDBC DRIVER
Terrence Mason
Iowa Database and Emerging Applications Laboratory, Computer Science
University of Iowa
Ramon Lawrence
Iowa Database and Emerging Applications Laboratory, Computer Science
University of Iowa
Keywords:
integration, database, schema, metadata, annotation, evolution, dynamic, JDBC, conceptual, embedded.
Abstract:
Current integration techniques are unsuitable for large-scale integrations involving numerous heterogeneous
data sources. Existing methods either require the user to know the semantics of all data sources or they
impose a static global view that is not tolerant of schema evolution. These assumptions are not valid in
many environments. We present a different approach to integration based on annotation. The contribution is
the elimination of the bottleneck of global view construction by moving the complicated task of identifying
semantics to local annotators instead of global integrators. This allows the integration to be more automated,
scalable, and rapidly deployable. The algorithms are packaged in an embedded database engine contained
in a JDBC driver capable of dynamically integrating data sources. Experimental results demonstrate that the
Unity JDBC driver efficiently integrates data located in separate data sources with minimal overhead.
1 INTRODUCTION
Large-scale integration involving numerous heteroge-
neous databases is limited by imposing a static global
view or requiring users to know the semantics of the
integrated data sources. A robust integration system
will dynamically update the global schema represent-
ing the integrated data sources. Our approach pushes
the most challenging task of integration, identifying
semantics, to local annotators instead of global inte-
grators. The system automates the tasks of matching
and resolving structural conflicts. Integrators are only
responsible for annotating local data with the neces-
sary context so that it may be integrated into the over-
all system. Annotation allows related concepts to be
identified and referenced using accepted terminology.
The advantage of matching on schema annotations
rather than the schemas themselves is that the over-
head of human interaction is limited to gathering
semantics at each database individually rather than
identifying relationships during the matching process.
This is important as the annotation only needs to be
performed once per database, as opposed to having
human involvement during every match operation.
This reduces the requirement that global integrators
fully understand the semantics of every schema.
The success of the integration rests on being able to
construct standardized annotation that can be matched
across systems. A common complaint against such an
approach is that users will not adopt standards. Al-
though this may be the case in some environments, it
is not universally true. There is an increasing trend
in organizations to standardize common terms, fields,
and database elements, especially in the medical and
scientific domains. For example, the National Can-
cer Institute Center for Bioinformatics (NCICB, http://ncicb.nci.nih.gov) has
developed a standardized ontology called Enterprise
Vocabulary Services (EVS) (Covitz et al., 2003) for
common terms in the cancer domain. NCICB has also
created a Cancer Data Standards Repository (caDSR)
to standardize the data elements used for cancer re-
search. As part of a national initiative to build a grid
of cancer centers, individual centers must conform
their data to be compliant with Common Data Ele-
ments (CDEs) and EVS. This compliance is achieved
by annotating their data sources with the accepted ter-
minology. Thus, it is not uncommon to require a stan-
dard for participants who wish to share and integrate
their data. Once local sources are annotated to con-
form to the standards, the global-level matching is
considerably simplified because concepts will match
across databases if they agree on the particular stan-
dardized term or data element.
Unity enables the dynamic integration of large
numbers of data sources that are annotated accord-
ing to a reference vocabulary. The architecture is ca-
pable of handling large-scale integrations in evolving
environments, where the specific databases participat-
ing in the federation change frequently and their local
schemas evolve over time. In addition, the driver’s
database engine integrates distributed data sources
without requiring middleware or database server sup-
port and allows programmers simplified access to in-
tegration algorithms.
A lightweight integration system is implemented
inside a JDBC driver to provide a standard interface
for querying heterogeneous databases. Integration
is supported with an embedded database engine that
joins data located in multiple data sources into a sin-
gle result. The JDBC standard allows the execution
of queries in a general programming environment by
providing library routines which interface with the
database. In particular, JDBC has a rich collection
of routines which make the interface simple and intu-
itive. The result is increased portability and a cleaner
client-server relationship. For an integration system
in an evolving environment to succeed, it will require
an automatic and scalable approach with a standard
interface to access the data.
The challenge of integration is to build a platform-
independent system that supports automated con-
struction of an integrated view, handles schema evo-
lution, and allows the dynamic addition and deletion
of sources. Our contribution is an integration archi-
tecture that has several unique features:
- A method for annotating schemas using reference ontologies and automatically matching annotated schemas to produce a global view.
- A simple conceptual query language that reduces the complexity often found in SQL queries.
- Automatic determination of the necessary joins and relations in each data source and across data sources.
- A system for tracking data provenance and detecting inconsistent data across databases.
- Automatic updating of the global view to handle the addition or deletion of data sources.
- Algorithms implemented in the standard JDBC interface.
The organization of this paper is as follows. Sec-
tion 2 reviews existing integration systems. In Sec-
tion 3, the JDBC driver implementation and its architecture are described. Section 4 details performance
experiments that show the JDBC implementation has
minimal overhead for integrating data sources. The
paper closes with future work and conclusions.
2 PREVIOUS WORK
Several data integration prototypes (Goh et al., 1999;
Kirk et al., 1995; Li et al., 1998) have been developed.
In the literature, there are two basic types of data inte-
gration systems (Halevy, 2001; Ullman, 1997): global
as view (GAV) systems and local as view (LAV)
systems. The difference between the two types is
how the mappings are expressed between the global
view and the source. Although schema matching
systems (Rahm and Bernstein, 2001) semi-automate
mapping discovery between the global view and the
sources, they have proven less successful in construct-
ing the global view itself. While progress contin-
ues to be made using a reference ontology (Dragut
and Lawrence, 2004), global view construction and
its evolution remains an open issue.
Data integration systems (Lenzerini, 2002) require
a global view to be constructed before local sources
are integrated into the federation. The bottleneck
in the integration process is the construction of the
global schema, as schemas do not contain enough in-
formation to allow conflicts between data representa-
tions to be automatically identified and resolved. Cur-
rent approaches to integration (Lenzerini, 2002) as-
sume static schemas, data sources, and global level
knowledge to build the integrated view. In prac-
tice, integrators have limited knowledge of the data
sources. Systems must be integrated quickly, and the
scale of integration makes defining global relations
challenging.
Many commercial systems are based on federations
(Sheth and Larson, 1990) such as federation in DB2
(Haas et al., 2002). These systems allow the shar-
ing of schemas across databases but require users to
perform the integration through query formulation.
Federation is not desirable for large-scale integrations
because of the complexity in writing queries. Such
queries are susceptible to evolution and require the
user to understand the semantics of all schemas.
Ontologies have been used for integration before (Collet et al., 1991; Decker et al., 1998).
These systems require sources to completely commit
to a global ontology and manually map all of their
data to the ontology. Since ontologies have more
powerful modeling constructs, their construction is
more challenging, and it becomes an issue whether the local sources will commit to using such an ontology. It
remains challenging to construct and refine a global
ontology. Unity does not require complete confor-
mance to an ontology, but only a sharing of standard
terms for common elements. No previous integration
system has been deployed in a standard API such as
JDBC.
3 INTEGRATION ARCHITECTURE
The unique features and algorithms of Unity are im-
plemented in a JDBC driver allowing programs to
transparently query one or more databases. The abil-
ity to integrate databases without additional software
makes it easier and faster to integrate systems in dy-
namic environments. The driver isolates the com-
plexities of integration from the application program-
mer by automatically building an integrated view and
mapping conceptual queries to SQL. While the sys-
tem is deployed in the Unity JDBC driver (Figure 1),
the algorithms are not specific to a JDBC implemen-
tation and can be deployed using other technologies.
The TPC-H (http://www.tpc.org/tpch) schema is partitioned into two schemas
(Figure 2) to provide an example integration scenario
for this paper. A program invoking standard JDBC
method calls will be used to describe the Unity JDBC
Driver system architecture and integration algorithms.
The functions are exactly the same as those in a single
source JDBC driver. The difference is that the Unity
driver allows the programmer to transparently access
multiple sources instead of just one.
Figure 1: JDBC Integration Driver Architecture
3.1 Building the Global Schema
Given the importance of names and text descriptions
in schema matching, it is valuable for schemas to
be annotated before they are integrated. Our sys-
tem uses annotation to build the global view in a
bottom-up fashion. A data source administrator an-
notates a schema with meaningful names (possibly
Part Database
part(p_partkey, p_name, p_mfgr)
supplier(s_suppkey, s_name, s_nationkey)
partsupp(ps_partkey, ps_suppkey)
nation(n_nationkey, n_name, n_regionkey)
region(r_regionkey, r_name)
Order Database
customer(c_custkey, c_name, c_nationkey)
orders(o_orderkey, o_custkey, o_orderdate)
lineitem(l_orderkey, l_partkey, l_suppkey, l_linenum, l_qty)
nation(n_nationkey, n_name, n_regionkey)
region(r_regionkey, r_name)
Figure 2: Database Schemas
using an ontology such as EVS) and exports the anno-
tated schema in an XML document. The JDBC driver
loads the individual annotations, matches names in
the annotations to produce an integrated view, and
then identifies global keys for use in joins across data-
bases.
In the cancer domain (discussed in Section 1),
matching on annotations is trivial as each common
data element has a unique id and name. In other do-
mains, the matching may be more complex, especially
if standard conformance is loosely enforced. It is im-
portant to emphasize that the annotation, not the local
or global schemas, is built according to an accepted
standard. Unlike ontology-based systems, sources do
not commit to a standard ontology/schema, but agree
to annotate their existing schemas in a standardized
fashion. The global schema is built by matching an-
notations in a bottom-up fashion and will contain only
the concepts present in the sources.
Figure 3 shows a partial annotation of the Part re-
lation in the Part database of Figure 2. The field
p_partkey is annotated as Part.Id in the global schema.
Two fields match if they have the same semantic
name. This provides the mechanism for matching
fields with the same domains across multiple data-
bases. The primary and foreign keys are used to
identify potential joins. The annotation provides the
global attribute names for the query language. This
flexible, scalable, and unobtrusive approach suc-
ceeds as a site must annotate only local schemas so
that they can be integrated with other systems.
The global schema produced by matching the Part
database with the Order database is shown in Figure
4. The global schema is a universal relation of all
attributes in the underlying schemas and is produced
by taking the attributes of each schema and merging
them into the universal view. Since attribute domains
are uniquely identified by the annotation, this merge
process is completely automatic.
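To make the merge step concrete, the following is a minimal Java sketch of the bottom-up matching, using hypothetical types (the driver's internal classes are not published in this paper): fields match when they share a semantic name, and the result is independent of the order in which sources are merged.

import java.util.*;

// Minimal sketch of bottom-up annotation matching; types are
// illustrative, not the Unity driver's actual implementation.
public class AnnotationMergeSketch {

    record AnnotatedField(String semanticName, String sourceField) {}

    public static void main(String[] args) {
        List<AnnotatedField> partDb = List.of(
            new AnnotatedField("Part.Id", "partdb.part.p_partkey"),
            new AnnotatedField("Part.Name", "partdb.part.p_name"));
        List<AnnotatedField> orderDb = List.of(
            new AnnotatedField("LineItem.Part.Id", "orderdb.lineitem.l_partkey"),
            new AnnotatedField("Customer.Name", "orderdb.customer.c_name"));

        // The global view maps each semantic name to every physical
        // field annotated with it; merging is order-independent.
        Map<String, List<String>> globalView = new TreeMap<>();
        for (List<AnnotatedField> schema : List.of(partDb, orderDb))
            for (AnnotatedField f : schema)
                globalView.computeIfAbsent(f.semanticName(), k -> new ArrayList<>())
                          .add(f.sourceField());

        globalView.forEach((concept, fields) ->
            System.out.println(concept + " -> " + fields));
    }
}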
Mismatches are possible at the global level if the
annotators assign an incorrect term to their attribute.
Note that we expect there to be some errors during the
<TABLE>
 <semanticTableName>Part</semanticTableName>
 <tableName>part</tableName>
 <FIELD>
  <semanticFieldName>Part.Name</semanticFieldName>
  <fieldName>p_name</fieldName>
  <dataTypeName>varchar</dataTypeName>
  <fieldSize>55</fieldSize>
 </FIELD>
 <FIELD>
  <semanticFieldName>Part.Id</semanticFieldName>
  <fieldName>p_partkey</fieldName>
  <dataTypeName>int</dataTypeName>
  <fieldSize>10</fieldSize>
 </FIELD>
 <PRIMARYKEY>
  <keyScope>5</keyScope>
  <keyScopeName>Organization</keyScopeName>
  <FIELDS>
   <fieldName>p_partkey</fieldName>
  </FIELDS>
 </PRIMARYKEY>
</TABLE>
Figure 3: Partial Annotation of Part Table
automatic matching because annotations may not be
perfect. The global view is iteratively refined to cor-
rect such errors. However, the system is resistant to
such errors as either a match goes undiscovered (cre-
ates two attributes in the global view instead of one)
or a single attribute in the global view is used for two
or more distinct attribute domains. When these errors
are detected at the global level, they are resolved by
modifying the local annotations. The result of merg-
ing annotations is independent of the order that they
are merged. Note that the M-N relation PartSupp is
abstracted from the global view as it will only be used
during the local inference process.
Part.Id, Part.Name, Part.Manufacturer
Supplier.Id, Supplier.Name, Supplier.Nation.Id
Order.Id, Order.Customer.Id, Order.Date
LineItem.Linenumber, LineItem.Order.Id
LineItem.Quantity, LineItem.Part.Id, LineItem.Supplier.Id
Customer.Id, Customer.Name, Customer.Nation.Id
Nation.Id, Nation.Name, Nation.Region.Id
Region.Id, Region.Name
Figure 4: Global Schema
Note that the Unity JDBC driver could be used
even when no annotation is performed and conceptual
querying is not used. In this case, all relations from
all databases are imported into the global view but
not matched. Integration occurs when users explic-
itly state the joins in their queries. Thus, at the lowest
level, the JDBC driver functions as a standard fed-
erated system allowing distributed access to the data
sources. However, its true benefit is abstracting away
the challenges of building joins and matching schema
constructs manually.
3.2 Embedded Integration Engine
A lightweight database engine is developed and em-
bedded in the JDBC driver implementation. It is re-
quired to parse the conceptual query, build a logical
query plan, and then execute the query plan. The op-
timizer selects the join method and a join ordering
to connect data across the data sources. The execu-
tion engine is capable of handling data sets larger than
main memory by filtering and joining them efficiently
using secondary storage at the client.
3.2.1 Making Connections
In the sample application (Figure 5), the URL speci-
fied in Line 1 references the sources.xml file con-
taining information on each data source to be inte-
grated. In Line 6, after registering the driver, the first
task is to specify which sources to use. This is done
using a sources XML file that contains source infor-
mation such as the server, connection protocol, and
annotation. This information is used by the Unity
JDBC driver to connect to individual sources. The
programmer may dynamically change which sources
are included in the integration. When a connection
is made, two things happen. First, the Unity JDBC
driver uses the configuration information to make a
connection to each data source specified. Second, the
integration algorithm is run to match and merge the
schema annotations in the XML files. The output af-
ter making a connection is an integrated, global view.
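The paper describes the contents of the sources file (server, connection protocol, annotation) but not its exact format. A hypothetical sources.xml for the two example databases might look like the following; all element names, driver classes, and URLs are illustrative assumptions:

<!-- Hypothetical sources.xml: element names are illustrative, not
     the driver's actual format. Each source lists its connection
     details and the annotation exported by its administrator. -->
<SOURCES>
 <SOURCE>
  <name>PartDB</name>
  <driver>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver>
  <url>jdbc:sqlserver://server1;databaseName=part</url>
  <annotation>part_annotation.xml</annotation>
 </SOURCE>
 <SOURCE>
  <name>OrderDB</name>
  <driver>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver>
  <url>jdbc:sqlserver://server2;databaseName=order</url>
  <annotation>order_annotation.xml</annotation>
 </SOURCE>
</SOURCES>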
3.2.2 Conceptual Querying
Although querying is possible through SQL in this
model, it requires complete knowledge of the rela-
tionships within and between the data sources to spec-
ify joins. A simpler conceptual query language is de-
veloped that allows users to query without specifying
joins, as most users will not know all the details of an
integrated schema. The query language is an attribute
only version of SQL, where the SELECT clause con-
tains the concepts to be projected in the final results
and the optional WHERE clause specifies selection
criteria for the query. By giving each field a mean-
ingful semantic name, users can query on the name
in the annotation instead of the explicit relation and
field names of the individual data sources. Notice the
absence of the FROM clause, as well as any explicit specification of joins. The integration system will au-
tomatically identify these elements from the attributes
specified in the conceptual query.
The concepts in the query (semantic names) map
to fields in the relations. In Lines 8, 9 and 10 of
the example program, the conceptual query (contain-
ing only attributes) is passed to the driver using the
standard JDBC method. In this example, the user has
queried on: Part.Name, LineItem.Quantity, and Cus-
tomer.Name. These are mapped to their locations in
the databases: Part.Name → part.p_name (Part DB), LineItem.Quantity → lineitem.l_qty (Order DB), Customer.Name → customer.c_name (Order DB). Users
query on the annotation (attributes of the universal
view), and the system performs query inference to
find the required joins.
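Conceptually, this resolution step amounts to a lookup table from semantic names to physical fields. A small illustrative Java sketch, with the entries assumed from the example above:

import java.util.Map;

public class NameResolutionSketch {
    // Illustrative resolution table built from the annotations: each
    // conceptual-query attribute maps to database.relation.field.
    static final Map<String, String> RESOLUTION = Map.of(
        "Part.Name",         "PartDB.part.p_name",
        "LineItem.Quantity", "OrderDB.lineitem.l_qty",
        "Customer.Name",     "OrderDB.customer.c_name");

    public static void main(String[] args) {
        System.out.println(RESOLUTION.get("Part.Name")); // PartDB.part.p_name
    }
}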
The conceptual query is parsed inside the Unity
driver and converted to a parse tree using a JavaCC-based parser (https://javacc.dev.java.net). The parse tree identifies all of the at-
tributes and their relations in the local data sources
for use in the generation of the integrated result.
import java.sql.*;
public class JDBCApplication
{
  public static void main(String[] args)
  {
    String url = "jdbc:unity://sources.xml";                          (1)
    Connection con;                                                   (2)
    // Load UnityDriver class
    try { Class.forName("unity.jdbc.UnityDriver"); }                  (3)
    catch (java.lang.ClassNotFoundException e) { System.exit(1); }    (4)
    try { // Initiate connection                                      (5)
      con = DriverManager.getConnection(url);                         (6)
      Statement stmt = con.createStatement();                         (7)
      ResultSet rst = stmt.executeQuery("SELECT Part.Name, " +        (8)
          "LineItem.Quantity, Customer.Name " +                       (9)
          "WHERE Customer.Name='Customer#000000025'");                (10)
      System.out.println("Part, Quantity, Customer");                 (11)
      while (rst.next())                                              (12)
        System.out.println(rst.getString("Part.Name")                 (13)
          + "," + rst.getString("LineItem.Quantity")                  (14)
          + "," + rst.getString("Customer.Name"));                    (15)
      con.close();                                                    (16)
    }                                                                 (17)
    catch (SQLException ex) { System.exit(1); }                       (18)
  }
}
Figure 5: Sample JDBC Application
3.2.3 Building Subqueries
The query’s parse tree is translated into an execution
tree consisting of operators such as select, project,
join, and sort. The joins required to relate the at-
tributes and complete the query must be identified.
There are two levels of query inference required for
querying a global schema: local and global. Lo-
cal query inference determines the joins within a sin-
gle data source. The conceptual query specifying at-
tributes of the global schema is mapped to attributes
and relations in each data source. A join graph
G=(V, E) provides a directed graph representation of
a relational schema, where each node in V represents
a relation and each edge of E represents a foreign
key constraint between two relations. Inferring the re-
quired joins reduces to the problem of determining the
minimal edge set to connect the specified nodes (re-
lations) in the graph. The minimal set of joins repre-
sents the strongest relationship between the attributes
(Goldman et al., 1998). The inference algorithm uses
a shortest-path approximation of the Steiner tree algo-
rithm (Wald and Sorenson, 1984). The result of local
inference is a join tree for each data source.
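For two relations, the shortest-path approximation reduces to a breadth-first search over the join graph. The following Java sketch runs over the Order database's relations, assuming an undirected adjacency list built from the foreign keys; it illustrates the idea and is not the driver's actual implementation:

import java.util.*;

// Sketch of local query inference: nodes are relations, edges are
// foreign-key constraints, and BFS finds the minimal join path
// between two relations named in the conceptual query.
public class JoinInferenceSketch {

    static final Map<String, List<String>> JOIN_GRAPH = Map.of(
        "customer", List.of("orders"),
        "orders",   List.of("customer", "lineitem"),
        "lineitem", List.of("orders"));

    // Returns the minimal join path, e.g. [customer, orders, lineitem];
    // assumes the two relations are connected in the graph.
    static List<String> joinPath(String from, String to) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(from, from);
        queue.add(from);
        while (!queue.isEmpty() && !parent.containsKey(to)) {
            String node = queue.poll();
            for (String next : JOIN_GRAPH.getOrDefault(node, List.of()))
                if (parent.putIfAbsent(next, node) == null)
                    queue.add(next);
        }
        LinkedList<String> path = new LinkedList<>();
        for (String n = to; !n.equals(from); n = parent.get(n))
            path.addFirst(n);
        path.addFirst(from);
        return path;
    }

    public static void main(String[] args) {
        System.out.println(joinPath("customer", "lineitem"));
    }
}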
Since there is only one attribute/relation required
from the Part database, it is straightforward to gener-
ate the join tree of one node. However, in the Or-
der database, fields from the relations customer to
lineitem are requested, but these relations do not have
a direct connection between them. Query inference
determines the shortest path which results in a join
path from lineitem to customer through the orders re-
lation. The join trees are used to build the execution
trees by adding in any additional join operators re-
quired for the query. The selection operator must be
added to the execution tree of the subquery for the
Order database to select only Customer#000000025 from
the customer relation. The operator execution trees for the discussed subqueries are shown in Figure 6. The
subquery execution trees are converted into SQL sub-
queries to be executed at each local data source by
inserting the nodes in the FROM clause and the joins
in the WHERE clause. In addition, the selection criteria are included in the WHERE clause. Note that both
subqueries include the additional attribute partkey,
which will be used to integrate the subqueries. The
resulting subqueries are shown below:
Subquery 1: Order Database
SELECT l_qty, c_name, l_partkey
FROM lineitem, customer, orders
WHERE o_custkey=c_custkey and l_orderkey=o_orderkey
and c_name='Customer#000000025'

Subquery 2: Part Database
SELECT p_name, p_partkey FROM part
3.2.4 Integrating Results
Once local inference is complete and the subqueries
are created, the database engine determines how the
data extracted from each source can be related. To
combine the subquery results, global inference deter-
mines the joins across data sources by identifying and
Figure 6: Operator Execution Tree
relating global keys. A global key is a key that iden-
tifies an object beyond the scope of the database. For
instance, a book ISBN is a key that can be used to
match book records across databases. In the annota-
tion of Figure 3, global keys are identified by spec-
ifying the scope where they are valid. The primary
key for the Part relation has the scope of the organi-
zation that owns the database. A global join is found
between two databases if the key fields match and the
scope of the global keys is compatible. Since both of
the databases contain the Part.Id key with a shared do-
main and a valid scope, Part.Id is the global key used
to join the subquery results. If no global joins exist,
the user could explicitly specify global join conditions
or a cross-product is performed. The plan of execu-
tion for the global query is represented in the operator
tree shown in Figure 6.
The driver implements the execution tree by exe-
cuting the two subqueries through JDBC drivers to
the data sources to retrieve both ResultSets. The Re-
sultSets are combined based on the Part.Id global
join. Notice that l_partkey and p_partkey are included in the subqueries in order to execute the global join p_partkey = l_partkey. A global ResultSet is then con-
structed, which is accessed by regular JDBC methods,
as shown in lines 12-15 of the example program.
The embedded database engine uses the Result-
SetScan operator to scan in a JDBC ResultSet for
each subquery one row at a time. The subqueries
are threaded and executed in parallel. The optimizer
pushes as much work as possible to the individual
database sources, so that only the necessary global
joins and filtering/sorting is performed at the client.
Joins in the embedded database engine are performed
using either sort-merge join or a variant of adaptive
hash join. The database engine will use secondary
storage to efficiently handle large joins that do not fit in memory. The global join of the sample query
is processed by the database engine using a variant of
adaptive hash join (Zeller and Gray, 1990).
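A simplified, in-memory version of the client-side global join is sketched below in Java. The real engine uses an adaptive hash join variant that can spill to secondary storage; this sketch assumes the build side (the Part subquery result) fits in memory and uses the column names from the subqueries above:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.*;

// Simplified in-memory hash join of the two subquery ResultSets on
// the global key (Part.Id); illustrative only, not the driver's
// adaptive, disk-spilling implementation.
public class GlobalJoinSketch {
    static List<String[]> hashJoin(ResultSet parts, ResultSet lineitems)
            throws SQLException {
        // Build phase: hash the smaller Part result on p_partkey.
        Map<Integer, String> build = new HashMap<>();
        while (parts.next())
            build.put(parts.getInt("p_partkey"), parts.getString("p_name"));

        // Probe phase: stream the Order-side rows and emit matches.
        List<String[]> joined = new ArrayList<>();
        while (lineitems.next()) {
            String partName = build.get(lineitems.getInt("l_partkey"));
            if (partName != null)
                joined.add(new String[] {
                    partName,
                    lineitems.getString("l_qty"),
                    lineitems.getString("c_name") });
        }
        return joined;
    }
}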
3.3 Supporting Evolution
The driver supports two types of evolution: addi-
tion/deletion of data sources and schema evolution
within a data source. Evolution is supported because
all mappings are calculated dynamically: mappings
from global concept names to fields in local sources,
joins between tables in each local database, and global
joins connecting databases. Annotating schemas be-
fore they are integrated reduces naming conflicts and
allows the renaming of schema constructs. Since joins
are calculated within a data source at query-time,
structural evolutions such as promoting a 1:N rela-
tionship to an M:N relationship are handled. Finally,
the conceptual query language on the global schema
only references attributes, so if the global schema
evolves, the query system will update the mappings
without affecting existing queries.
3.4 Detecting Inconsistent Data
The architecture is designed to detect inconsistent
data across data sources. It is common that names,
addresses, and other text fields may have distinct val-
ues in several databases. For every data item pro-
duced in the global result, the provenance of the data
item can be maintained. This allows a global user to
determine the data source for each data element and
the properties of that data source. It also allows the
system to detect when two data sources have differ-
ent values for the same field and present those differ-
ences to the global user. The user can then choose
which data value to accept. Support for inconsistent
data is unique to our architecture as other data integra-
tion systems assume there is no inconsistency between
sources. Due to the cost of tracking data provenance,
the user has the option of disabling this feature.
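As an illustration, provenance can be modeled as a value tagged with its originating source. The type below is hypothetical and only sketches how conflicting values for the same global field could be detected:

// Sketch of provenance tracking: each global value carries its
// source, so conflicting values for the same field can be surfaced
// to the user. Hypothetical type; the driver's internal
// representation is not shown in this paper.
public class SourcedValue {
    final String value;
    final String source;   // e.g. "OrderDB.customer.c_name"

    SourcedValue(String value, String source) {
        this.value = value;
        this.source = source;
    }

    // Two sourced values for the same global field are inconsistent
    // if they disagree on the value yet come from different sources.
    boolean conflictsWith(SourcedValue other) {
        return !this.value.equals(other.value)
            && !this.source.equals(other.source);
    }
}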
4 EXPERIMENTAL EVALUATION
The speed of global view construction and support
for data source evolution are tested in the first experiment, in which multiple TPC-H schemas
are integrated. The results in Figure 7 show the times
to integrate schemas (create the global schema), con-
nect to all databases, parse the conceptual query, and
generate the subqueries. The process for integrat-
ing schemas into a global schema is a linear time
process based on the number of schemas integrated.
The test is representative as integrated schemas will
likely have data elements that share a large portion of
the same domain.
Schemas   Integrate Schemas   Connect   Parse Subqueries   Total Time
2         0.219               0.203     0.110              0.532
3         0.235               0.234     0.125              0.594
10        0.485               0.437     0.406              1.328
100       1.375               1.719     5.063              8.157
Figure 7: Integration of TPC-H Schemas in Seconds
The time to parse the global query into subqueries
has a greater complexity and processing time due to
global inference. However, parsing and inferring 100 SQL subqueries completes in about five seconds.
The test results show that the Unity driver is capa-
ble of handling rapid evolution of the global schema,
either within a single data source or by the addition
or removal of data sources, as schema rebuilding is
extremely fast. One hundred medium-sized schemas
may be integrated in less than 1.5 seconds and con-
nected in three seconds. Note that the global view is
created only once, whereas the parsing of the concep-
tual query and subquery generation is done for each
global query.
In order to evaluate the query execution times for
the Unity driver, duplicate copies of the partitioned
TPC-H benchmark databases with a 1 GB database
size were installed on two identical computers run-
ning Microsoft SQL Server. A third identical com-
puter executed the Java test programs, where each
time was averaged over five query executions. The
query execution times shown in Figures 8 and 9 are
described below:
TPC-H - Conceptual query executed through Unity
driver against a single source TPC-H database.
JDBC TPC-H - SQL query equivalent to concep-
tual query executed directly through SQL Server
JDBC driver on a single source TPC-H database.
Partitioned Separate Computers - Conceptual
query executed on TPC-H data set partitioned into
the Part and Order databases shown in Figure 2.
Partitioned on One Computer - Conceptual
query executed on TPC-H data set virtually parti-
tioned into the Part and Order databases shown in
Figure 2 on a single computer.
Two similar queries with contrasting result set sizes
are used to evaluate the time required for Unity
to process a conceptual query including execution
time. The large (Figure 8) conceptual query SE-
LECT p
name, l quantity, c name with equivalent
SQL query
4
returns 6,001,215 tuples (equivalent to
4
SELECT p
name, l quantity, c name FROM part,
Figure 8: Large Part-LineItem-Customer Query
Figure 9: Small Part-LineItem-Customer Query
Lineitem table size). The smaller (Figure 9) query
shown in the example program of Figure 5 generates
the same results limited to Customer
25 for a result
size of only 76 tuples
5
. The overhead for Unity to
convert the conceptual query (TPC-H) into the equiv-
alent SQL query listed (JDBC TPC-H) requires less
than 0.025 seconds (25 ms) for either query. This
demonstrates the nearly negligible time to pass a con-
ceptual query through the Unity driver on a single
source.
lineitem, customer, orders WHERE p partkey=l partkey
and o
orderkey=l orderkey and c custkey= c custkey
5
SELECT p
name, l quantity, c name FROM part,
lineitem, customer, orders WHERE p
partkey=l partkey
and o
orderkey=l orderkey and c custkey= c custkey and
c
name = ’Customer#000000025’
ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION
332
In Figure 8, the query executed on the partitioned databases completed faster than the same query executed on a single source TPC-H database. The dif-
ference is due to the fact that the embedded database
in Unity executes one of the joins on the client ma-
chine, as opposed to the server performing all three
joins. The query executing all three joins on the server
involves the join of the large Lineitem relation requir-
ing full use of the CPU on the server. For the partitioned databases, Unity is able to start the global join once the smaller ResultSet from the Part relation is
received. This allows for the embedded database in
Unity to work in parallel with the server yielding an
improved execution time for this particular query.
The results in Figure 9 show a longer execution
time for integration versus full execution on the single
source TPC-H database. The increased time is due to
the time to transport to the client the entire Part rela-
tion (1.3 seconds) from the Part database in order to
complete the global join. The single source query im-
ports only the 76 tuples in the final result. Also, integrating the partitioned data located on one computer takes slightly longer than integrating data partitioned on different computers, as the same resources are shared on a single computer.
5 CONCLUSION
The Unity JDBC driver provides an automatic and
scalable approach to integrate and then query multiple
data sources. The JDBC interface provides standard
methods to access the data. The ability to quickly re-
compute the global view allows for dynamic integra-
tion of a large number of databases. The lightweight
database engine embedded in the driver integrates the
data from multiple databases transparently. The addi-
tion of the conceptual query language allows queries
to be specified on the global view without the require-
ment of understanding the structure of each under-
lying schema. Experimental results show that this
approach causes minimal overhead in the querying
process. In addition, the driver efficiently executes
the queries by identifying the subqueries to execute
on the servers and using the client to complete the
integration. This unique approach allows integration
to be more automated, scalable, and rapidly deploy-
able.
Future work will investigate a more powerful
global query optimizer. By obtaining information
about the data sources including selectivity and re-
lation size, the global join strategy could be opti-
mized. In addition, strategies to effectively implement
GROUP BY queries will be examined.
REFERENCES
Collet, C., Huhns, M., and Shen, W.-M. (1991). Resource
Integration Using a Large Knowledge Base in Carnot.
IEEE Computer, 24(12):55–62.
Covitz, P., Hartel, F., Schaefer, C., Coronado, S., Fragoso,
G., Sahni, H., Gustafson, S., and Buetow, K. (2003).
caCORE: A common infrastructure for cancer infor-
matics. Bioinformatics, 19(18):2404–2412.
Decker, S., Erdmann, M., and Studer, R. (1998). ONTO-
BROKER: Ontology based access to distributed and
semi-structured information. In Database Semantics
- Semantic Issues in Multimedia Systems, volume 138
of IFIP Conference Proceedings. Kluwer.
Dragut, E. and Lawrence, R. (2004). Composing map-
pings between schemas using a reference ontology. In
ODBASE.
Goh, C., Bressan, S., Madnick, S., and Siegel, M. (1999).
Context Interchange: New Features and Formalisms
for the Intelligent Integration of Information. ACM
Transactions on Information Systems, 17(3):270–293.
Goldman, R., Shivakumar, N., Venkatasubramanian, S., and
Garcia-Molina, H. (1998). Proximity Search in Data-
bases. In VLDB, pages 26–37.
Haas, L., Lin, E., and Roth, M. (2002). Database Integration
through Database Federation. IBM Systems Journal,
41(4):578–596.
Halevy, A. (2001). Answering queries using views: A sur-
vey. VLDB Journal, 10(4):270–294.
Kirk, T., Levy, A., Sagiv, Y., and Srivastava, D. (1995). The
Information Manifold. In AAAI Spring Symposium on
Information Gathering.
Lenzerini, M. (2002). Data Integration: A Theoretical Per-
spective. In PODS, pages 233–246.
Li, C., Yerneni, R., Vassalos, V., Garcia-Molina, H.,
Papakonstantinou, Y., Ullman, J., and Valiveti, M.
(1998). Capability Based Mediation in TSIMMIS. In
ACM SIGMOD, pages 564–566.
Rahm, E. and Bernstein, P. (2001). A survey of ap-
proaches to automatic schema matching. VLDB Jour-
nal, 10(4):334–350.
Sheth, A. and Larson, J. (1990). Federated Database Sys-
tems for Managing Distributed, Heterogeneous and
Autonomous Databases. ACM Computing Surveys,
22(3):183–236.
Ullman, J. (1997). Information Integration Using Logical
Views. In ICDT’97, volume 1186 of LNCS, pages 19–
40.
Wald, J. and Sorenson, P. (1984). Resolving the Query
Inference Problem Using Steiner Trees. TODS,
9(3):348–368.
Zeller, H. and Gray, J. (1990). An adaptive hash join al-
gorithm for multiuser environments. In VLDB 1990,
pages 186–197.