SEMANTIC QUERY TRANSFORMATION FOR

INTEGRATING WEB INFORMATION SOURCES

Mao Chen, Rakesh Mohan, and Richard T. Goodwin

IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA

Keywords: Information integration, Query transformation, Semantic information, Ontology, Web services

Abstract: The heterogeneity and dynamics of web information sources are the major challenges to Internet-scale information integration. The information sources differ in content and query interfaces. In addition,

the sources can be highly dynamic in the sense that they can be added, removed, or updated with time. This

paper introduces a novel information integration framework that leverages the industry standards on web

services (WSDL/SOAP) and ontology description languages (RDF/OWL), and a commercial database (IBM DB2 Information Integrator, DB2 II). Taking advantage of the data integration and query

optimization capability of DB2 II, this paper focuses on the methodologies to transform a user query to the

queries on different sources and to combine the transformation results into a query to DB2 II. By wrapping

information sources using web services and annotating them with regard to their contents, query capabilities

and the logical relations between concepts, our query transformation engine is rooted in ontology-based

reasoning. To the best of our knowledge, this is the first framework that uses web services as the interface of

information sources and combines ontology-based reasoning, web services, semantic annotation on web

services, as well as DB2 II to support Internet-scale information integration.

1 INTRODUCTION

Efficient information integration from various

sources is critical to Internet-scale business systems.

In contrast to traditional full-fledged and stable

information sources such as databases, web

information sources are distinct in their

heterogeneity and dynamics. First, web sources are heterogeneous in content; hence a single information

source usually provides only part of the answer for a

user query. In addition, web sources have different

query capabilities that are reflected in the various

query schemas. Furthermore, web sources are highly

dynamic in the sense that new sources are added

continuously, old ones may become unavailable, and

existing ones are updated frequently in terms of both

the query interface and the contents.

The web service technology (W3C ’02) provides

a machine-usable interface to wrap the information

sources that are conventionally accessible only via

human-understandable query forms. Via a web

service wrapper, any structured databases, file

systems, unstructured web pages and other

information sources can be treated equally in

Internet-scale information integration.

This paper proposes a novel framework for information integration from heterogeneous and

dynamic sources. Our framework leverages industry

standards on web service and ontology, and an IBM

database system. Namely, IBM DB2 Information

Integrator (DB2 II) acts as the back end for hosting

information from various sources and generating optimized query plans for the sources.

The key challenge in the proposed framework is transforming a user query to a valid DB2 II query.

Our query transformation mechanism consists of

two phases. Phase 1 customizes a user query into the

queries to different sources. The transformation

results are used in the second phase to generate a

query as an input to DB2 II. The cornerstone of our query transformation algorithm is ontology-based reasoning. Ontology is used to describe the user’s view,

the query schemas of the web services, and the

relations between different concepts.

The major contributions of this paper are threefold:

1.) Proposing a novel framework for Internet-scale information integration using web services, ontology technology and commercial databases;


Chen M., Mohan R. and T. Goodwin R. (2005).

SEMANTIC QUERY TRANSFORMATION FOR INTEGRATING WEB INFORMATION SOURCES.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 176-181

DOI: 10.5220/0002528801760181

 SciTePress

2.) Proposing a set of reasoning rules for

transformation between different schemas;

3.) Presenting an ontology-based annotation

scheme for describing query interfaces of

web services which can be an extension of

OWL-S/DAML-S (DAML, Burstein ‘02).

2 RELATED WORK

Integrating information from heterogeneous sources

has been an important problem in the management of very large databases (Arens ’96, Genesereth ’97,

Gio ’00,

Madhavan ’03). The integration systems

can be classified as query-centric and source-centric.

The query-centric systems choose a set of users’

queries and provide the procedure to customize

those queries for the available sources (TSIMMIS

’94, HERMES ’95). As a representative of source-

centric systems, InfoManifold describes sources’

contents and query capabilities, and transforms each

new query based on the descriptions (Levy ’96).

Both types of systems focus on query planning

optimization using certain criteria, but use light-

weight transformation between different concept

spaces. Our work is distinct from the previous

efforts in three ways.

First, the query plans generated by these

integration systems are usually not optimized at the

execution level. In contrast, many commercial

databases such as IBM DB2 II have powerful query

planning engines that use sophisticated algorithms

based on execution cost, statistics on usage, and other parameters of the running environment (Haas ‘97). Our methodology takes advantage of the query optimization capabilities of DB2 II, and therefore guarantees efficient query execution at run time.

[Figure 1 diagram. Components shown: the user query; the Information Integrator, containing the query transformation engine (QTE), the query generator, DB2 II, and the Ontology, Knowledge base and Directory stores; web services WS1 to WSn, which wrap web sources 1 to n across the Internet.]

The second distinction between our work and the

previous work is the transformation mechanism.

The transformation in the previous work is lightweight. Bussler et al. indicate that combining

ontology technology and web service technology is

important for making web information machine-

processable (Bussler ’02). Based on this idea, our

information integration framework uses ontology-

based reasoning to handle discrepancy between

different concept spaces.

Finally, the traditional systems usually rely on ad-

hoc wrapper languages and models, which makes

adding or changing services in such an integration

system a heavy burden on the service provider side

(TSIMMIS ’96). Since web services can be added or removed without recoding the integration engine and the wrappers, our framework is well suited to dynamic environments such as the web.

3 ARCHITECTURE OF OUR

INFORMATION INTEGRATION

SYSTEM

Figure 1 outlines the conceptual architecture of our

information integration system. A user can query the integration system with SQL statements, as with a conventional database. Each web source is wrapped and presented as a web service that is mapped to a virtual table in DB2 II. Using the built-in capability of DB2 II for federating web services, the integration system transforms a user query into queries to web services, integrates the results from the web services, and returns the integrated result to the user.

Our integration system consists of three

functional modules. The front end of our integration

system has a query transformation engine (QTE) and

a query generator. QTE is in charge of customizing a

user query into the valid queries of the web services.

Based on the transformation result, the query

generator creates a valid DB2 II query on all the

related web services and triggers DB2 II with the

query. At the back end of our integration framework

sits IBM DB2 II. DB2 II generates an optimized executable query plan that calls all the related web services and returns the aggregated results to users.

Figure 1: Architecture of our information integration

system for web sources

Our query transformation (by QTE) and query

generation (by query generator) are accomplished

based on two types of knowledge. The first type of

knowledge is semantic information about the

services. The knowledge source “Ontology” stores

the query capability of each service and the relations


between different concepts. The “Knowledge base”

holds the information that cannot be described using

ontology, for example, the mathematical relations

between the concepts. The second type of

knowledge is about web services. The “Directory”

provides registry service to web services and updates

the virtual tables of web services in DB2 II. We

envision that the directory service can be implemented by semantically enhancing the UDDI service (UDDI), as proposed in prior work (Akkiraju ’03).

Given the query optimization capability of DB2

II, the major challenges of the above infrastructure

include annotating web services with their query capabilities, automatically transforming a user query into a valid query for each web service, and generating an executable query plan for DB2 II. The

next section presents our mechanisms to deal with

the three issues.

4 SEMANTICS-BASED QUERY

TRANSFORMATION

This study uses a used-car searching service as an

application scenario to introduce our information

integration framework. Given a user query on used

car information, this service queries three sites, Yahoo Autos (Yahoo), Autos MSN (MSN) and Kelley Blue Book (KBB), and integrates their results. Yahoo and MSN provide on-line

retailing and auction information about the used

cars, and KBB is an authority site that provides a

suggested retail price for a car when given car

information such as make, model, and year.

A user’s concept space about used car

information includes two parts: the query and the

result. A user can search for used cars based on the user’s location, search area, make and model, year, mileage and price. The results of most interest to a user are year, mileage, asked price, and KBB suggested price.

Our information integration system aims at transforming an SQL-like user query such as

SELECT * FROM car
WHERE make = 'Acura' AND price <= 15000

into a valid query for DB2 II, which federates the aforementioned web services:

SELECT make, model, mileage, price
FROM YahooAuto
WHERE make = 'Acura' AND maxprice = 15000
UNION ALL
SELECT make, model, mileage, price
FROM MSNCars
WHERE category = 'Passenger Cars' AND
      make = 'Acura' AND price = 15000

“UNION ALL” links queries, each of which is valid for one web service. The final combined DB2 II query is

formed based on the relations among the user’s

query, the query capability and the contents of each

web service.

4.1 Describing Web Services as

Ontology

We annotate the semantic information about web services using the Protégé ontology editor and knowledge acquisition system (Protégé-2000). The resulting ontology is represented as RDFS and RDF. A web service is an instance of the class “web source”, which has three properties: the service

name, the query class (input schema), and the output

class (output schema). Tables 1 and 2 show the

query class and the output class for Yahoo.

Table 1: Query class of Yahoo

Properties          Range              Required (default)
User Position       {User Location}    Yes
Search Within       {Search Area}      No (50 miles)
Car Make            {Manufacture}      No
Car Model           {Model}            No
Mileage LessThan    {Car Mileage}      No
Mileage MoreThan    {Mileage}          No (0 mile)
Year LessThan       {Car Year}         No (2004)
Year MoreThan       {Car Year}         No (1940)
Price Range         {Price Range}      No

Table 2: Output class of Yahoo

Properties          Range              Required
Asked Price         {Car Price}        Yes
Mileage Is          {Mileage}          Yes
Car Type            {Make Model}       Yes
Car YearIs          {Car Year}         Yes

The symbols in braces refer to classes, and the values in parentheses are the default values. Table 1 also shows that only the user position is required by Yahoo Autos. Autos MSN and Kelley Blue Book have input and output schemas that differ from those of Yahoo Autos; they are not shown due to the space limit.
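As an illustration only, the query class of Table 1 can be modelled in a few lines of Python. The names QueryProperty, WebSource and fill_defaults are hypothetical and are not part of the framework, whose annotations are stored as RDFS and RDF; the sketch merely shows how required properties and default values could be used when a query is completed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class QueryProperty:
    name: str                       # property name, e.g. "Search Within"
    range_class: str                # ontology class shown in braces in Table 1
    required: bool = False
    default: Optional[Any] = None   # default shown in parentheses in Table 1


@dataclass
class WebSource:
    service_name: str
    query_class: List[QueryProperty] = field(default_factory=list)
    output_class: List[str] = field(default_factory=list)

    def fill_defaults(self, bindings: Dict[str, Any]) -> Dict[str, Any]:
        """Complete the bindings with defaults; reject a missing required property."""
        completed = dict(bindings)
        for prop in self.query_class:
            if prop.name in completed:
                continue
            if prop.default is not None:
                completed[prop.name] = prop.default
            elif prop.required:
                raise ValueError(f"missing required property: {prop.name}")
        return completed


# Query and output classes of Yahoo, transcribed from Tables 1 and 2.
yahoo = WebSource(
    service_name="Yahoo",
    query_class=[
        QueryProperty("User Position", "User Location", required=True),
        QueryProperty("Search Within", "Search Area", default=50),
        QueryProperty("Car Make", "Manufacture"),
        QueryProperty("Car Model", "Model"),
        QueryProperty("Mileage LessThan", "Car Mileage"),
        QueryProperty("Mileage MoreThan", "Mileage", default=0),
        QueryProperty("Year LessThan", "Car Year", default=2004),
        QueryProperty("Year MoreThan", "Car Year", default=1940),
        QueryProperty("Price Range", "Price Range"),
    ],
    output_class=["Asked Price", "Mileage Is", "Car Type", "Car YearIs"],
)

completed = yahoo.fill_defaults({"User Position": "10598", "Car Make": "Acura"})
```

Optional properties without a default, such as “Car Model”, are simply left out of the completed query.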

4.2 Transforming a User Query to the

Queries to the Web Services

This section presents the solutions for seven types of

schema mismatch. The first four rules handle two

pairs of dual transformations for abstract model and

instance model. The fifth and the sixth rules are for

transformation between different abstract models.

The last rule handles the mismatches in searchable

attributes at both abstract and instance levels.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION


4.2.1 Concept Mapping

One of the most common difficulties in dealing with heterogeneous schemas is that the same concept has different names in different sources. This mismatch can be handled using concept mapping, or renaming. In this study, renaming is done by mapping different names to a common concept using “rdfs:range”. For example, the two equivalent concepts “Yahoo User Location” and “MSN User at” can be mapped to the same class “User Location”.

With the ontology description language OWL (OWL 2004), one can use “owl:equivalentProperty” to indicate the equivalence of the above two properties.
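The renaming step can be sketched as a lookup from source-specific property names to common classes. In the Python sketch below, the table PROPERTY_RANGE stands in for the “rdfs:range” annotations, and the function name and the entry “Yahoo Car Make” are assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical stand-in for rdfs:range annotations:
# source-specific property name -> common ontology class.
PROPERTY_RANGE: Dict[str, str] = {
    "Yahoo User Location": "User Location",
    "MSN User at": "User Location",
    "Yahoo Car Make": "Manufacture",  # assumed entry, for illustration only
}


def rename_for_source(user_query: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Rewrite {common class: value} into {source property: value} for one source."""
    by_class = {
        cls: prop
        for prop, cls in PROPERTY_RANGE.items()
        if prop.startswith(source + " ")
    }
    return {by_class[cls]: value for cls, value in user_query.items() if cls in by_class}


user_query = {"User Location": "10598", "Manufacture": "Acura"}
msn_query = rename_for_source(user_query, "MSN")
yahoo_query = rename_for_source(user_query, "Yahoo")
```

Concepts with no counterpart in a source are dropped here; how such attributes are treated is the subject of mismatch handling.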

4.2.2 Instance Mapping

In practice, the same instance may have different names in different sources. For example, “New York” and “NY” refer to the same state instance. Instance mapping is analogous to concept mapping and can be achieved using the “owl:sameAs” description. The following example shows the equivalence of “New York” and “NY”:

<owl:sameAs rdf:resource="#NY" />

</UsedCar>

4.2.3 Concept Folding

Different sources may allow queries at different levels of granularity for a given attribute. For example, Kelley Blue Book requires queries on “Car Type”, which combines “Manufacture” and “Model” as a single attribute, while Yahoo allows queries to specify “Make” and “Model” separately. We call the transformation from a fine-grained level to a coarser-grained level concept folding.

Using RDFS, concept folding can be achieved by annotating fine-grained concepts as properties of the coarse-grained concept. In OWL, the two concepts “Make” and “Model” can be defined as sub-properties (“rdfs:subPropertyOf”) of the property “Make Model”.
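A minimal Python sketch of concept folding follows. The function name and the choice of joining the parts with a space are assumptions; the paper does not specify how the folded value is serialized.

```python
from typing import Any, Dict, Tuple


def fold_concepts(
    query: Dict[str, Any],
    parts: Tuple[str, ...] = ("Make", "Model"),
    folded: str = "Make Model",
) -> Dict[str, Any]:
    """Merge fine-grained attributes into one coarse-grained attribute."""
    if not all(p in query for p in parts):
        return dict(query)  # nothing to fold
    result = {k: v for k, v in query.items() if k not in parts}
    # Assumption: the folded value is the parts joined by a single space.
    result[folded] = " ".join(str(query[p]) for p in parts)
    return result


kbb_query = fold_concepts({"Make": "Acura", "Model": "CL", "Year": 1998})
```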

4.2.4 Instance Folding

Unlike Concept Folding, which merges fine-grained concepts into an equivalent single concept, Instance Folding (or Concept Expanding) extends an instance into a more general instance.

Assume a user’s query includes two parameters, “Make” and “Model”, but a web service like MSN supports car searching only on “Car Category”. A car category includes many car types, hence query transformation needs to extend a search for a specific car type into a search for a more general category. We define the class “Car Category” with two properties, “Make” and “Model”. The relation between each category and each pair of make and model is described by the instances in an RDF file, as shown in Figure 2. With this knowledge, one can transform a user’s query such as

Where Make = “Acura” and Model = “CL”

into the following query to MSN:

Where Car Category = “Passenger Cars”

Make = “Acura” Model = “CL”

Car Category = “Passenger Cars”

Figure 2: Instance folding of “Acura” and “CL”

Instance folding loosens the search criteria in order to maximize the use of all the related sources; the results should therefore be filtered against the original user request.
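The folding and the subsequent filtering can be sketched in Python as follows. The dictionary CAR_CATEGORY stands in for the RDF instances of Figure 2; the function names and the sample result rows are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Stand-in for the RDF instances of Figure 2: (make, model) -> car category.
CAR_CATEGORY: Dict[Tuple[str, str], str] = {("Acura", "CL"): "Passenger Cars"}


def fold_instance(query: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the specific make/model pair by its more general car category."""
    folded = {k: v for k, v in query.items() if k not in ("Make", "Model")}
    folded["Car Category"] = CAR_CATEGORY[(query["Make"], query["Model"])]
    return folded


def filter_by_original(rows: List[Dict[str, Any]], original: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the rows that satisfy the original, more specific request."""
    return [r for r in rows if all(r.get(k) == v for k, v in original.items())]


user_query = {"Make": "Acura", "Model": "CL"}
msn_query = fold_instance(user_query)

# Hypothetical rows returned for the whole category.
rows = [
    {"Make": "Acura", "Model": "CL", "Asked Price": 14000},
    {"Make": "Honda", "Model": "Accord", "Asked Price": 9000},
]
matches = filter_by_original(rows, user_query)
```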

4.2.5 Inequality Inference for Concepts

Generally speaking, a web service may not offer a full set of comparison operators for an attribute, whereas a user’s query may contain any comparison operator. Limited query capability is a fundamental difference between web services and databases.

For the same attribute, some web services accept equality queries, while others use range searching. For range searching, a service may allow the range to have one open end or both ends open. Therefore, a semantic analysis of each service’s query capability with respect to inequality queries is necessary.

To transform a comparison operator requested by the user into one that a web service offers, we identify a complete set of transformations between any pair of the comparison operators <, <=, =, >=, and >. For example, when a user’s query includes “> N” for an integer-valued attribute A and a service allows only “<” searching on A, the user’s query can be transformed into “{< Max + 1} - {< N + 1}”, where Max is the largest value of A and {} - {} denotes set difference.

In this study, the semantic meaning of inequality query capability is annotated using property names. For example, the class “Car Price Range” has two properties, “Price Less Than” and “Price Greater Than”, that describe a range search on car price with two open ends. The semantic meanings of the comparison operators “>” and “<” are encoded as “Greater Than” and “Less Than”. A user’s query including “Where price < 20000” is transformed into “Price Less Than = 20000” in the query to the corresponding web services.
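The set-difference form of an operator transformation can be checked on a small example. The Python sketch below assumes an integer-valued attribute and a hypothetical service that supports only the “<” operator; it emulates “> N” as {< Max + 1} - {< N + 1}. The function names and the sample rows are illustrative.

```python
from typing import Any, Dict, List, Set


def less_than(rows: List[Dict[str, Any]], attr: str, bound: int) -> Set[int]:
    """Stand-in for a service call that supports only the '<' operator."""
    return {r["id"] for r in rows if r[attr] < bound}


def greater_than_via_less_than(
    rows: List[Dict[str, Any]], attr: str, n: int, max_value: int
) -> Set[int]:
    """Emulate attr > n as {attr < max_value + 1} - {attr < n + 1} on integers."""
    return less_than(rows, attr, max_value + 1) - less_than(rows, attr, n + 1)


# Hypothetical listings; 20000 is assumed to be the largest price.
cars = [
    {"id": 1, "price": 9000},
    {"id": 2, "price": 15000},
    {"id": 3, "price": 20000},
]
ids = greater_than_via_less_than(cars, "price", n=15000, max_value=20000)
```

The rewriting costs two service calls instead of one, which is the kind of trade-off the query optimizer of DB2 II is left to weigh.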


4.2.6 Mathematical Reasoning for Concepts

Not all relations between concepts can be described using an ontology language. In particular, neither RDFS nor OWL can represent mathematical relations between concepts.

For example, MSN accepts queries on a car’s age, while Yahoo allows searching for a car based on the upper bound and the lower bound of the car’s production year. A mathematical transformation is required between the two concepts “Car age” and “Year MoreThan” using the constant “current year”:

Year MoreThan = Current Year – Car age
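As a small worked instance of this relation, the Python sketch below converts an age bound into a production-year bound. The function name is hypothetical, and the current year 2004 is an assumption taken from the default value of “Year LessThan” in Table 1.

```python
from datetime import date
from typing import Optional


def year_more_than(car_age: int, current_year: Optional[int] = None) -> int:
    """Year MoreThan = Current Year - Car age."""
    if current_year is None:
        current_year = date.today().year
    return current_year - car_age


# A car less than 8 years old in 2004 was produced after 1996.
lower_bound = year_more_than(8, current_year=2004)
```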

4.2.7 Mismatch Handling for Attributes

There are two reasons why the attributes specified in a user query may be unsearchable in a web service. The first is that the attribute set in the user’s query does not match the one used by the web service, which is called “domain mismatch” in this paper. The second is that the range of an attribute in the user query differs from that in the web service, which is referred to as “range mismatch” in this paper.

In domain mismatch, the web service requires some attributes that are not specified in the user’s query, or, conversely, an attribute in the user’s query is not part of the query schema of the web service. In the former case, the value of the attribute required by the web service can be defaulted, or alternatively, the query is run with each possible value of the required attribute. In the latter case, the attribute in the user’s query must be ignored when generating the query to the web service. This returns a superset of the requested results. If the ignored attribute is part of the result schema of the web service, post-processing can filter out the results that do not match the user’s constraint. Default values can be annotated using “a:defaultValues” in RDFS.

One scenario for range mismatch is that a web service requires enumerated values for an attribute, which can be annotated using “owl:oneOf”. To deal with range mismatch, the value of an attribute in a user’s query should be mapped to the closest valid value for the web service such that the result from the web service is a superset of the result of the original user query. The results should then be filtered based on the original user’s query.
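Both kinds of mismatch can be sketched briefly in Python. The function names, the attribute “Color” and the enumerated search radii are hypothetical; the sketch only illustrates deferring unsearchable attributes to post-processing and widening a value to the closest allowed one that still yields a superset.

```python
from typing import Any, Dict, Iterable, Set, Tuple


def split_query(
    user_query: Dict[str, Any], searchable: Set[str]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Domain mismatch: send searchable attributes, defer the rest to post-filtering."""
    sent = {k: v for k, v in user_query.items() if k in searchable}
    deferred = {k: v for k, v in user_query.items() if k not in searchable}
    return sent, deferred


def widen_to_allowed(requested: int, allowed: Iterable[int]) -> int:
    """Range mismatch on an upper bound: smallest allowed value >= requested."""
    candidates = [a for a in allowed if a >= requested]
    if not candidates:
        raise ValueError("no allowed value yields a superset of the request")
    return min(candidates)


sent, deferred = split_query({"Make": "Acura", "Color": "Red"}, {"Make", "Model"})
radius = widen_to_allowed(30, [10, 25, 50, 100])
```

Choosing the smallest allowed value that is not below the request keeps the amount of post-filtering as low as the enumerated range permits.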

4.3 Generating Query to DB2 II

After a user’s query is transformed to queries to the

web services, the query generator in Figure 1

generates a DB2 II query on multiple web services.

The query generation consists of three steps.

The first step is to identify all the web services related to a given user query. A web service is related if its output schema overlaps the result schema of the user query and its required attributes can be satisfied by the user’s query.

The second step is to group the services whose output schemas are consistent. We call two schemas consistent if they are equivalent or one schema contains the other. The resulting schema of a service group is the intersection of the output schemas of all the services in the group. The results from the web services in the same service group are merged using the statement “UNION ALL”.

The last step deals with the case where the output of one service group is complementary to that of another group. The query generator joins the results of those service groups.
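The grouping step can be sketched in Python as follows. The greedy strategy, the function names and the simplified attribute names are assumptions; the paper does not prescribe a particular grouping algorithm.

```python
from typing import Dict, FrozenSet, List, Tuple

Schema = FrozenSet[str]


def consistent(a: Schema, b: Schema) -> bool:
    """Two schemas are consistent if they are equal or one contains the other."""
    return a <= b or b <= a


def group_services(outputs: Dict[str, Schema]) -> List[Tuple[List[str], Schema]]:
    """Greedily group services with consistent output schemas.

    Each group is returned with the intersection of its members' schemas.
    """
    groups: List[Tuple[List[str], List[Schema]]] = []
    for name, schema in outputs.items():
        for names, schemas in groups:
            if all(consistent(schema, s) for s in schemas):
                names.append(name)
                schemas.append(schema)
                break
        else:
            groups.append(([name], [schema]))
    return [(names, frozenset.intersection(*schemas)) for names, schemas in groups]


# Simplified output schemas of the three services in the used-car scenario.
outputs = {
    "KBB": frozenset({"year", "kbb_price", "car_type"}),
    "Yahoo": frozenset({"year", "price", "mileage", "car_type"}),
    "MSN": frozenset({"year", "price", "mileage", "car_type"}),
}
groups = group_services(outputs)
join_keys = groups[0][1] & groups[1][1]  # attributes shared by the two groups
```

In this example the attributes shared by the two groups are the year and the car type, which are natural candidates for joining complementary groups.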

4.4 Example of Transforming a User

Query to a DB2 II Query

Assume DB2 II integrates three web services, Yahoo Autos (Yahoo), Autos MSN (MSN) and Kelley Blue Book (KBB), and a user’s query is as follows:

SELECT * FROM car
WHERE Make = 'Acura'
AND Model = 'CL'
AND Year < 8
AND Price < 20000
AND Price > 10000
AND Mileage < 70000
AND Location = 10598

We first create two virtual tables, each defined as a named query in a WITH clause. The first group includes KBB only and provides the KBB Suggested Price, which is not available from other service groups. The second group merges the results of Yahoo and MSN using the UNION ALL statement. The values shown in parentheses in the statement are default values.

WITH cars_0 (year, kbb_price, car_type) AS
(SELECT KBB_CarYearIs, KBB_SuggestedPrice, KBB_CarTypeIs
 FROM KBB
 WHERE KBB_CarType.Car_Make = 'Acura' AND
       KBB_CarType.Car_Model = 'CL'),
cars_1 (year, price, mileage, car_type) AS
((SELECT Yahoo_CarYearIs, Yahoo_AskedPriceIs,
         Yahoo_CarMileageIs, Yahoo_CarType
  FROM Yahoo
  WHERE Yahoo_Car_Make = 'Acura' AND
        Yahoo_Car_Model = 'CL' AND
        Yahoo_MileageLessThan = 70000 AND
        Yahoo_MileageMoreThan = (0) AND
        Yahoo_PriceRange.PriceLessThan = 20000 AND
        Yahoo_PriceRange.PriceMoreThan = 10000 AND
        Yahoo_SearchWithin = (50) AND
        Yahoo_UserPosition = 10598 AND
        Yahoo_YearLessThan = (2004) AND
        Yahoo_YearMoreThan = 1996)
 UNION ALL
 (SELECT MSN_YearIs, MSN_AskedPriceIs,
         MSN_MileageIs, MSN_CarTypeIs
  FROM MSN
  WHERE MSN_CarAgeLessThan = 8 AND
        MSN_CarCategory = 'PassengerCars' AND
        MSN_CarType.Car_Make = 'Acura' AND
        MSN_CarType.Car_Model = 'CL' AND
        MSN_MileageLessThan = 70000 AND
        MSN_PriceRange.PriceLessThan = 20000 AND
        MSN_PriceRange.PriceMoreThan = 10000 AND
        MSN_SearchWithin = (100) AND
        MSN_UserAt = 10598))

Finally, a SELECT statement joins the results

from two virtual tables (service groups).

SELECT c0.year, c0.kbb_price, c0.car_type, c1.year,

c1.price, c1.mileage, c1.car_type

FROM cars_0 c0, cars_1 c1

WHERE c0.year = c1.year AND c0.car_type =

c1.car_type
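The shape of the combined query (two named queries, a UNION ALL inside one of them, and a final join) can be tried out locally. The Python sketch below uses the standard sqlite3 module with three ordinary in-memory tables as stand-ins for the web service tables; it does not use DB2 II or any web service, and the table layouts and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary tables standing in for the virtual tables of the three services.
conn.executescript("""
CREATE TABLE kbb   (year INTEGER, suggested_price INTEGER, car_type TEXT);
CREATE TABLE yahoo (year INTEGER, price INTEGER, mileage INTEGER, car_type TEXT);
CREATE TABLE msn   (year INTEGER, price INTEGER, mileage INTEGER, car_type TEXT);
INSERT INTO kbb   VALUES (1998, 14500, 'Acura CL'), (1999, 16000, 'Acura CL');
INSERT INTO yahoo VALUES (1998, 13900, 65000, 'Acura CL');
INSERT INTO msn   VALUES (1999, 15500, 52000, 'Acura CL');
""")

# cars_0 is the KBB group; cars_1 merges the Yahoo and MSN group.
rows = conn.execute("""
WITH cars_0 (year, kbb_price, car_type) AS (
    SELECT year, suggested_price, car_type FROM kbb
),
cars_1 (year, price, mileage, car_type) AS (
    SELECT year, price, mileage, car_type FROM yahoo
    UNION ALL
    SELECT year, price, mileage, car_type FROM msn
)
SELECT c1.car_type, c1.year, c1.price, c1.mileage, c0.kbb_price
FROM cars_0 c0
JOIN cars_1 c1 ON c0.year = c1.year AND c0.car_type = c1.car_type
ORDER BY c1.year
""").fetchall()

conn.close()
```

Each listing from the merged group is paired with the suggested price of the matching year and car type, which mirrors the role of the final SELECT statement.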

5 CONCLUSION

We have proposed a novel information integration framework that uses web services as wrappers to represent heterogeneous web information sources. Our framework is built upon industry standards such as WSDL/SOAP and the ontology languages RDFS and OWL, and leverages the service federation and query optimization capabilities of IBM DB2 II. Using a used-car searching service as the application scenario, we presented a set of ontology-based transformation rules to deal with the schema and content heterogeneity of web sources. Our future work will address scalability issues in our framework and methodologies.

REFERENCES

Akkiraju, R., Goodwin, R., Doshi, P., and Roeder, S.,

2003. “A Method for Semantically Enhancing the

Service Discovery Capabilities of UDDI”. In the

workshop Proc. of 18th IJCAI 2003. Information

Integration on the Web, 87-92

Arens, Y., Knoblock, C. A., and Shen, W., 1996. “Query

reformulation for dynamic information integration”.

Journal of Intelligent Information Systems, 1996.

Burstein, M. H., Hobbs, J. R., Lassila, O., Martin, D.,

McDermott, D. V., McIlraith, S. A., Narayanan, S.,

Paolucci, M., Payne, T. R., Sycara, K. P., 2002.

“DAML-S: Web Service Description for the Semantic

Web”. In Proceedings of International Semantic Web

Conference 2002: 348-363

Bussler, C., Fensel, D., and Maedche, A., 2002. “A Conceptual Architecture for Semantic Web Enabled Web Services”. In ACM SIGMOD Record, Vol. 31, No. 4, December 2002.

http://www.daml.org/services/owl-s/

DB2 Information Integration. http://www-306.ibm.com/software/data/integration/.

Genesereth, M. R., Keller, A. M., and Duschka, O. M.,

1997. “Infomaster: An information integration system”.

In Proc. of SIGMOD, 1997.

Gio, W., 2000. "Future Needs in Integration of

Information". In International Journal of Cooperative

Systems, Vol. 9, No.4, World Scientific Publishing,

November 2000, pages 449-772.

Haas, L. M., Kossmann, D., Wimmers, E. L., and Yang, J.,

1997. “Optimizing Queries Across Diverse Data

Sources”. VLDB (1997): pp 276-285

Subrahmanian, V. S., Adali, S., Brink, A., Lu, J., Rajput, A., Rogers, T. J., Ross, R., and Ward, C., 1995. “HERMES: A heterogeneous reasoning and mediator system”. Technical report, University of Maryland, 1995.

Levy, A. Y., Rajaraman, A., and Ordille, J., 1996.

“Querying Heterogeneous Information Sources Using

Source Descriptions”. In Proc. of VLDB, 1996.

Madhavan, J. and Halevy, A. Y., 2003. “Composing

Mappings Among Data Sources”. In Proc. Of VLDB

2003, pages 572 - 583.

OWL Web Ontology Language Reference. http://www.w3.org/TR/2004/REC-owl-ref-20040210/

Protégé ontology editor and knowledge acquisition

system. http://protege.stanford.edu/

Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., and Widom, J., 1994. “The TSIMMIS Project: Integration of Heterogeneous Information Sources”. In Proceedings of 16th Meeting of the Information Processing Society of Japan, 1994.

Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J., Vassalos, V., and Widom, J., 1996. “The TSIMMIS Approach to Mediation: Data Models and Languages”. Journal of Intelligent Information Systems, 8 (2), 1997, 117-132, March - April.

UDDI Technical Committee. “Universal Description, Discovery and Integration (UDDI)”. http://www.oasis-open.org/committees/uddi-spec/

Web Services Activity. http://www.w3c.org/2002/ws/.
