COMBINING INFORMATION EXTRACTION AND DATA

INTEGRATION IN THE ESTEST SYSTEM

Dean Williams

School of Computer Science and Information Systems, Birkbeck College, University of London

Malet Street, London WC1E 7HX, U.K.

Alexandra Poulovassilis

School of Computer Science and Information Systems, Birkbeck College, University of London

Malet Street, London WC1E 7HX, U.K.

Keywords:

Information Extraction, Data Integration, ESTEST.

Abstract:

We describe an approach which builds on techniques from Data Integration and Information Extraction in

order to make better use of the unstructured data found in application domains such as the Semantic Web

which require the integration of information from structured data sources, ontologies and text. We describe the

design and implementation of the ESTEST system which integrates available structured and semi-structured

data sources into a virtual global schema which is used to partially conﬁgure an information extraction process.

The information extracted from the text is merged with this virtual global database and is available for query

processing over the entire integrated resource. As a result of this semantic integration, new queries can now

be answered which would not be possible from the structured and semi-structured data alone. We give some

experimental results from the ESTEST system in use.

1 INTRODUCTION

The Semantic Web requires us to be able to integrate

information from a variety of sources, including un-

structured text from web pages, semi-structured XML

data, structured databases, and metadata sources such

as ontologies. Other applications exist which also

need to make used of heterogeneous data that is struc-

tured to varying degrees as well as related free text.

For example, in UK Road Trafﬁc Accident reports

data in a standard structured format is combined with

free text accounts in a formalised subset of English;

in crime investigation operational intelligence gath-

ering, textual observations are associated with struc-

tured data relating to people and places; and in Bioin-

fomatics structured databases such as SWISS-PROT

(Bairoch et al., 2000) include comment ﬁelds contain-

ing related unstructured information.

Data integration systems provide a single virtual

global schema over a collection of heterogeneous

data sources that facilitates global queries across the

sources (Lenzerini, 2002; Halevy, 2003). Such sys-

tems are able to integrate data occurring in a vari-

ety of structured and semi-structured formats but, to

our knowledge, they have not so far attempted to in-

clude unstructured text. In Information Extraction

(IE) systems, pre-deﬁned entities are extracted from

text and this data ﬁlls slots in a template using shal-

low NLP techniques (Appelt, 1999). Data integra-

tion and IE are therefore complementary technolo-

gies and we argue that a system that combines them

can provide a basis for applications that need to in-

tegrate information from text as well as structured

and semi-structured data sources. Our ESTEST sys-

tem integrates the schemas of structured data sources

with ontologies and other available semi-structured

data sources, creating a virtual global schema which

is then used as the template and a source of named

entities for a subsequent IE phase against the text

sources. Metadata from the data sources can be used

to assist the IE process by semi-automatically creat-

ing the required input to the IE modules. The tem-

plates ﬁlled by the IE process will result in a new

data source which can be integrated with the virtual

global schema. The resulting extended virtual global

database can subsequently used for answering global

queries which could not have been answered from the

structured and semi-structured data alone.

The rest of this paper is structured as follows: Sec-

tion 2 describes the architecture of ESTEST and use

of existing data integration and IE software. Section 3

give the design and implementation of our ESTEST

system, alongside an example that illustrates its us-

age. Section 4 presents results from initial experi-

Williams D. and Poulovassilis A. (2006).

COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM.

In Proceedings of the First International Conference on Software and Data Technologies, pages 13-21

DOI: 10.5220/0001315500130021

 SciTePress

ments in the Road Trafﬁc Accident application do-

main. Section 5 compares and contrasts our approach

with related work. Finally, in Section 6 we give our

conclusions and plans for future work.

2 BACKGROUND

The ESTEST Experimental Software to Extract Struc-

ture from Text system is implemented as a layer over

the AutoMed (AutoMed Project, 2006) data integra-

tion toolkit and the GATE (Cunningham et al., 2002)

IE framework, making use of their facilities. We now

brieﬂy describe GATE and AutoMed.

AutoMed. In data integration systems, several data

sources, each with an associated schema, are inte-

grated to form a single virtual database with an asso-

ciated global schema. Data sources may conform to

different data models and therefore need to be trans-

formed into a common data model as part of the in-

tegration process. AutoMed is able to support a vari-

ety of common data models by providing graph-based

metamodel, the Hypergraph Data Model (HDM). Au-

toMed provides facilities for specifying higher-level

modeling languages in terms of this HDM e.g. re-

lational, entity-relational, XML. These speciﬁcations

are stored in AutoMed’s Model Deﬁnitions Reposi-

tory (MDR). A generic wrapper for each data model

is provided, with specialisations for interacting with

speciﬁc databases or repositories e.g. a set of rela-

tional wrappers for interacting with the common re-

lational DBMS. The schemas of such data sources

can be extracted by the appropriate wrapper and are

stored in AutoMed’s Schemas & Transformations

Repository (STR). AutoMed provides a set of prim-

itive transformations that can be applied to schemas

e.g. to add, delete and rename schema constructs.

AutoMed schemas are therefore incrementally trans-

formed and integrated by sequences of such transfor-

mations (termed pathways).

Add and delete transformations are accompanied

by a query (expressed in a functional query lan-

guage, IQL (A.Poulovassilis, 2004)) which speciﬁes

the extent of the added or deleted construct in terms

of the rest of the constructs in the schema. These

queries can be used to translate queries or data along a

transformation pathway (McBrien and Poulovassilis,

2003). In particular, queries expressed in IQL can

be posed on a virtual integrated schema, are refor-

mulated by AutoMed’s Query Processor into relevant

sub-queries for each data source, and are sent to the

data source wrappers for evaluation. The wrappers in-

teract with the data sources for the evaluation of these

sub-queries, and with the Query Processor for post-

processing of sub-query results.

The queries supplied with primitive transforma-

tions also provide the necessary information for these

transformations to be automatically reversible, e.g. an

add transformation is reversed by a delete transforma-

tion with the same arguments. AutoMed is therefore

deﬁned as a both-as-view (BAV) data integration sys-

tem in (McBrien and Poulovassilis, 2003) which gives

an in-depth comparison of BAV with the other major

data integration approaches, Global-As-View (GAV)

and Local-As-View (LAV) (Lenzerini, 2002).

One of the main advantages of using AutoMed for

ESTEST rather than a GAV or LAV-based data in-

tegration system is that, unlike GAV and LAV sys-

tems, AutoMed readily supports the evolution of both

source and integrated schemas. This is because it al-

lows transformation pathways to be extended, so that

if a new data source is added, or if a data source

or integrated schema evolves, then the entire integra-

tion process does not have to be repeated and instead

the schemas and transformation pathways can be ‘re-

paired’.

As a pre-requisite for the development of ESTEST

we have made two extensions to the AutoMed toolkit

which were required for ESTEST but which are also

more generally applicable. Firstly, we have extended

AutoMed to include support for data models used to

represent ontologies. We have modeled RDF (Las-

sila and Swick, 1999) graphs and associated RDFS

(Brickley and Guha, 2004) schemas in the HDM.

A corresponding AutoMed wrapper for such data

sources has been implemented using the JENA API

(McBride, 2002). Although only RDF/S data sources

are currently supported, the use of JENA means that

additional ontology models can easily be added as

specialisations of the current wrapper, similarly to

the specialised AutoMed relational wrappers for spe-

ciﬁc RDMBS. Secondly, we have a developed the Au-

toMed HDM Store, a native HDM repository that is

used for storing the data that is extracted by ESTEST

from text sources.

GATE. The GATE system (Cunningham et al.,

2002) provides a framework for building IE applica-

tions. It includes a wide range of standard compo-

nents and also allows the integration of bespoke com-

ponents. Applications are assembled as pipelines of

components which are used to process collections of

documents. Applications can be built and run either

as standalone Java programs or through the GATE

GUI. A pattern matching language called JAPE (Cun-

ningham et al., 2000) is provided for constructing ap-

plication speciﬁc grammars. Some standard JAPE

grammars are provided with GATE e.g. for ﬁnding

names of people in text. The result of running an ap-

plication is a collection of annotations over a text. For

example in the string “RAN IN FRONT OF BUS”

an annotation might state that from the seventeenth to

nineteenth character there is a reference to a public

service vehicle.

ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES

3 THE ESTEST SYSTEM

The ESTEST system supports an incremental ap-

proach to integrating text with structured and semi-

structured data sources, whereby the user iterates

through a series of steps as new information sources

need to be integrated and new query requirements

arise. The ESTEST approach is described in

(Williams and Poulovassilis, 2004; Williams, 2005).

This paper describes an implementation of that ap-

proach, as well as giving some experimental results.

We use as our running example, a simple example

application based on Road Trafﬁc Accident reports.

In the UK, accidents are reported using a format

known as STATS-20 (UK Department for Transport,

1999). In STATS-20, a record exists for each accident,

following which there are one or more records for the

people and the vehicles involved in the accident. The

majority of the schema consists of coded entries, and

detailed guidance as to what circumstances each of

the codes should be used accompanies these. A tex-

tual description of the accident is also reported, ex-

pressed in a stylised form of English. An example

of the textual description collected for a speciﬁc ac-

cident might be “FOX RAN INTO ROAD CAUS-

ING V1 TO SWERVE VIOLENTLY AND LEAVE

ROAD”, where “V1” is short for “vehicle 1” and is

understood to be the vehicle which caused the acci-

dent. The schema of the structured part of the STATS-

20 data is well designed and there have been a number

of revisions during its several decades of use. How-

ever, there are still queries that cannot be answered

via this schema alone.

In our running example we suppose that an anal-

ysis of the road trafﬁc accidents caused by animals

is required, including the kind of animal causing the

obstruction.

Figure 1 illustrates the main components of the ES-

TEST system and each of these is described in turn in

sections 3.1 to 3.5 below.

3.1 Integrate Data Sources

The ESTEST Integrator component is conﬁgured with

the collection of data sources available to the user.

Each of these has an associated ESTEST Wrapper in-

stance (there is an ESTEST Wrapper for each data

model supported by AutoMed). The ESTEST Wrap-

per makes use of the corresponding AutoMed wrap-

per in order to construct a data source schema within

the AutoMed STR. However, the ESTEST Wrapper

is also able to extract additional metadata from the

data source, which is stored in the ESTEST Metadata

Repository — see Figure 1. The ESTEST wrapper

also transforms the AutoMed representation of a data

source schema into the ESTEST data model described

below.

Integrate Data

Sources

Create Data to

assist the IE

process

IE

Configuration

Data

Information

Extraction (IE)

Integrate Results

of IE

Global

Schema

Extracted

Data

Query Global

Schema

Enhance Schema

Information

Control Flow

Data Flow

ESTEST

Metadata

Repository

Figure 1: Overview of the ESTEST system.

The ﬁrst step for the Integrator is to iterate through

the data sources and use the associated wrapper to

construct an initial schema within the AutoMed STR.

In our example, two data sources are assumed to

be available: AccDB is a relational database holding

the relevant STATS-20 data and AccOnt is a user-

developed RDF/S ontology concerning the type of

obstructions which cause accidents. Figure 3 shows

the AccOnt RDFS schema and some associated RDF

triples. The AccDB database consists of three tables:

accident(acc_ref,road,road_type,

hazard_id, acc_desc)

vehicle(acc_ref,veh_no,veh_type)

carriageway_hazards(hazard_id,

hazard_desc)

In the accident table, each accident is uniquely iden-

tiﬁed by an acc

ref, the road attribute identiﬁes the

road the accident occurred on, and road

type indi-

cates the type of road. The hazard

id contains the

carriageway hazards code and this is a foreign key to

the carriageway hazards table. We assume that the

multiple lines of the text description of the accident

have been concatenated into the acc desc column.

There may be zero, one or more vehicles associated

with an accident and information about each them is

held in a row of the vehicle table. Here veh

reg

uniquely identiﬁes each vehicle involved in an acci-

dent and thus acc ref,veh no is the key of this table.

The Integrator calls on the wrappers to create a re-

lational schema for AccDB and RDF/S schema for

AccOnt.

The data sources are each now converted from their

model speciﬁc representation into the ESTEST data

model. The table below shows the constructs of the

ESTEST data model and their representation in the

HDM. We see that the ESTEST model provides con-

cepts which are used to represent anything that has

an extent i.e. instance data. Concepts are repre-

sented by HDM nodes e.g. hhfoxii, hhanimalii, and

are structured into an isA hierarchy e.g. hhisA, fox,

COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM

animalii. Concepts can have attributes which are

represented by a node and an unnamed edge in the

HDM e.g. the attribute hhanimal, number of legsii.

We note that in AutoMed’s IQL query language in-

stances of modeling constructs within a schema are

uniquely identiﬁed by their scheme, enclosed within

double chevrons, hh ... ii.

ESTEST Data Model

Construct HDM Representation

Concept:hhcii Node: hhcii

Attribute:hhc,aii Node: hhaii

Edge: hh ,c,aii

isA: hhisA,c1,c2ii Constraint hhc1ii ⊆ hhc2ii

The two data source schemas in our example

are automatically converted into their equivalent ES-

TEST representation when their ESTEST wrapper is

called by the Integrator. The representation of AccDB

is shown in Figure 2 and that of AccOnt in Figure 3.

Similarly, the Integrator now calls each ESTEST

wrapper to collect metadata from its data source. This

metadata is used to assist in ﬁnding correspondences

and for suggesting to the user which schema constucts

could be of use in the later IE step (see section 3.3 be-

low). As well as type information, Word Forms and

Abbreviations are collected as described below. This

metadata is stored in the ESTEST Metadata Reposi-

tory.

Word Forms are words or phrases which repre-

sent a concept. Word forms associated with con-

cepts are important in ESTEST because of their use

in the IE process. The ambiguity inherent in natu-

ral language means that word forms can be associ-

ated with many concepts. The, often limited, tex-

tual clues available in the data source schemas are

used by ESTEST to ﬁnd word forms. These clues

include schema object identiﬁers and comment fea-

tures such as the ‘remarks’ supported by the JDBC

database API. In database schemas, a number of in-

formal naming conventions are often seen, for exam-

ple naming database columns “acc

identiﬁer” or “ac-

cIdentiﬁer”. The Integrator recognises a number of

these conventions and breaks identiﬁers returned by

the wrappers into their component parts.

Abbreviations are often used in database schema

object identiﬁcation, and the Integrator has an ex-

tendable set of heuristics for parsing common naming

conventions e.g. a rule exists for identifying when an

abbreviation of the table name is preﬁxed onto the col-

umn names in a relational table e.g. the abbreviation

“acc” in the column “acc ref” of the “accident” table

used in the example. The user can also enter abbre-

viations commonly used in the application domain.

Abbreviations are combined with full word forms to

generate possible combinations.

There are a number of alternative sources for ex-

panding the word forms for a concept when required:

manually entered, from database description meta-

data, from lower down the concept isA hierarchy, or

from the WordNet (Fellbaum, 1998) natural language

ontology. The Integrator is able to use these to expand

the number of word forms for a concept as required.

A conﬁdence measure is assigned to the word form

depending on its source.

Type metadata is also extracted from the data

sources and is used to suggest which schema con-

structs might be text to be processed by the IE compo-

nent or used as sources of named entities. Named en-

tity recognition is one of the core tasks in information

recognition and involves looking up ﬂat-ﬁle lists con-

taining the known instances of some type. ESTEST

extends this by identifying concepts in the database

schema that are likely to be sources of named entity

types and uses either their extent or the word forms

associated with the concept for the ESTEST named

entity recognition component.

In our running example, abbreviations are sug-

gested e.g. “acc” for “accident”, “veh” for “vehi-

cle”. The Integrator generates alternative word forms

for the schema names and presents them to the user

e.g. for the schema construct hhaccident, acc

refii

the associated word forms are “accident reference”,

“accident” and “reference”. The user is now asked

if they wish to expand the current word forms of

each concept. Suppose that for now only the concept

hhanimalii is selected to be expanded and solely from

the schema. In this case the word forms “cat” and

“fox” are added to hhanimalii from the isA hierarchy.

In order to complete the integration and develop

the global schema, the Integrator next attempts to

ﬁnd correspondences between concepts in different

source schemas using the gathered metadata. The

user can accept or amend the list of suggested equiva-

lent schema objects. The corresponding schema con-

stucts are now renamed to have the same name. Using

the facilities provided by AutoMed, each of the ES-

TEST model schemas is incrementally transformed

until they each contain all the schema constucts of all

the source schemas — these are the union schemas

shown in Figure 5. An arbitrary one of these schemas

is designated ﬁnally as the global schema. In our sim-

ple example just “accident” from the domain ontology

and “accident” from the schema is suggested as a cor-

respondence by the Integrator and is accepted by the

user. No additional manual correspondences are sup-

plied and the initial global schema shown in Figure 4

is created. Finally, as the later IE process will require

a repository to store the extracted information, an ad-

ditional data source is created for this purpose by the

Integrator. The schema of this new data source is the

HDM representation of the ESTEST global schema

just derived. This new data source is integrated into

the global schema in the same way as the other data

sources.

ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES

vehicle

acc_ref

veh_no

attribute

veh_reg

veh_type

attribute

accident

road_type

attribute

road

year

attribute

carriageway

hazards

hazard_id

hazard_desc

attribute

acc_desc

attribute

Figure 2: The AccDB schema represented in the ESTEST model.

resource

obstruction

inanimate

spillage

tree

animal

subClass

fox

cat

bricks

about

RDFS Ontology

Schema

RDF Triples

accident subClass

cause

domain

range

propertyclass

Key

:

resource

obstruction

inanimate

spillage

tree

animal

isA

fox

cat

bricks

isA

RDFS Ontology ESTEST Model

Representation

accident

attribute

isA

Figure 3: AccOnt Ontology and its representation in the ESTEST model.

vehicle

acc_ref

veh_no

attribute

veh_reg

veh_type

attribute

road_type

attribute

road

year

attribute

carriageway

hazards

hazard_id

hazard_desc

attribute

acc_desc

attribute

resource

obstruction

inanimate

spillage

tree

animal

isA

fox

cat

bricks

isA

accident

isA

attribute

Figure 4: Initial ESTEST global schema.

For our example, the resulting network of trans-

formed and integrated schemas is shown in Figure 5

— for each data source there is now an AutoMed

schema and correspondences between schema con-

structs in the data sources have been identiﬁed. There

is a pathway to an equivalent ESTEST model schema

and then on to a union schema. Any one of these

union schema can be used as the global schema for

COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM

AccDB

Data Source

(Relational)

AccOnt

Data Source

(RDF &RDFS)

HDM Store

Acc

(HDM)

Relational

Wrapper

Ontology

Wrapper

HDM

Wrapper

Relational

Schema

Ontology

Schema

HDM

Schema

ESTEST Model

Relational Schema

ESTEST Model

Ontology

Schema

ESTEST Model

HDM Schema

Union

Schema

Union

Schema

Union

Schema

Global

Schema

id

correspondences

Figure 5: AutoMed Schema Network for the Road Trafﬁc Accident Example.

the entire network.

3.2 Create Metadata to Assist the IE

Process

The ESTEST Conﬁguration component now makes

use of the global schema and collected metadata to

suggest basic information extraction rules, such as

macros for named entity recognition, and to create

templates to be ﬁlled from concepts in the schema

which have attributes of unknown value. In IE sys-

tems, templates are ﬁlled by annotations over text (the

template slots). ESTEST extends this notion of tem-

plates in a number of ways. Firstly, some slots in a

template are pre-ﬁlled from the structured part of the

source data and these values are used to attempt to dis-

ambiguate when multiple annotations are suggested

over the same text. Secondly, ESTEST makes use

of type information contained in the metadata to vali-

date suggested annotations. Finally, ESTEST extends

the idea of the relationship that exists between annota-

tions. Normally this only goes as far as linking slots to

templates. There is Annotation Schema idea in GATE

but this is restricted to deﬁning the annotation features

which are allowed and is used to drive the manual en-

try of annotations. In our system, we link annotation

types to concepts in the global schema. In this way

the structural relationships between annotation types

can be used and extracted annotations can be linked to

related instances of other concepts in the source data.

The user conﬁrms or amends the automatically pro-

duced conﬁguration. In our example, the user se-

lects just hhanimalii from the list of suggested named

entity sources. the Conﬁguration component identi-

ﬁes the hhaccidentii concept as a template. As the

hhobstructionii concept has no extent, the IE step will

attempt to ﬁnd values for this. A macro for hhanimalii

is created:

Macro: ANIMAL

({Lookup.minorType == animal })

and a stub JAPE rule for hhobstructionii. The user

enhances this stub as follows:

Rule: OBSTRUCTION1

(({Token.string == "RUNS"} |

{Token.string == "WALKS"}|

{Token.string == "JUMPS"})

(SPACE)?

({Token.string == "INTO"} |

{Token.string == "ONTO"} |

{Token.string == "IN FRONT OF"} )

(ANIMAL) ) :obstruction -->

:obstruction.obstruction =

{kind = "Obstruction",

rule = "OBSTRUCTION1"}

3.3 Information Extraction

Component

Now the GATE-based ESTEST IE component is run.

The following standard components GATE are con-

ﬁgured and assembled into a pipeline: Document Re-

set which ensures the document is reset to its origi-

nal state for reruns; the English Tokeniser splits text

into tokens such as strings and punctuation; Sen-

tence Splitter divides text into sentences; the JAPE

Processor takes the grammar rules developed above

and applies them to the text. Also used is our be-

spoke Schema Gazetteer component which extends

the gazetteers used in the Named Entity recognition

core IE task. The annotation type is linked to a con-

cept in the global schema and instances are found ei-

ther by presenting a query to the global schema for

the extent concept across the data sources, or the word

forms previously associated with the concept are ob-

tained.

To illustrate, suppose in our example the AccDB

ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES

database contains just three accidents with the follow-

ing descriptions:

AccDB Accident Descriptions

Acc Ref Description

A001234 FOX RUNS INTO ROAD CAUSING V1

TO SWERVE VIOLENTLY

AND LEAVE ROAD OFFSIDE

A005678 UPPERTON ROAD LEICESTER

JUNCTION SYKEFIELD AVENUE

V1 TRAV SYKEFIELD AVE FAILS

TO STOP AT XRDS AND

HITS V2 TRAV UPPERTON RD V2

THEN HITS V3 PKD ON OS

OF UPPERTON RD

A009012 ESCAPED KANGAROO JUMPS IN

FRONT OF V1

When the IE component has run the following anno-

tations are found:

Annotations

Annotation Start End Literal

ANIMAL 1 3 FOX

OBSTRUCTION 1 13 FOX RUNS

INTO ROAD

3.4 Integrate Results of IE

The data extracted from the text is now stored in

the HDM repository. In our example an instance

id is generated “#1” and new instances of HDM

nodes hhfoxii, hhanimalii and hhobstructionii, are

added with value [#1]. Edges hhisA,fox,animalii and

hhisA,animal,obstructionii have the value of the pair

{#1,#1}. Edge hhattribute,accident,obstructionii

has value {A001234,#1}. Queries on the

global schema now include the new fact that a fox

caused accident A001234.

3.5 Remaining ESTEST Phases

The user can now pose queries to the global schema

the results of which will include the new data ex-

tracted from the text. The global schema may subse-

quently be extended by new data sources being added,

or new schema constucts identiﬁed and added to it.

The user may also choose to expand the number of

word forms associated with schema concepts. Fol-

lowing any such changes, the process is then repeated

and new data extracted from the text.

In our example, suppose the user suspects that the

results may not be complete and expands the word

forms from WordNet for hhanimalii. The Integrator

suggests a mapping between this schema object and

a WordNet synset (a set of word forms with the same

meaning) which the user conﬁrms. Word forms are

then obtained by descending the WordNet hypernym

relations. The list obtained includes “kangaroo” and

as a result re-running the Conﬁguration and IE com-

ponents now produces similar additional annotations

for the third accident report, and an additional fact in

the HDM store of KANGAROO as an obstruction for

accident A009012.

4 EXPERIMENTS WITH ROAD

TRAFFIC ACCIDENT DATA

There are a number of variables which affect the per-

formance of the ESTEST system including: the avail-

ability of structured data sources relating to the text,

the degree of similarity between the text instances, the

amount of effort spent by the user conﬁguring the sys-

tem to the speciﬁc application domain, and the do-

main expertise of the user. In order to provide some

initial conﬁrmation of the potential of our approach

we have experimented with a data set of Road Trafﬁc

Accident reports consisting of 1658 accident reports.

These were six-months worth of reports from one of

Britain’s 50 police forces. We identiﬁed ﬁve queries

of varying complexity which cannot be answered by

the STATS-20 structured data alone. These queries

are shown in the table below.

RTA Queries

Q1 Which accidents involved red trafﬁc lights?

Q2 How many accidents took place 30-50m

of a junction?

Q3 How many accidents involve

drunk pedestrians?

Q4 Which accidents were caused by animals?

Q5 How many resulted in a collision

with a lamppost?

The ESTEST system was conﬁgured using a ran-

domly chosen set of 300 reports.

We then ran the system over the remaining 1358

unseen reports and compared the results to a subse-

quent manual examination of each report. The ta-

ble below shows these results in terms of the actual

number of relevant reports, the recall (the number of

correctly identiﬁed reports as a percentage of all the

correct reports) and the precision (the number of cor-

rectly identiﬁed reports as a percentage of all the iden-

tiﬁed reports).

Results of RTA Query Experiments

Query Relevant Recall Precision

Reports

Q1 25 84% 95%

Q2 89 99% 99%

Q3 9 78% 100%

Q4 14 86% 86%

Q5 15 88% 93%

COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM

Where appropriate the queries combined structured

and text results, for example STATS-20 structured

data includes a ﬂag used to indicate whether the ac-

cident took place within 20 meters of a junction but

does not reveal the distance otherwise. In the query

”How many accidents took place 30-50m of a junc-

tion?” accidents with the 20 meters ﬂag set were dis-

carded and only the remaining reports with relevant

IE results considered. Conﬁguring the system took

5 hours to develop a domain ontology and enhanc-

ing the generated stub rules for the IE process. These

results are promising even with the short time spent

conﬁguring the system by a non-domain expert user.

5 RELATED WORK

An alternative to rule-based IE is text mining, in

which some NLP process creates a structured data set

from the text and then this is mined in order to dis-

cover patterns in the structured data (A.H.Tan, 1999).

Our ESTEST system, in contrast, is driven by speciﬁc

new querying requirements and the potential of mak-

ing use of new data sources. None the less, a text min-

ing extension might be a useful addition to ESTEST

for some application domains e.g. the SWISS-PROT

database, while (U.Y. Nahm, 2000) has shown the po-

tential for combining IE and Text Mining in general.

Recent developments in IE have included moves

to support the challenges of knowledge management

(Cunningham et al., 2005) and applications for the se-

mantic web e.g. support for ontologies has recently

been added to GATE (Bontcheva et al., 2004). A

common starting point for recent research is the limi-

tations of the traditional core Named Entity Recogni-

tion task in IE where annotations are assigned a type

from a list of types names, and recent work is moving

towards the idea of Semantic Annotation (Kiryakov

et al., 2003) where the annotation is linked to a con-

cept in an ontology. This is similar to the ESTEST

approach of linking annotations to a concept in a vir-

tual global schema.

The developers of the Knowledge and Information

Management (KIM) platform (Popov et al., 2004) be-

lieve that a lightweight ontology that provides struc-

ture but few axioms is sufﬁcient for the IE task. Our

ESTEST data model can similarly be thought of as a

lightweight ontology providing a taxonomy and the

facility for concepts to have attributes. KIM is based

on an ontology of ‘everything’ (KIMO) pre-deﬁned

to include concepts and entities from the common IE

tasks; annotations that are found by the IE system are

treated separately from the knowledge in the ontol-

ogy. In contrast, ESTEST develops the global schema

from available structured data sources speciﬁc to the

application, and seeks to expand the instance data

and schema incrementally by adding to the previously

known data and schema. Also, ESTEST makes use of

already known instances of structured data which re-

late to the speciﬁc text being processed in order to

assist in ﬁnding slot values, and it uses other schema

constucts and word forms obtained from metadata as

sources for named entities.

Finally, previous work on making use of the text in

UK Road Trafﬁc Accident reports used expert knowl-

edge of the sub-language in the reports to code de-

scription logic-based grammar rules (Wu and Hey-

decker, 1998). ESTEST makes use of the structured

data as well as the text to answer queries and the ap-

proach is more generally applicable to other applica-

tion domains as it does not depend on a speciﬁc re-

stricted sub-language.

6 CONCLUSIONS AND FUTURE

WORK

We have described the ESTEST system, which com-

bines techniques from Data Integration and Informa-

tion Extraction in order to make integrated use of het-

erogeneous data that is structured to varying degrees

as well as related free text. ESTEST does this by ex-

tracting metadata from data sources to support their

integration into a virtual global schema expressed in

the ESTEST data model. This schema and its associ-

ated metadata are used to semi-automatically conﬁg-

ure an IE process. The newly extracted information

from the text is merged into the virtual global database

which can then be used to answer new queries that

could not have been answered before. The user can

extend the global schema, add new data sources, en-

hance the IE conﬁguration, and rerun as required.

This approach is novel in a number of ways. First,

to our knowledge this is the ﬁrst time that a Data Inte-

gration system has been extended to include support

for free text. Second, ESTEST uses schema integra-

tion techniques to integrate available structured data

into a global schema which is used as a light-weight

ontology for semantic annotation — this is a realis-

tic application-speciﬁc alternative to building ontolo-

gies from scratch. Third, ESTEST uses the global

schema to semi-automatically conﬁgure the IE pro-

cess, thereby reducing the conﬁguration overhead of

the IE process.

We have given a simple example illustrating the op-

eration of ESTEST on Road Trafﬁc Accident reports.

Initial experimental results in the same domain indi-

cate the approach is able to increase the utility of text

stored alongside structured data and that the system’s

performance is at least comparable with standalone IE

systems.

Based on this preliminary evaluation we have iden-

ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES

tiﬁed three areas for further enhancement of ESTEST:

developing its schema matching component by using

IE on the textual schema metadata; exploring identi-

ﬁer disambiguation techniques to assist with the co-

referencing task in IE; and improving the support

for enhancing the global schema with new structure

found in the text.

To investigate how generally applicable our ap-

proach is, we will then evaluate the enhanced ES-

TEST system in the crime investigation and seman-

tic web application domains. In this evaluation, we

will consider the results obtained by ESTEST users

who are experts in the given application domain.

These results will be compared to both any existing

approaches (such as manual inspection or keyword

search) employed by these expert users and to results

obtained by them using a stand-alone IE system.

Finally, in order to support the end user, we will

analyse the requirements of a workbench for access-

ing the ESTEST components and functionality via a

graphical user interface.

REFERENCES

A.H.Tan (1999). Text mining: The state of the art and

the challanges. Proc. of the PAKDD 1999 Workshop

on Knowledge Discovery from Advanced Databases,

pages 65–70.

A.Poulovassilis (2004). A tutorial on the IQL query lan-

guage. Technical report, AutoMed Project.

Appelt, D. (1999). An introduction to Information Ex-

traction. Artiﬁcial Intelligence Communications,

12(3):161–172.

AutoMed Project (2006).

http://www.doc.ic.ac.uk/automed/.

Bairoch, A., Boeckmann, B., Ferro, S., and Gasteiger, E.

(2000). Swiss-Prot: Juggling between evolution and

stability. Brief. Bioinform., 5:39–55.

Bontcheva, K., Tablan, V., Maynard, D., and Cunningham,

H. (2004). Evolving GATE to Meet New Challenges

in Language Engineering. Natural Language Engi-

neering, 10:349—373.

Brickley, D. and Guha, R. (2004). RDF vocabulary descrip-

tion language 1.0: RDF schema. W3C Recommenda-

tion. http://www.w3.org/TR/rdf-schema/.

Cunningham, H., Bontcheva, K., and Li, Y. (2005). Knowl-

edge Management and Human Language: Crossing

the Chasm. Journal of Knowledge Management,

9(5):108–131.

Cunningham, H., Maynard, D., Bontcheva, K., and Tablan,

V. (2002). GATE: A framework and graphical devel-

opment environment for robust NLP tools and appli-

cations. In Proc. of the 40th Anniversary Meeting of

the Association for Computational Linguistics.

Cunningham, H., Maynard, D., and Tablan, V. (2000).

JAPE: a Java Annotation Patterns Engine (Second

Edition). Research memorandum, University of

Shefﬁeld.

Fellbaum, C. (1998). WordNet an electronic lexical

database.

Halevy, A. (2003). Data Integration: A Status Report. In

Weikum, G., Sch

oning, H., and Rahm, E., editors,

BTW, volume 26 of LNI, pages 24–29. GI.

Kiryakov, A., Popov, B., Ognyanoff, D., Manov, D., Kir-

ilov, A., and Goranov, M. (2003). Semantic Annota-

tion, Indexing, and Retrieval. In 2nd International Se-

mantic Web Conference (ISWC2003), pages 484–499.

Lassila, O. and Swick, R. (1999). Resource description

framework (RDF) model and syntax speciﬁcation.

W3C Recommendation. http://www.w3.org/TR/REC-

rdf-syntax/.

Lenzerini, M. (2002). Data Integration: A Theorectical Per-

spective. In Proc. PODS02, pages 247–258.

McBride, B. (2002). Jena: A semantic web toolkit. IEEE

Internet Computing, 6(6):55–59.

McBrien, P. and A.Poulovassilis (2003). Deﬁning peer-

to-peer data integration using both as view rules. In

Proc. Workshop on Databases, Information Systems

and Peer-to-Peer Computing (at VLDB’03), Berlin.

McBrien, P. and Poulovassilis, A. (2003). Data integra-

tion by bi-directional schema transformation rules. In

Proc. ICDE’03, pages 227–238.

Popov, B., Kiryakov, A., Ognyanoff, D., Manov, D., and

Kirilov, A. (2004). KIM - a semantic platform for

information extraction and retrieval. Nat. Lang. Eng.,

10(3-4):375–392.

UK Department for Transport (1999). Stats20: Instruc-

tions for the completion of road accident report form.

http://www.dft.gov.uk.

U.Y. Nahm, R. M. (2000). Using Information Extraction to

aid the discovery of prediction rules from text. Proc.

of the KDD-2000 Workshop on text Mining, pages 51–

58.

Williams, D. (2005). Combining data integration and in-

formation extraction techniques. In Proc. Workshop

on Data Mining and Knowledge Discovery, at BN-

COD’05, pages 96–101.

Williams, D. and Poulovassilis, A. (2004). An example

of the ESTEST approach to combining unstructured

text and structured data. In Proc. of the Database and

Expert Systems Applications (DEXA’04), pages 191–

195. IEEE Computer Society.

Wu, J. and Heydecker, B. (1998). Natural language under-

standing in road accident data analysis. Advances in

Engineering Software, 29:599–610.

COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM