THE INDEX UPDATE PROBLEM FOR XML DATA IN XDBMS

Beda Christoph Hammerschmidt

, Martin Kempa

and Volker Linnemann

Institute of Information Systems

University of L

ubeck

Ratzeburger Allee 160, D-23538 L

ubeck, Germany

sd&m AG

software design & management

Carl-Wery-Str. 42, D-81739 M

unchen, Germany

Keywords:

XML, Databases, Indexing, Updates.

Abstract:

Database Management Systems are a major component of almost every information system. In relational

Database Management Systems (RDBMS) indexes are well known and essential for the performant execution

of frequent queries. For XML Database Management Systems (XDBMS) no index standards are established

yet; although they are required not less. An inevitable side effect of any index is that modiﬁcations of the

indexed data have to be reﬂected by the index structure itself. This leads to two problems: ﬁrst it has to be

determined whether a modifying operation affects an index or not. Second, if an index is affected, the index

has to be updated efﬁciently - best without rebuilding the whole index.

In recent years a lot of approaches were introduced for indexing XML data in an XDBMS. All approaches lack

more or less in the ﬁeld of updates. In this paper we give an algorithm that is based on ﬁnite automaton theory

and determines whether an XPath based database operation affects an index that is deﬁned universally upon

keys, qualiﬁers and a return value of an XPath expression. In addition, we give algorithms how we update our

KeyX indexes efﬁciently if they are affected by a modiﬁcation. The Index Update Problem is relevant for all

applications that use a secondary XML data representation (e.g. indexes, caches, XML replication/synchro-

nization services) where updates must be identiﬁed and realized.

1 INTRODUCTION

The Extensible Markup Language (XML) has be-

come the standard data format for exchanging infor-

mation in information systems. XML is an easy and

ﬂexible way to express semistructured data - inde-

pendent from platforms, programming languages and

operating systems. The increasing usage of XML

data demands the connection between XML tech-

nology and database management systems, because

the latter provide a fast, robust, and application in-

dependent way of storing and accessing data. In

recent years a multitude of native XML Database

Management Systems (XDBMS) were introduced by

commercial vendors (e.g. Tamino (Sch

oning, 2001),

Xindice (Apache, )) and research projects (e.g. Natix

(Fiebig et al., 2002), InfonyteDB (Infonyte GmbH,

2003), Timber (Jagadish et al., 2002)). Like in re-

lational DBMS indexes are adopted to decrease the

execution time of frequent queries and may increase

the performance of an XDBMS up to factor 100,000

for realistic scenarios compared to a query execution

without any index. In contrast to relational DBMS,

where indexes and index structures are well known

since decades, indexes in XDBMS are still an active

ﬁeld of research with no standards established yet. In

RDBMS only atomic values of speciﬁed columns are

reﬂected in an index; XML indexes have to cover both

the structure of the data and the values of elements.

A lot of approaches have been introduced in recent

years dealing with indexes for querying XML data.

The problem of updating an index is a minor focus

of most publications. If at all, the authors describe

how their data structure can be updated from a tech-

nical point of view. To the best of our knowledge,

the problem whether an XML index i is affected by a

modifying operation o is never faced before. We call

this problem the XML Index Update Problem (XIUP).

For expressing updates we have to select which

nodes (elements, attributes, or content) in the given

XML data have to be modiﬁed. XPath (World Wide

Web Consortium (W3C), ) is a selection language for

such a purpose. Due to space restriction we cannot

give an introduction to XPath. All examples in this

paper are given in the abbreviated syntax of XPath.

XPath itself cannot be used to modify the XML data.

Christoph Hammerschmidt B., Kempa M. and Linnemann V. (2005).

THE INDEX UPDATE PROBLEM FOR XML DATA IN XDBMS.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 27-34

DOI: 10.5220/0002524600270034

 SciTePress

Therefore a language like XUpdate (e.g. (XML:DB

XUpdate Working Group , )) is needed to express up-

date operations. XUpdate statements contain XPath

expressions to identify the nodes to be modiﬁed.

In this paper we assume that an index is selective;

i.e. it is deﬁned to accelerate a speciﬁc query and

not all queries in general. Index approaches that are

not selective (e.g. Strong DataGuides (Goldman and

Widom, 1997)) reﬂect the whole XML data and may

lead to exhaustive space consumption. Additionally,

every modifying database operation affects the

non-selective index leading to an update. Therefore,

the Index Update Problem for non-selective indexes

is trivial. More about the characteristics of selective

and non-selective index approaches can be found in

previous works (B.C.Hammerschmidt et al., 2004a;

B.C.Hammerschmidt et al., 2004b). We motivate the

XIUP by two examples that operate on data from the

DBLP computer science bibliography project (Ley,

2001). The full DBLP data consists of approximately

500,000 publications, mainly articles, inproceedings

and books. The root node is called dblp and is the

parent node of the different publications.

Example: The index i

is deﬁned to accelerate

XPath expressions of the shape of query q

= / dblp / book [ au thor = ’x ’ ]

Index i

indexes all book elements by the value of

their author child which is interpreted as a key for

this query. In our index approach KeyX all keys are

stored in a search tree offering logarithmic retrieval

time. Thus, if an authors name is given, we ﬁnd the

corresponding books efﬁciently.

The XUpdate operation o

deletes all books that are

written by the author Kempa.

= <xupdate:remove

s e l e c t =/ dblp / book [ author = ’Kempa ’ ] >

</ xupdate:remove>

Obviously one can see that the index i

is affected by

because after executing o

there is no book author

’Kempa’ anymore in the data. The key ’Kempa’ has

to be removed from the index to keep it consistent.

At ﬁrst glance, it seems easy to determine the

affection by comparing the contained XPath ex-

pressions which are equal in this example. But

because XPath expressions may contain more

complex navigational steps the decision can become

more difﬁcult; this is shown in the following example.

Example: Index i

indexes all child elements

of the dblp element which have a title child that

is used as key.

= / dblp / ∗ [ t i t l e = ’x ’ ]

The modifying operation o

deletes all children of

all article elements.

= <xupdate:remove

s e l e c t =/ dblp / a r t i c l e / ∗ />

</ xupdate:remove>

First, one can remark that the contained XPath ex-

pressions are not equal. Second, without any schema

information like a DTD or XMLSchema we do not

know if the dblp element is allowed to have an

article element and that the article element

may have a child named title. Due to the wild-

card operator (*) it is not sufﬁcient to perform a string

comparison of both XPath fragments. With one or

more descendant axis (//) this problem becomes even

more complex.

In this work we deﬁne the XML Index Update Prob-

lem (XIUP) formally and introduce an efﬁcient algo-

rithm solving the XIUP for a fragment of XPath con-

taining node tests, the child axis (/), the descendant

axis (//), wildcards(*) and qualiﬁers ([]) without the

NOT operator. This class is called XP

{[],∗,//}

. The

algorithm calculates the intersection of two XPath ex-

pressions and checks its emptiness. For the more gen-

eral XPath fragment XP

{[],∗,//,NOT }

containing the

operator NOT in qualiﬁers the XIUP is still decidable

but becomes NP complete leading to inefﬁcient al-

gorithms with exponential runtime in the worst case.

Further, we present algorithms for updating the key-

oriented XML index (KeyX).

The remainder of this paper is organized as follows:

In Section 2 we introduce and formalize an abstract

deﬁnition of an index structure for XML data. Sec-

tion 3 deﬁnes the XML Index Update Problem and re-

duces it to the Intersection Problem of XPath expres-

sions. Finite automata that calculate the emptiness of

the intersection are introduced in this section. In Sec-

tion 4 we present the implementation of updates in

KeyX. We survey related work in Section 5 and con-

clude the paper in Section 6.

2 INDEX AND UPDATE

DEFINITION

This paper focusses on selective indexes like the In-

dexFabrics (Cooper et al., 2001), APEX (Chung et al.,

2002) or our index approach called KeyX. Indexes

that are non-selective (e.g. Strong DataGuide (Gold-

man and Widom, 1997)) are indexing the whole XML

data. Therefore they are affected by every modifying

operation. This leads to a trivial XIUP. Non-selective

indexes perform best for read-only applications but

fail when changes occur frequently.

In order to deﬁne indexes we have to deﬁne XPath

based path expressions ﬁrst. In this work we distin-

guish between simple and general path expressions.

Deﬁnition 1: Simple Path Expression:

A simple path expression p

for a given alphabet Σ

of element and attribute names is deﬁned as follows:

::= p

| p

//p

| ∗ | n | .

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

with / denoting the child axis, // the descendant axis,

∗ an arbitrary element and n a speciﬁc element with

the label n ∈ Σ; . is the current node. The set of all

simple path expressions is denoted by P

. 

Based on the deﬁnition of simple path expressions

we deﬁne general path expressions which may

include key comparisons and qualiﬁers:

Deﬁnition 2: General Path Expression

A general path expression p for the XPath fragment

{[],∗,//}

is deﬁned as follows:

p ::= p/p | p//p | p[q]∗ | ∗ | n | .

q ::= p | p AND p | p OR p | n op v

r ::= = | < | > | <= | >=

v ::= string | int | ﬂoat

where string, int, and ﬂoat are string, integer and

ﬂoat literals. 

The path expressions of XP

{[],∗,//}

may have

structural qualiﬁers (sometimes called predicates)

and qualiﬁers with value comparisons. We call the

latter keys because the can be compared to keys in

relational indexes.

The grammar for the XPath fragment

{[],∗,//,NOT }

has an additionally NOT oper-

ator in the rules for q.

The XIUP is relevant for any index approach.

We focus on our general KeyX deﬁnition. An

adoption to other index approaches can be made

easily because an index is deﬁned universally upon

keys, qualiﬁers and one return value.

Example: For instance, the XPath query

= / dblp / ∗ [ author = ’Kempa ’ AND

year >2002][ is bn ]

selects all elements below dblp that match a given

author name and a range of year values. Addi-

tionally, the selected elements must have at least one

child named isbn. Because we have value compar-

isons for the author and year we call them keys.

The keys can be stored in an index’s data structure

optimized for fast key retrieval. Qualiﬁers may have

any value, only their existence is requested (here

isbn). Therefore, the value of qualiﬁers can be

ignored. The return value path is the path to the

elements that are selected by the XPath expression.

Deﬁnition 3: Index Declaration

Formally, a selective index i is deﬁned as a triple

i = (K, Q, v) where K is a list of absolute simple

path expressions ∈ P

, referring to the key nodes.

Q a list of absolute simple path expressions ∈ P

referring to the qualiﬁer nodes and v ∈ P

is a simple

path to the value nodes. 

The path expressions to the keys, qualiﬁers and

the return value are extracted from a given XPath

expression using the extraction function described

in (B.C.Hammerschmidt et al., 2004a).

Example: The following index declaration de-

ﬁnes an index suited for the query q

= (

K = [ / dblp / ∗ / author , dblp / book / year ] ,

Q = [ / dblp / ∗ / isbn ] ,

v = dblp /∗

)

In general, a selective index covers all queries with

a result set being a subset of the path expression

that deﬁnes the index. For instance, the index i

may also be used to answer the following query

= /dblp/book[author=’Beda’][isbn]

because the result set of q

is a subset of q

. Details

about containment (subset) of XPath expressions can

be found in (Miklau and Suciu, 2004).

Because relevant keys can be found in logarithmic

time the linear process of comparing each element

that is referenced by the key path is avoided.

Because one selective index covers only a limited

number of queries we need a set of indexes to sup-

port a real application. The problem of ﬁnding a good

set of indexes for a given set of querying and modi-

fying database operations is part of the previous pub-

lication (B.C.Hammerschmidt et al., 2004b) dealing

with this so called Index Selection Problem. Queries

that are not covered by any index must be evaluated

conventionally by the XPath engine of the underlying

XDBMS.

Whenever a modifying operation adds or removes

an element or the content value of an element that is

referenced by at least one of the paths deﬁning the in-

dex, an update of the index may be necessary to keep

index and original data consistent. For instance, if we

add a ﬁrst isbn node to a book having already an

author and a year the index i

has to add this

new book because it is now in the result set of q

3 THE XML INDEX UPDATE

PROBLEM

In this section we introduce the XML Index Update

Problem (XIUP) and reduce it to the Intersection

Problem of two XPath expressions.

An index i covers a query q if the nodes returned

by the index structure are the same as the nodes re-

turned by the database itself. Because the index is

deﬁned upon a return value and a set of keys and a

set of qualiﬁers, the index may be out-dated if one

of these nodes have changed. The key of an index

is a structural and a content property. Therefore we

have to update the index if a key appears, disappears

or its value changes. Qualiﬁers and the return values

THE INDEX UPDATE PROBLEM FOR XML DATA IN XDBMS

are only structural properties, the modiﬁcation of their

values does not touch the index.

The index is affected by a modifying operation

if the path expressions deﬁning the keys, qualiﬁers

or the return value selects at least one node that is

selected by the path expression of the modifying

operation. Formally, this means that the intersection

of the path expressions is not empty.

Deﬁnition 4: Affection

The affection of two XPath expressions p and p

′

∈

{[],∗,//}

is deﬁned as follows:

{ ∃ t ∈ T | p(t) ∩ p

′

(t) 6= ∅ }

with ∅ denoting the empty set. T denotes the set of all

XML data - any XML document that is well formed

is in T . p(t) with t ∈ T is the set of nodes of t

that is returned by the evaluation of p . We denote

Mod(p) ∈ T the set of all XML data where p returns

a non-empty result set. 

Informally, the affection asks if there can be an

arbitrary XML document so that p and p

′

share at

least one node in their result sets. Actually, we are

not interested in that document but only if one may

exist or not. We determine the existence of such a

document by the use of ﬁnite automaton theory. A ﬁ-

nite automaton is deﬁned by its states and transitions.

See (Hopcroft et al., 2001) for details about ﬁnite

automaton theory. We give a procedure to create

ﬁnite automaton for simple path expressions before

we present the algorithm which checks affection.

Deﬁnition 5: Automaton for Mod(p)

We build an automaton A accepting Mod(p) as

follows:

A is a tuple (Q,Σ, σ, q

, F ) with Q = NODES(p)

a set of states, Σ a ﬁnite alphabet consisting of all

element names. σ is a function Q × Σ → Q

deﬁning the set of transitions. q

is the initial state

whereas F is the set of ﬁnal states. For each node

x ∈ NODES(p) with a child y, A has a transition

(y;

′

) → x with

′

∈ Σ the label of the node x. If

the label of x is a wildcard (*) then any symbol ∈ Σ

is valid for the transition. For every descendant edge

e from node x to node y, A has a transition (y;

′

∗

′

)

→ y. These transitions lead to a nondeterministic

automaton. The terminal state is ROOT(p). 

Example: The simple XPath expression

= /a/ ∗ //c//d leads to an automaton A with

Q = {q

∗

} and σ = {t

... t

} with

= (q

;’d’) → q

= (q

;’*’) → q

= (q

;’c’) → q

= (q

;’*’) → q

= (q

;’*’) → q

∗

and ﬁnally

= (q

∗

;’a’) → q

The initial state is q

, q

is the one

ﬁnal state. Please note that the tran-

sitions are not deterministic (e.g. t

). The automaton for this example

is illustrated in the Figure on the right

hands side.

Because we regard the intersection of simple path

expressions with all nodes having one child at maxi-

mum we can use standard ﬁnite automaton theory to

calculate the existence of an intersection. The more

complex tree automaton theory usually applied for

processing XML is not required. The following al-

gorithm checks the affection of two path expressions:

1. create automaton A accepting Mod(p);

2. create automaton A

′

accepting Mod(p

′

);

3. create product autom. B acc. Mod(p) ∩ Mod(p

′

);

4. < q

× q

′

> = initial state of B;

5. < q

× q

′

> = ﬁnal state of B;

6. CLOS = transitive closure of < q

× q

′

7. if ( < q

× q

′

>∈ CLOS) return true;

else return false;

Figure 1: Pseudo code of the affection algorithm

In Step 1 the algorithm creates an automaton A ac-

cepting all XML data where p returns non-empty re-

sult sets . Analogue, Step 2 creates an automaton A

′

for path p

′

. In Step 3 the product automaton B of A

and A

′

is created. B accepts all XML data where both

path expressions p and p

′

evaluate to a non-empty re-

sult set. Informally, one can say that B simulates the

simultaneous execution of A and A

′

accepting if and

only if both automata are accepting.

If this is the case, at least one node of the XML data

is selected by both p and p

′

and the index deﬁned over

p is affected by p

′

and must therefore be updated.

In the last steps of the algorithm it is checked if there

is a path from the initial state to a ﬁnal state of B by

calculating the transitive closure of the initial state of

B. Please note that the emptiness of the intersection is

a property of the product automaton. It is determined

without processing any concrete XML input

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

Algorithm: Update a KeyX index for an Insert operation

Input: 1. KeyX index with declaration x = (K

, Q

, v

)

2. Insert operation with node i

and path i

Output: The KeyX index x is updated concerning i

Method: 1. if isAffected(k

− i

, i

′

) with i

′

∈ path(i

) and k

∈ K

then

/* the Insert operation may contain an indexed key value */

repairKeys(x, i

, i

)

2. if isAf fected(q

− i

, i

′

) with i

′

∈ path(i

) and q

∈ Q

then

/* the Insert operation may contain a qualiﬁer */

repairQualif ier(x, i

, i

)

3. if isAf fected(v

− i

, i

′

) with i

′

∈ path(i

) then

/* the Insert operation may contain an indexed return value */

repairV alues(x, i

, i

)

Figure 2: Update algorithm of KeyX for an Insert operation

Algorithm: Repair Keys after Insert operation

Pre: Insert operation may contain new indexed key value

Input: 1. KeyX index with declaration x = (K

, Q

, v

)

2. Insert operation with node i

and path i

Output: The keys of KeyX index x are updated concerning i

Method: Foreach father node f

∈ root(db).query(i

) to be inserted in do

1. Foreach key value node k

∈ i

.query(k

− f

) do // k

is a new partial key

(a) Foreach key value node K

∈ k

.query(K

− k

) do // K

is the key of k

i. add return value node to index with x.add(K

, r

) and r

∈ K

.query(v

− K

)

if r

∈ x.returnV alues(K

)

Figure 3: Repair algorithm for keys after an Insert operation

4 UPDATE ALGORITHM FOR

KeyX

This section introduces the update algorithm for the

KeyX data structure for the insert operation. The

algorithms for the replace and delete operations are

omitted here for space reasons and are deﬁned analo-

gously. Every replace operation can be expressed by

a delete operation followed by an insert operation.

The section is ﬁnished by presenting performance

measurements of our implementation for the KeyX

index system.

Algorithms updating the index structure

The algorithms presented in this section make use

of some further methods and operations introduced

ﬁrst. The operation

′

−

′

from the path interface can

be applied to two simple paths p

− p

. It returns the

path from a node with path p

to a node with path p

The operation is extended for a tuple of paths as ﬁrst

or second parameter. In this case it returns a tuple of

difference paths. The method path applied to a node

n returns a list of the paths to all sub-nodes of n.

Additionally a method is needed to navigate to a path

expression from a given context node. This is realized

by method query. With n.query(p) the path p is

evaluated concerning the context node n. The result

of this method is a list of nodes. Further, the method

add and returnV alues from the interface of the

KeyX implementation are used. With x.add(K

, r

)

a return node r

can be added to the key value entry

in index x. The method x.returnV alues(K

)

returns the return values of the key nodes K

taking

the qualiﬁers of the index declaration into account.

Generally all identiﬁer with subscript p represents the

path of a node, with subscript n the node itself and

with subscript c the content of the node.

Now the update algorithm for an KeyX index

concerning an insert operation can be presented in

Figure 2.

In a ﬁrst step the algorithm determines if a node

is inserted which relate to an indexed key value, in-

dexed qualiﬁer node, or to an indexed return value

node. If one of these cases is true the belonging key

value nodes are extracted from the database and in-

serted with the relevant return value nodes into the

index.

THE INDEX UPDATE PROBLEM FOR XML DATA IN XDBMS

Algorithm: Repair Values after Insert operation

Pre: Insert operation may contain new values

Input: 1. KeyX index with declaration x = (K

, Q

, v

)

2. Insert operation with node i

and path i

Output: KeyX index x is updated for new values concerning i

Method: Foreach father node f

∈ root(db).query(i

) to be inserted in do

1. Foreach return value node r

∈ i

.query(v

− f

) do // r

is a possible new return value

(a) Foreach key value node K

∈ r

.query(K

− v

) do // K

is the key of r

i. add return value node to index with x.add(K

, r

) if r

∈ x.returnV alues(K

)

Figure 4: Repair algorithm for values after an Insert operation

The algorithm is sound and complete for the XPath

fragment XP

{[],∗,//}

because excluding qualiﬁers

cannot be expressed in this class.

The more general class XP

{[],∗,//,NOT }

allows to

express qualiﬁers in p and p

′

that cannot be fulﬁlled

simultaneously. For instance, if p = /a[b] and

′

= /a[N OT (b)] it is obvious that there can be no

element a having a child b and no child b at the same

time. In this case, the algorithm is still complete

but not sound meaning that an index affection is

indicated where it is not the case. A redundant index

update is effected.

A sound and complete algorithm in general cannot be

as efﬁcient because the XIUP for the XPath fragment

{[],∗,//,NOT }

is NP hard (the NP-complete

problem 3SAT can easily be reduced to the XIUP

by attaching child nodes representing the clauses

from 3SAT to one root node). The proof of the

NP-completeness of the XIUP is beyond the scope of

this paper.

Measurements:

In order to judge our approach we measured the

performance of the algorithm in Figure 2. KeyX

and the algorithms of this paper are done in Java.

We performed all tests on a Pentium 4 with 2.66

Ghz and 1 Gb main memory. In Test 1 we increased

the number of keys in the index’s deﬁnition. As

the diagram in Figure 5 shows the execution time

increases linearly. When increasing the number of

keys and the length of the path expressions deﬁning

the index the execution time increases quadratically;

this is shown in the second graph.

Even for very complex indexes with more than 20

keys and path expressions of the length about 20 the

XIUP can be solved in less than 3 milliseconds. When

an index is affected the underlying index structure has

to be updated by inserting/removing a key or updating

the reference of the return value. Compared to a total

rebuild of an index which may take minutes for tra-

versing the whole XML data this approach is a major

improvement. For applications that frequently update

Figure 5: Measurements of the Index Affection algorithm

an XML database (e.g. orders in an online shop) the

approach offers a suitable index system.

5 RELATED WORK IN

INDEXING XML

In recent years a multitude of approaches for indexing

XML were introduced. One common approach to

index the structure of semistructured data are the so

called structural summaries. One early approach are

the Strong DataGuides (Goldman and Widom, 1997)

providing a general index structure to accelerate

path expressions starting at the root. References to

elements that are reached by the same path expression

are summarized in one node of the DataGuide. The

creation of a DataGuide can be compared to the trans-

formation of a nondeterministic ﬁnite state machine

to a deterministic one by the fusion of equivalent

states. DataGuides differ signiﬁcantly from indexes

in RDBMS: Instead of accelerating speciﬁc queries

very efﬁciently they try to improve the evaluation of

path expressions in general. In contrast to indexes in

RDBMS where indexes are selected by the database

administrator, structural summaries in XDBMS are

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

permanently enabled and thus not selective.

The idea of the DataGuide is extended in sev-

eral works. The problem of non-selective and large

DataGuides is faced in the work of Kaushik et

al (Kaushik et al., 2002) introducing a selective

structural summary.

The same problem motivates the Adaptive Path Index

(APEX) (Chung et al., 2002) from Chung et al. APEX

is a speciﬁc structural index optimized for frequent

queries ignoring the values of elements. Therefore, it

is less efﬁcient for queries containing a key.

Structural summaries are the main data structure or

part of further works, including the Forward-and-

Backward-Index (Abiteboul et al., 1999) the T-Index

(Milo and Suciu, 1999) (with the special cases

1-Index and 2-Index), the D(K)-Index (Chen et al.,

2003), HOPI (Schenkel et al., 2004) and others (Barg

and Wong, 2003; Halverson et al., 2003; Kaushik

et al., 2004; Weigel et al., 2003). Although some

of the work discuss how to update the index’s data

structure none of them explain how an affection is

determined.

Numbering schemes (e.g. (Grust, 2002; H. Jiang and

H. Lu and W. Wang and B. Ooi, 2003)) map each

element of the XML data to one or more numbers that

are mostly determined by post/preorder XML-tree

traversing algorithms. The numbers are used for a

faster retrieval of relationships between elements. For

some numbering schemes a change in the structure

means a recalculation of each assigned number -

this can cost as much as rebuilding the total index.

This problem is faced in the work (Kha et al., 2001)

proposing a numbering schema that is optimized for

updates leading to less number recalculation when

the XML data is modiﬁed. Again, none of these

paper faces the XIUP.

The Index Fabric (Cooper et al., 2001) is a bal-

anced tree structure storing the encoded paths from

the root to each node and its values. Therefore, the

Index Fabric support queries with keys and qualiﬁers.

The data structure of the Index Fabric - a Patricia

trie - can be updated easily like any other search

tree. But the question whether an index is affected

by a modifying operation is not mentioned by the

authors. Our index approach KeyX can be compared

with the index Fabric. Compared to the Index Fabric

KeyX supports more query types including multi-key

queries and range queries.

Approaches that uses an underlying relational

DBMS to store and index XML (e.g. (Bauer et al.,

2003)) have in common that they can reuse existing

and performant implementations of the relational

world. However, XML queries have to be mapped

to SQL queries leading to many expensive Join

operations if a multi-key query is executed. This

lead to performance degradation compared to native

XDBMS.

Recapitulating, a lot of XML index approaches

where introduced in the past. Early approaches

even have underlying data structures that cannot be

updated efﬁciently. Recent approaches use updatable

data structures but - to the best of our knowledge -

we are the ﬁrst that present an algorithm that detects

if an existing index must be updated in order to keep

it consistent.

6 CONCLUSION

In this paper we deﬁned the XML Index Update Prob-

lem (XIUP) and proposed an efﬁcient algorithm solv-

ing the XIUP for the XPath fragments of the class

{[],∗,//}

. We showed that the XIUP for more gen-

eral XPath expressions is NP complete but the algo-

rithm is still identiﬁes all required index updates. The

XIUP is relevant for all approaches that try to index

XML data. To the best of our knowledge we are the

ﬁrst who pay attention to the XIUP and introduced an

efﬁcient algorithm.

Future work will deal with efﬁcient algorithms

ﬁnding a (sub-)optimal sound-and-complete ap-

proximation for XPath fragments of the class

{[],∗,//,NOT }

. A major constraint is that any ap-

proximation has to be completed in order to identify

required index updates and to keep the index and the

data consistent.

ACKNOWLEDGMENTS

We would like to thank Konstantin Ens for imple-

menting the intersection automata and the update al-

gorithms for our KeyX index structure in the scope of

his student research project.

REFERENCES

Abiteboul, S., Suciu, D., and Buneman, P. (1999). Data on

the Web: From Relations to Semistructured Data and

XML. Morgan Kaufmann, USA, 1st edition edition.

Apache. Xindice. URL: http://xml.apache.org/

xindice/.

Barg, M. and Wong, R. (2003). A fast and versatile path

index for querying semistructured data. In Proceed-

ings of the 8th International Conference on Database

THE INDEX UPDATE PROBLEM FOR XML DATA IN XDBMS

Systems for Advanced Applications (DASFAA 2003),

Kyoto, Japan.

Bauer, M. G., Ramsak, F., and Bayer, R. (2003). Multidi-

mensional mapping and indexing of xml. In Proceed-

ings of the 10th BTW Conference: Datenbanksysteme

ur Business, Technologie und Web, volume 26 of LNI,

pages 305–323, Leipzig, Germany. GI.

B.C.Hammerschmidt, Kempa, M., and Linnemann, V.

(2004a). On the index selection problem applied to

key oriented xml indexes. Technical report, A-04-

09, Institute of Information Systems, University of

ubeck.

B.C.Hammerschmidt, Kempa, M., and Linnemann, V.

(2004b). A selective key-oriented xml index for the in-

dex selection problem in xdbms. In Proceedigns of the

15th International Conference on Database and Ex-

pert Systems Applications - DEXA ’04, volume 3180

of Lecture Notes in Computer Science, pages 273–

284, Zaragoza, Spain.

Chen, Q., Lim, A., and Ong, K. W. (2003). D(k)-index:

an adaptive structural summary for graph-structured

data. In Proceedings of the 2003 ACM SIGMOD con-

ference, San Diego, California, USA, pages 134 – 144,

San Diego, California, USA.

Chung, C., Min, J., and Shim, K. (2002). Apex: an adap-

tive path index for xml data. In Proceedings of the

2002 ACM SIGMOD Conference, Madison, Wiscon-

sin, USA, pages 121–132. ACM Press.

Cooper, B. F., Sample, N., Franklin, M. J., Hjaltason, G. R.,

and Shadmon, M. (2001). A fast index for semistruc-

tured data. In Proceedings of 27th International Con-

ference on Very Large Data Bases, Roma, Italy. Mor-

gan Kaufmann.

Fiebig, T., Helmer, S., Kanne, C.-C., Moerkotte, G., Neu-

mann, J., Schiele, R., and Westmann, T. (2002).

Anatomy of a native xml base management system.

VLDB Journal, 11(4).

Goldman, R. and Widom, J. (1997). Dataguides: Enabling

query formulation and optimization in semistructured

databases. In VLDB’97, Proceedings of 23rd Interna-

tional Conference on Very Large Data Bases, pages

436–445. Morgan Kaufmann.

Grust, T. (2002). Accelerating xpath location steps. In

Proceedings of the 2002 ACM SIGMOD Conference,

Madison, Wisconsin, USA, pages 109–120.

H. Jiang and H. Lu and W. Wang and B. Ooi (2003). XR-

Tree: Indexing XML Data for Efﬁcient Structural

Join. In Proceedings of the 19th International Con-

ference on Data Engineering (ICDE), pages 253–263,

Bangalore, India.

Halverson, A., Burger, J., Galanis, L., Kini, A., Krishna-

murthy, R., Rao, A. N., Tian, F., Viglas, S. D., Wang,

Y., Naughton, J. F., and DeWitt, D. J. (2003). Mixed

mode xml query processing. In VLDB’03, Proceed-

ings of 29rd International Conference on Very Large

Data Bases, pages 225–236, Berlin, Germany.

Hopcroft, J. E., Motwani, R., and Ullman, J. D. (2001). In-

troduction to Automata Theory, Languages, and Com-

putation. Addison Wesley Publishing Company.

Infonyte GmbH (2003). Infonyte DB. URL: http://

www.infonyte.com.

Jagadish, H., Al-Khalifa, S., Lakshmanan, L., Nierman, A.,

Paparizos, S., Patel, J., Srivastava, D., and Wu, Y.

(2002). Timber: A native xml database. Technical

report, University of Michigan, USA.

Kaushik, R., Bohannon, P., Naughton, J. F., and Korth,

H. F. (2002). Covering indexes for branching path

queries. In Proceedings of the ACM SIGMOD con-

ference, Madison, Wisconsin, USA.

Kaushik, R., Krishnamurthy, R., Naughton, J. F., and Ra-

makrishnan, R. (2004). On the integration of structure

indexes and inverted lists. In Proceedings of the ACM

SIGMOD conference, Paris, France, pages 779 – 790.

ACM Press.

Kha, D. D., Yoshikawa, M., and Uemura, S. (2001). An

XML indexing structure with relative region coordi-

nate. In 2001, I. C. S., editor, Proceedings of the 17th

International Conference on Data Engineering (ICDE

2001), pages 313–320, Heidelberg, Germany.

Ley, M. (2001). Digital Bibliography & Library Project.

URL: http://dblp.uni-trier.de. Computer

Science Bibliography.

Miklau, G. and Suciu, D. (2004). Containment and equiva-

lence for a fragment of xpath. J. ACM, 51(1):2–45.

Milo, T. and Suciu, D. (1999). Index structures for path ex-

pressions. In Proceedings of Database Theory - ICDT

’99, 7th International Conference, volume 1540 of

Lecture Notes in Computer Science, pages 277–295,

Jerusalem, Israel. Springer.

Schenkel, R., Theobald, A., and Weikum, G. (2004). Hopi:

An efﬁcient connection index for complex xml docu-

ment collections. In Proceedings of the 9th Interna-

tional Conference on Extending Database Technology

(EDBT), volume 2992 of Lecture Notes in Computer

Science, pages 237–255.

Sch

oning, H. (2001). Tamino - a dbms designed for xml. In

Proceedings of the 17th International Conference on

Data Engineering, pages 149–154, Heidelberg, Ger-

many. IEEE Computer Society.

Weigel, F., Meuss, H., Bry, F., and Schulz, K. U. (2003).

Content-Aware DataGuides for Indexing Large Col-

lections of XML Documents. Technical Report PMS-

FB-2003-14, Institute of Informatics, University of

Munich.

World Wide Web Consortium (W3C). XML Path Lan-

guage (XPath). URL: http://www.w3.org/TR/

xpath.

XML:DB XUpdate Working Group . XUpdate - XML

Update Language. URL: http://xmldb-org.

sourceforge.net/xupdate.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION