
The last category (avoiding external redundan-
cies given by schema information) includes such ap-
proaches as XCQ (Ng et al., 2006) and DTD sub-
traction (Böttcher, Steinmetz, and Klein, 2007). 
They both separate the structural information from 
the textual information and then subtract the given 
schema information from the structural information. 
Instead of a complete XML structure stream or tree, 
they only generate and output information not al-
ready contained in the schema information (e.g., the 
chosen alternative for a choice-operator or the num-
ber of repetitions for a *-operator within the DTD). 
Both approaches, XCQ and DTD subtraction, are 
queriable and applicable to XML streams, but they 
can only be used if schema information is available. 
XQzip (Cheng and Ng, 2004) and the approach 
presented in (Buneman, Grohe, and Koch, 2003) be-
long to the second category (avoiding structural re-
dundancies). They compress the data structure of an 
XML document bottom-up by combining identical 
sub-trees. Afterwards, the data nodes are attached to 
the leaf nodes, i.e., one leaf node may point to sev-
eral data nodes. The data is compressed by an arbi-
trary compression approach. These approaches allow 
querying compressed data, but they are not directly 
applicable to infinite data streams.  
An extension of (Buneman, Grohe, and Koch, 
2003) and (Cheng and Ng, 2004)  is the BPLEX al-
gorithm (Busatti, Lohrey, and Maneth, 2005). This 
approach does not only combine identical sub-trees, 
but recognizes patterns within the XML tree that 
may span several levels, and therefore allows a 
higher degree of compression. In comparison to 
BSBC, this approach does not explicitly define how 
to compress text constants and attribute values con-
tained in XML data and how to distinguish both in 
the compressed XML format. 
The first category (avoiding textual redundancies 
by tokenization) allows for a much faster compres-
sion approach than the second one, as only local data 
has to be considered in the compression as opposed 
to considering different sub-trees as in the second 
category. 
The XMill algorithm (Liefke and Suciu, 2000) is 
an example of the first category. It compresses the 
structural information separately from the data. Data 
is grouped according to its enclosing element and 
collected into several containers, and each container 
is compressed afterwards. The structure is com-
pressed, by assigning each tag name a unique and 
short ID. Each end-tag is encoded by the symbol ‘/’. 
This approach does not allow querying the com-
pressed data.  
XGrind (Tolani and Hartisa, 2002), XPRESS 
(Min, Park, and Chung, 2003) and XQueC (Arion et 
al., n.d.) are extensions of the XMill-approach. Each 
of these approaches compresses the tag information 
using dictionaries and Huffman-encoding (Hufman, 
1952) and replaces the end-tags by either a ‘/’-sym-
bol or by parentheses. All three approaches allow 
querying the compressed data, and, although not ex-
plicitly mentioned, they all seem to be applicable to 
data streams.  
Approaches (Bayardo et al., 2004), (Cheney, 
2001), and (Girardot and Sunderesan, 2000) are 
based on tokenization. (Cheney, 2001) replaces each 
attribute and element name by a token, where each 
token is defined the first time it is used. (Bayardo et 
al., 2004) and (Girardot and Sunderesan, 2000) use 
tokenization as well, but they enrich the data by ad-
ditional information that allows for a fast navigation 
(e.g., number of children, pointer to next-sibling, ex-
istence of content and attributes). All three of them 
use a reserved byte to encode the end-tag of an ele-
ment. They are all applicable to data streams and al-
low querying the compressed data. 
The approach in (Ferragina et al., 2006) does not 
belong to any of the three categories. It is based on 
Burrows-Wheeler Block-Sorting (Burrows and 
Wheeler, 1994), i.e., the XML data is rearranged in 
such a way that compression techniques such as gzip 
achieve higher compression ratios. This approach is 
not applicable to data streams, but allows querying 
the compressed data if it is enriched with additional 
index information. 
The approach in (Zhang, Kacholia, and Özsu, 
2004) is another succinct representation of XML. It 
does not separate the raw data structure that de-
scribes the document tree from the tokens represent-
ing the elements. Therefore, one byte is required to 
represent an end-tag, whereas our approach, BSBC, 
only needs one bit. Furthermore, our separation of 
structural data from element names does not only al-
low for a better compression as shown in the evalua-
tion; it also enables a more efficient evaluation of 
path queries because raw bit data can be compared 
more efficiently than tokens. A second difference is 
our use of inverted element lists instead of token-
dictionaries, which additionally increases the speed 
of path query evaluation significantly because the 
number of possible path hits can be reduced quite 
fast with a simple lookup within the inverted list.  
To the best of our knowledge, the separation of 
an XML stream into different compressed streams 
linked by a bit stream that is also used to evaluate 
path queries is unique to our compression technique, 
WEBIST 2008 - International Conference on Web Information Systems and Technologies
20