STORAGE TYPES IN THE SEMANTIC BINARY DATABASE

ENGINE

Naphtali Rishe, Malek Adjouadi, Maxim Chekmasov,

Dmitry Vasilevsky, Scott Graham, Dayanara Hernandez

Florida International University, Miami, FL

Ouri Wolfson

University of Illinois at Chicago, Chicago, IL,

Keywords: Semantic binary data model, stor

age type, database management system.

Abstract: Modern database engines support a wide variety of data types. Native support for all of the types is desirable

and convenient for the database application developer, as it allows application data to be stored in the

database without further conversion. However, support for each data type adds complexity to the database

engine code. To achieve a compromise between convenience and complexity, the semantic binary database

engine is designed to support only the binary data type in its kernel. Other data types are supported in the

user-level environment by add-on modules. This solution allows us to keep the database kernel small and

ensures the stability and robustness of the database engine as a whole. By providing extra database tools, it

also allows application designers to get database-wide support for additional data types.

1 INTRODUCTION

Conceptually, a semantic binary database is a set of

facts about objects, (Rishe, 1992). Objects belong to

categories. Relations are defined between categories,

and objects are connected by relations. Objects can

also have attributes, which are considered to be

relations from objects to values. The original

semantic database engine stores information about

schema and abstract objects as a set of facts at the

logical level. The following facts are stored:

• xC

. This is a fact that an abstract object x

belongs to a category C. If an object belongs to

several categories, one fact is stored for each of

these categories. While the object may belong to

any number of different categories that are not

disjoint, it may most commonly belong to

several subcategories of one category. If the

object belongs to the subcategory, it also

belongs to the corresponding category.

• xRy

. The object x is connected to the object y

by the relation R. Two objects can not be

connected twice by the same relation. They are

either connected or not.

• xRv

. The attribute R of object x has value v.

Object attributes are like relations, therefore

they may be multi-valued. However, the object

can not have one specific value of the attribute

two times.

• Cx

. This is an inverse (redundant) fact for xC. It

is used to query the objects that belong to a

certain category.

• yR

-1

x. This is an inverse (redundant) fact for

xRy. It is used to traverse relations in a reverse

order.

• R

-1

vx. This is an inverse (redundant) fact for

xRv. Inverse facts for attributes have a different

structure than the inverse facts for the relations

with abstract objects. They are used in queries

that search for all the objects with a certain

attribute value.

The facts are encoded as binary strings using

reve

rsible encoding. Every y, x, C, R, or R

-1

in the

above facts are encoded with object IDs. The object

ID is an integer; it is encoded with a variable length

encoding that maps the natural ordering of integers

into the lexicographical ordering of strings. This

encoding is compressed in the sense that small

values of object IDs result in short strings. The

encoding also allows the database to find the end of

437

Rishe N., Adjouadi M., Chekmasov M., Vasilevsky D., Graham S., Hernandez D. and Wolfson O. (2005).

STORAGE TYPES IN THE SEMANTIC BINARY DATABASE ENGINE.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 437-440

DOI: 10.5220/0002535804370440

 SciTePress

the encoded integer when it is used as a prefix of a

binary string. Every encoded string starts with an

object ID; we insert different separators to

distinguish between pairs and triples.

The only facts that have values other than object

IDs are xRv and R

-1

vx. The values are encoded in

such a way that natural ordering on the values is

mapped into the lexicographical order of encoded

strings. Encoding of xRv is a concatenation of three

encoded values and a separator. However, encoding

of R

-1

vx presents a problem. Simple concatenation

is no longer sufficient, since we need to be able to

find the end of the encoded v; this imposes a strict

restriction on encoding. Another separator is added

between v and x at the end of the fact.

The semantic binary database distinguishes

between two groups of data types – abstract data

types and concrete data types. Abstract data types

are categories of abstract objects, for example

Person, Professor, Student. Concrete data

types are Integer, String, Datetime among

others.

The hierarchy of data types shown in Figure 1 has

a single data type Binary at the root; only

Binary data type is supported by the engine kernel.

Other data types are supported in the user-level

environment by the add-on modules. This solution

allows us to keep the database kernel small and to

get database-wide support for additional data types.

Figure 1: Sub-schema of concrete types

2 FACT STORAGE

As we have mentioned, the original semantic binary

database engine stores the schema and content of the

database using facts. Every object is given a unique

object ID; the information about the object is

represented as a set of direct facts xC, xRy and xRv.

For every direct fact there is one redundant reverse

fact stored in the database. All facts are encoded into

binary strings which are stored in a B-Tree structure

in lexicographical order. We will call this storage

method fact storage.

With the fact storage method, every attribute of

an object is stored as one direct fact. The opposite is

also true, with every direct fact storing only one

attribute. This approach is inefficient in two ways:

space overhead and performance overhead.

With respect to space overhead, the direct fact

stores an object ID, a relation ID, and some data

related to the B-Tree structure in addition to its

attribute data. For example, an integer attribute of 4

bytes is stored in a fact with the object ID and the

relation ID. With the object ID requiring 4 bytes of

storage, the relation ID requiring 4 bytes, and the

storage infrastructure requiring 4 bytes, a total of 16

bytes is needed to store a single 4-byte integer.

The issue of space overhead was addressed in the

original semantic database design by introducing

two space saving techniques: special encoding of

integers and prefix compression of B-Tree blocks.

Unfortunately, the implementation of these

algorithms does not have acceptable performance.

Encoding of integers requires complex operations at

the bit level. Prefix compression of B-Tree blocks

makes binary searches within the block impossible.

Also, on average, only 4 bytes are saved for every

fact with the prefix compression, which is less than

the overhead introduced. Therefore this technique

does not eliminate space overhead, and significantly

decreases performance.

With respect to performance overhead, retrieval

of information from the fact storage requires several

time-consuming operations. The B-Tree has to be

searched for the correct range of facts. A linear

search of B-Tree blocks takes more time than a

binary search or direct record access. Complex

bitwise operations are needed to unpack relevant

information from the facts. If all the attributes for

one object are needed, unpacking of several facts is

required.

Another problem with fact storage was uncovered

by the standard TPC-D benchmark (TPC-D, 1998).

For some queries in TPC-D, it is necessary to scan

all the objects in a category and retrieve most of

their attributes. While only one elementary query is

required to find all objects in a category, retrieving

all the attributes of every object requires one

elementary query per object. This may result in as

many disk accesses as there are objects in the

category. However, if the objects belonging to one

category are stored closely together in one linear

file, many of them would fit into one disk block.

Scanning the set of objects would only require as

many disk accesses as there are blocks. These

inefficiencies prompted us to search for a better

representation of the data.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

438

3 RECORD STORAGE

Comparing the semantic binary database model and

the relational database model, we can state that a

category roughly corresponds to a table, an object

corresponds to a table row and an attribute

corresponds to a column. Relational databases do

not have a large overhead for infrastructure when

storing data. Typically, all attributes of an object are

stored in a fixed-size record in a positional form. A

fixed amount of space is reserved for the attribute,

requiring no overhead for object IDs or

infrastructure at this level. Several records are

grouped in pages and pages are stored in a linear

file.

The relational databases offer a different way of

storing information, which is called a clustered

index (England, 2001). In a clustered index, a

combination of attributes is designated as a key of a

clustered index. All other attributes are organized in

a record. The record is stored in a B-Tree along with

the key. Since a fixed amount of space is allocated

for every column (attribute) in a row (object), m:m

attributes do not fit into the picture, therefore they

are not directly supported by the relational database

model. Also, if the attribute has a value of NULL, the

record would still have space reserved for it. If data

is sparse and many NULLs are present, this form of

storage becomes space-inefficient.

We propose to adapt clustered index storage for

semantic binary database storage in certain cases to

improve space allocation and performance. Instead

of storing every object attribute in a separate fact,

one fact can be created to store some of the object’s

attributes. Not all the attributes can be stored in a

record. The following conditions have to be met for

the attribute to be qualified for record storage:

1. The attribute should be total. In other words it

can not be NULL, so no special handling of

NULLs is necessary.

2. Cardinality of the attribute should be 1:1 or

m:1. There is at most one attribute for every

object in the category, so no special handling is

required for multi-value attributes. In

combination with condition (1), this implies that

only one value of the attribute is present for

each object in the category.

3. The value of the attribute should have fixed

length and the length is the same for all the

objects; this allows us to allocate appropriate

space for the attribute in a record.

The attributes concatenated together result in a

record that is stored for all the objects in the

category and that has a fixed length equal to the sum

of lengths of all concatenated attributes. The record

v can be placed in one fact xR’v where R’ is a

special system relation introduced for the purpose of

storing the record. We will call this storage method

record storage. Figure 2 illustrates fact and record

storage.

A semantic binary database engine based on the

record storage was implemented and analyzed. It

achieved substantial performance improvements

over the original semantic engine on certain types of

workloads. The typical workload that benefits from

this representation is the TPC-D benchmark.

For some queries in TPC-D, it is necessary to

scan all the objects in a category and retrieve most

of their attributes. Records for the objects belonging

to the same category were placed in a separate file.

To answer the query, a full scan on the entire file

was performed and every record was retrieved. This

required only as many disk accesses as needed for

the size of the result set. In addition, no complex

operations were required to extract information from

the records. Once the record is in memory, several

CPU instructions are enough to retrieve an attribute.

Figure 2: Fact and record storage in semantic binary

engine

4 BITMAP STORAGE

Modern database management systems use bitmaps

as another way of storing information. Bitmap

indexes are available in many commercial systems,

such as IBM DB2 (Winter, 1999) and Oracle

(Jakobsson, 1997). Bitmaps offer the advantage of

saving storage space and improving disk retrieval

speed by requiring only one bit to store a Boolean

value. In the semantic database engine, bitmaps can

be used for storing reverse-fact information on

Boolean, enumeration, and small integer attributes

efficiently. An example of such a case is

representing category membership for objects.

A bitmap is an array of bits, where one bit

corresponds to one database object. Object ID is

used as an index value in a bitmap array. Eight

consecutive bits in a bitmap are stored as one byte.

STORAGE TYPES IN THE SEMANTIC BINARY DATABASE ENGINE

439

Consecutive bytes are stored in blocks. Typical

block sizes vary from 4 kilobytes to 64 kilobytes.

Blocks can be stored sequentially or organized and

referenced by a control structure. Logically, a

bitmap represents a set of objects as a set

membership vector. For every object in the database,

the corresponding Boolean element of the vector is

True if the object belongs to the set. For objects

that do not belong to the set, the corresponding

elements are False. Boolean values of the vector

are represented by bits.

Consider the following method for storing

category membership. Membership information for

each category C is represented by one bitmap, where

each bit corresponds to an object ID. Bit N in the

array is 1 if an object with the object ID N belongs

to category C, and 0 otherwise. Thus, the length of

the array in bits will be equal to the number of

objects in the database and access to this information

will be fast since it is equivalent to direct access to

an element of an array.

The size of the bitmap is proportional to the

domain size. Thus, one bit is allocated in a bitmap

for every abstract object regardless of whether it

belongs to a represented set or not. For example,

consider a set of objects that have certain Boolean

attribute values set to True. The bitmap requires

just one bit to represent each fact; it would otherwise

be represented in fact storage as a string consisting

of object ID, relation ID, and a Boolean value, thus

using at least 10 bytes of storage. Being stored in a

B-tree, this string would also require some space to

maintain the B-Tree block structure. Thus, the

bitmap is about 80 times more efficient in terms of

space for total Boolean attributes. Access to facts in

a B-Tree requires a B-Tree search and unpacking of

facts, which are both non-trivial and time-consuming

operations compared to retrieval of values from a

bitmap, which, in most cases, is a simple direct

access operation.

Operations to access individual bits are simple.

Suppose m is the first object ID in a block and we

need to access a bit that corresponds to the object

with object ID n. The byte number within the block

that the bit belongs to is i=(m-n)/8. Bit number

within the byte is j=(m-n)MOD8, where MOD is the

modulo operation (remainder of a division).

To read the bit, the following formula is used:

B[i]&(1<<j), where B represents block as a byte

array. Result is 0 if bit is 0 and non-zero if bit is 1.

Use B[i]|=(1<<j) to set the bit to 1 and

B[i]&=~(1<<j) to reset the bit to 0

. To set the

bit to value x, use B[i]=(B[i]&~(1<<j))|(x

<<j).

Bitmaps have several attractive properties:

1. Bitmaps represent a set of objects in a compact

way since only one bit is used per object. Under

favourable conditions, this can be further

improved.

2. Operations on bitmaps are fast since only one

CPU instruction is needed to act on several

objects simultaneously.

3. A bitmap is a simple structure and the overhead

for accessing and maintaining it is small.

5 CONCLUSION

While designing a semantic database, the general

approach is to be flexible in selecting storage types.

Fact, record, and bitmap storage might be utilized

simultaneously for different purposes. Record

storage might be used for those attributes it suits

best, whereas fact storage might be used for other

attributes. Category membership might be stored as

bitmaps. With fact storage, it should be up to the

user to select the best storage type for various types

of data.

ACKNOWLEDGEMENTS

This material is based on work supported by the

National Science Foundation under Grants No.

HRD-0317692, EIA-0320956, EIA-0220562, CNS-

0426125, IIS-0326284, CCF-0330342, IIS-0086144,

and IIS-0209190.

REFERENCES

Rishe, N., 1992. Database Design: the Semantic Modeling

Approach, McGraw-Hill. 528 pp.

TPC-D, 1998. Transaction Processing Performance

Council. TPC Benchmark D, Standard Specification

Revision 2.1.

England, K., 2001. Microsoft SQL Server 2000

Performance Optimization and Tuning Handbook,

Digital Press; 1st edition, 320 pp.

Winter, R., 1999. Indexing Goes a New Direction,

Intelligent Enterprise, 2(2), pp. 70-73.

Jakobsson, H., 1997. Bitmap Indexing in Oracle Data

Warehousing. Database seminar at Stanford

University. http://www-db.stanford.edu/dbseminar/

Archive/FallY97/slides/oracle.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

440