ON THE SELF-SIMILARITY OF THE 1999 DARPA/LINCOLN LABORATORY EVALUATION DATA

Kun Huang (1), Dafang Zhang (2)
(1) School of Computer and Communication, (2) School of Software, Hunan University, Changsha, Hunan 410082, P.R. China

Keywords: Intrusion detection, Evaluation data, Network traffic, Self-similarity.

Abstract: While intrusion detection systems (IDSs) are becoming a ubiquitous defence, no comprehensive and scientifically rigorous benchmark is available to evaluate their performance. In 1998 and again in 1999, the Lincoln Laboratory of MIT conducted a comprehensive evaluation of IDSs and produced the DARPA off-line evaluation data to train and test IDSs. However, there is a lack of detailed characterization of the DARPA/Lincoln Laboratory evaluation data. This paper examines the self-similarity of the 1999 DARPA/Lincoln Laboratory evaluation data sets for training and shows that the evaluation data clearly exhibits self-similarity during the daytime hours of each weekday, but not during other time periods. The likely causes of the failure of self-similarity are also explored. These findings can help evaluators understand and properly use the 1999 DARPA/Lincoln Laboratory evaluation data to evaluate IDSs.

1 INTRODUCTION

Intrusion detection systems (IDSs) are an important

component of defensive measures protecting

computer systems and networks from rapidly

growing unauthorized intrusions (Denning, 1987). Numerous intrusion detection technologies have been developed and deployed in realistic environments.

While IDSs are becoming a ubiquitous defence, no comprehensive and scientifically rigorous benchmark is available to evaluate their performance. Current evaluation data for IDSs (Puketza, 1996) cannot be shared publicly due to privacy and security concerns. In 1998 and again in

1999, the Lincoln Laboratory of MIT conducted a

comprehensive evaluation of IDSs and released the

DARPA off-line evaluation data. The DARPA

evaluation data has been a widely used public

benchmark available to test both host-based and

network-based IDSs, and both signature-based and

anomaly-based IDSs.

IDSs under test are ultimately intended for use in real networks, so the evaluation data for IDSs is required to be realistic. However, the DARPA evaluation data is only claimed to be similar to real network traffic; this claim has not been validated in the literature (Richard, 2000; Lippmann, 2000). Since real network traffic captured from Local Area Networks and Wide Area Networks has been shown to statistically exhibit the property of self-similarity (Leland, 1994; Paxson, 1995; Beran, 1995), the 1999 DARPA evaluation data, which is attack-free network traffic data for training, should also exhibit self-similarity.

McHugh (McHugh, 2001) criticizes many aspects of the 1998 and 1999 DARPA/Lincoln Laboratory evaluations, including the questionable collected evaluation data, the attack taxonomy, and the evaluation criteria. In particular, the critique notes the lack of statistical characterization of the DARPA evaluation data and the absence of validation of its similarity to real network traffic. However, the critique does not quantify the statistical characteristics of the synthetic evaluation data or deeply explore the raised flaws and their likely causes.

This paper quantifies the statistical property of self-similarity of the 1999 DARPA/Lincoln Laboratory evaluation data and explores the likely causes of the failure of self-similarity. Our contribution will help evaluators understand and use the synthetic evaluation data well to train and test IDSs.

The rest of this paper is organized as follows.

Section 2 overviews the 1999 DARPA evaluation

data. Section 3 gives a brief background of

self-similarity. In Section 4, the self-similarity of the

1999 DARPA evaluation data is explored. Finally,

Section 5 draws conclusions.

Huang K. and Zhang D. (2006). ON THE SELF-SIMILARITY OF THE 1999 DARPA/LINCOLN LABORATORY EVALUATION DATA. In Proceedings of the International Conference on Security and Cryptography, pages 75-80. DOI: 10.5220/0002096900750080. Copyright © SciTePress.

2 SUMMARY OF 1999 DARPA

EVALUATION DATA

In 1998 and 1999, the Lincoln Laboratory of MIT conducted a large-scale quantitative evaluation of IDSs and publicly released the DARPA evaluation data as a comprehensive benchmark available through the Lincoln Laboratory website. To sanitize private and security-sensitive information and to eliminate any impact on the operation of a real network, the Lincoln Laboratory developed a model of real network traffic and then synthesized the normal behaviours and attack scenarios in an isolated test bed network (Lippmann, 2000).

The 1999 DARPA evaluation data includes three weeks of training data with background traffic and labeled attacks for tuning IDSs, and two weeks of test data with background traffic and unlabeled attacks. Every week of the evaluation data has five weekdays, and every day covers about 22 hours, from 8 AM to 6 AM of the next day, except for Thursday of week 3, which stops at about 4 AM, and Friday of week 3, which ends at about 1 AM. Of the five weeks, only weeks 1 and 3 are attack-free network traffic data; the rest mix background (attack-free) traffic with injected attack traffic. This paper therefore focuses on the attack-free background traffic of weeks 1 and 3.

In weeks 1 and 3, the network traffic predominantly occurred between 8 AM and 6 PM every weekday, with hardly any traffic during the remaining hours. At the IP layer, TCP and UDP packets dominate the overall network traffic per day, although packets of other protocols are also generated. It is noted that the inside network traffic is nearly the same as the outside network traffic.

3 BRIEF BACKGROUND OF

SELF-SIMILARITY

3.1 Definition and Properties of Self-similarity

A stochastic process is most commonly called self-similar with self-similarity parameter (that is, Hurst parameter H) if the rescaled process, with an appropriate rescaling factor, and the original process have identical finite-dimensional distributions (Leland, 1994).

Let X = {X_t, t = 0, 1, 2, ...} be a wide-sense stationary stochastic process with mean μ, variance σ^2, and autocorrelation function r(k), k ≥ 0, and let X^(m) = {X_k^(m), k = 1, 2, 3, ...} denote the aggregated time-series process obtained by averaging the original time series X over adjacent, non-overlapping blocks of size m (m = 1, 2, ...), i.e. X^(m) is given by X_k^(m) = (X_{(k-1)m+1} + ... + X_{km}) / m. The process X is called self-similar if the distribution of each corresponding aggregated process X^(m), m ≥ 1, is equal or approximately equal to that of the original process X (Leland, 1994).

There are four main properties of a self-similar process: the Hurst effect, slowly decaying variance, long-range dependence, and 1/f noise (Rose, 1996).
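As a concrete illustration (not from the paper), the block averaging that defines the aggregated process X^(m) can be sketched in a few lines of Python; the function name is ours:

```python
def aggregate(series, m):
    """Average a time series over adjacent, non-overlapping blocks of size m,
    producing the aggregated process X^(m); a trailing partial block is dropped."""
    n = len(series) // m
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(aggregate(x, 2))  # [1.5, 3.5, 5.5]
print(aggregate(x, 3))  # [2.0, 5.0]
```

A self-similar process looks statistically the same after this averaging at every block size m, which is exactly what the estimators below exploit.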

3.2 Estimating the Hurst Parameter

Various estimators of the Hurst parameter H are used to examine whether a stochastic process exhibits self-similarity and/or long-range dependence. The following estimation methods are used (Rose, 1996).

Variance-time Plots. The variance of the aggregated time-series process X^(m), m ≥ 1, satisfies var(X^(m)) ~ c m^(-β), or log(var(X^(m))) ~ -β log(m) + log(c), as m → ∞, where c is some positive constant and 0 < β < 1. In the log-log plot of the sample variance versus the aggregation level, a straight line with slope -β is fitted; since H = 1 - β/2, H can then be estimated.

R/S Analysis. The R/S statistic satisfies E[R(m)/S(m)] ~ c m^H, or log(E[R(m)/S(m)]) ~ H log(m) + log(c), as m → ∞, where 0.5 < H < 1. In the log-log plot of the R/S statistic versus the number of points of the aggregated series, the slope of the fitted straight line is an estimate of the Hurst parameter H.

Periodogram Method. This method plots the logarithm of the spectral density of a time-series process versus the logarithm of the frequency, that is, log(f(λ)) ~ -γ log(λ) + log(c) as λ → 0, where 0 < γ < 1, H = (1 + γ)/2, and c is some positive constant; the slope of the fitted straight line yields an estimate of the Hurst parameter. The periodogram is given by I(λ) = |Σ_{j=1}^{N} X_j e^{ijλ}|^2 / (2πN), where λ is the frequency, N is the length of the time series, and X is the actual time series. The periodogram I(λ) is an asymptotically unbiased estimate of the spectral density f(λ).

Whittle's Maximum Likelihood Estimator (MLE). Since the periodogram itself is not an appropriate estimate of the spectral density, Whittle's MLE is used to estimate the spectral density by minimizing an approximate log-likelihood function applied to it, thereby obtaining an estimate of the Hurst parameter together with a confidence interval. A more detailed description of the MLE is given in (Rose, 1996). However, it is noted that Whittle's MLE only makes an accurate estimate if the process is known to be self-similar.

Abry-Veitch Wavelet-based Analysis. This method computes the Discrete Wavelet Transform, averages the sequences of coefficients of the transform, and then performs a linear regression of the logarithm of the average versus j, the scale parameter of the transform. The resulting slope is directly proportional to H. A more detailed description is given in (Rose, 1996).

4 EXPERIMENTAL RESULTS

The self-similarity of the attack-free training data of the 1999 DARPA/Lincoln evaluation data set is examined using the above five Hurst parameter estimation methods. Since the last two methods provide an accurate estimate only when the process is already known to be self-similar, the first three methods are used to check whether the process is self-similar, and the last two are used to estimate the Hurst parameter accurately.

Let H_var, H_R/S, H_per, H_Whittle, and H_Abry-Veitch denote the Hurst parameters estimated by the variance-time plots, the R/S analysis, the periodogram method, Whittle's MLE, and the Abry-Veitch wavelet-based analysis, respectively, and let H_avg denote the average of the estimated Hurst parameters. If 0.5 < (H_var + H_R/S + H_per)/3 < 1, then H_avg is the average of all five estimated Hurst parameters; otherwise H_avg = (H_var + H_R/S + H_per)/3.

4.1 Examining Self-similarity

The total counts of frame arrivals are recorded in each 0.3-second interval. The above five estimation methods are then used to estimate the Hurst parameters and compute the average values for each 1-hour period of 12,000 sample points. The Hurst estimates of the frame-count arrival process of week 1 on the inside and outside networks are shown in Figure 1 (a) and (b), respectively. Similarly, Figure 2 shows the Hurst estimates of the frame-count arrival process of week 3 on the inside and outside networks.
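The per-hour sampling described above (0.3-second counting bins, 12,000 bins per hour) can be sketched as follows; the function name is ours, and packet timestamps are assumed to be seconds from the start of the hour:

```python
def frame_counts(timestamps, bin_width=0.3, duration=3600.0):
    """Count frame arrivals per fixed-width bin; one hour at 0.3 s
    yields the 12,000 sample points used for each hourly Hurst estimate."""
    n_bins = int(duration / bin_width)
    counts = [0] * n_bins
    for t in timestamps:
        if 0 <= t < duration:
            counts[int(t / bin_width)] += 1
    return counts

counts = frame_counts([0.05, 0.1, 0.31, 0.95, 3599.9])
print(len(counts))   # 12000
print(counts[:4])    # [2, 1, 0, 1]
```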

Figure 1 shows that the evaluation data clearly exhibits self-similarity during all of the 08 AM to 09 PM periods of every weekday of week 1 on both the inside and outside networks. Figure 2 likewise shows that the evaluation data clearly exhibits self-similarity during all of the 08 AM to 07 PM periods of every weekday of week 3 on the inside network, and during all of the 08 AM to 10 PM periods of every weekday of week 3 on the outside network. During the other times of weeks 1 and 3, the 1999 DARPA evaluation data does not clearly exhibit the property of self-similarity; in particular, the Hurst parameter values fluctuate, meaning that the evaluation data sometimes exhibits self-similarity and sometimes fails to. Table 1 shows the periods of weeks 1 and 3 in which the evaluation data fails to exhibit self-similarity.

At the same time, Figure 1, Figure 2, and Table 1 show that, except on Monday of week 1 and on Tuesday and Wednesday of week 3, the evaluation data on the inside network exhibits different self-similarity from that on the outside network, even though the synthetic traffic is generated to intercommunicate and pass through both the inside and outside networks.

4.2 Investigating the Likely Causes of Failing Self-similarity

For the periods listed in Table 1, which fail to exhibit self-similarity, the likely causes are investigated as follows.

First, the traffic rate is too low. Figure 3 (a) shows that the traffic rate during the 02 AM to 03 AM period on Tuesday of week 1 is very low.

Second, a certain application-layer protocol (i.e., HTTP) generated by a Poisson model absolutely dominates the packet distribution of the evaluation traffic that fails self-similarity. Figure 3 (b) shows that during the 03 AM to 04 AM period on Wednesday of week 3, HTTP packets dominate the TCP services. Since the HTTP activities are generated by a Poisson model, the traffic as a whole tends toward the Poisson model and fails self-similarity.

Finally, UDP dominates the whole traffic and dilutes the effect of TCP, which maintains the property of self-similarity. Figure 3 (c) shows that from 00 AM to 06 AM on Monday of week 1, UDP dominates the whole traffic on the inside network. Park (Park, 1996) indicates that reliable TCP serves to maintain self-similarity, while unreliable, non-flow-controlled UDP shows little self-similarity. So when UDP dominates the whole traffic, it dilutes the effect of TCP, which results in failing self-similarity.


Table 1: Periods of weeks 1 and 3 (Mon-Fri) that fail to exhibit self-similarity.

Week 1, Inside Network: 00 A.M.~06 A.M.; 21 P.M.~00 A.M. and 01 A.M.~06 A.M.
Week 1, Outside Network: 00 A.M.~06 A.M.; 01 A.M.~06 A.M.; 02 A.M.~04 A.M.; 21 P.M.~00 A.M., 01 A.M.~02 A.M., and 04 A.M.~06 A.M.; 22 P.M.~23 P.M. and 00 A.M.~06 A.M.
Week 3, Inside Network: 02 A.M.~03 A.M.; 22 P.M.~00 A.M. and 01 A.M.~06 A.M.; 04 A.M.~05 A.M.; 19 P.M.~20 P.M. and 22 P.M.~01 A.M.
Week 3, Outside Network: 00 A.M.~01 A.M. and 02 A.M.~06 A.M.; 02 A.M.~03 A.M.; 22 P.M.~00 A.M. and 01 A.M.~06 A.M.; 00 A.M.~05 A.M.

Figure 1: Hurst estimates of the frame-count process in week 1 on the (a) inside and (b) outside network (hourly Hurst estimates, 0.0-1.0, over hours 1-22, Mon-Fri).

Figure 2: Hurst estimates of the frame-count process in week 3 on the (a) inside and (b) outside network (hourly Hurst estimates, 0.0-1.0, over hours 1-22, Mon-Fri).

Figure 3: The likely causes of failing self-similarity: (a) traffic rates (packet counts per 3-second bin, 02 AM to 03 AM on Tuesday of week 1); (b) application protocol distribution of TCP services (HTTP: 43296, TELNET: 1101, SSH: 16, others: 12 packets); (c) TCP, UDP, and ICMP packet counts by time of day (00-06 A.M.).


4.3 Related Work

Allen and Marin (Allen, 2003) examine the attack-free training data for the presence of self-similarity in various time periods using the periodogram method and Whittle's MLE. Their results show that the 1999 DARPA evaluation data exhibits self-similarity during the 08 AM to 06 PM periods, while our results show that it does during the 08 AM to 09 PM periods of week 1 on both the inside and outside networks, during the 08 AM to 07 PM periods of week 3 on the inside network, and during the 08 AM to 10 PM periods of week 3 on the outside network.

Compared with (Allen, 2003), we provide more accurate and detailed Hurst parameter values by using more estimation methods, and we consider the difference between the evaluation data on the inside network and that on the outside network.

5 CONCLUSIONS

This paper examines the self-similarity of the 1999 DARPA/Lincoln Laboratory evaluation data using five estimation methods for the Hurst parameter. The experimental results show that the evaluation data clearly exhibits self-similarity during the 08 AM to 09 PM periods of week 1 on both the inside and outside networks, during the 08 AM to 07 PM periods of week 3 on the inside network, and during the 08 AM to 10 PM periods of week 3 on the outside network, while during other time periods it fails to exhibit self-similarity.

Three likely causes of failing self-similarity are explored: (1) the traffic rate is too low; (2) a certain application-level protocol (i.e., HTTP) generated by a Poisson model absolutely dominates the whole traffic; (3) UDP dominates the whole traffic and dilutes the effect of TCP, which results in showing little self-similarity. Our findings will help evaluators use the evaluation data well to evaluate IDSs.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China under Grant No. 60473031.

REFERENCES

Denning, D.E., 1987. An intrusion-detection model. IEEE Transactions on Software Engineering, Vol. 13, pp. 222-232.

Puketza, N., Zhang, K., Chung, M., et al., 1996. A methodology for testing intrusion detection systems. IEEE Transactions on Software Engineering, Vol. 22, pp. 719-729.

Richard, P., Lippmann, R., Fried, D., et al., 2000. Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation. In Proc. of the 2000 DARPA Information Survivability Conference and Exposition, Hilton Head, South Carolina, pp. 12-26.

Lippmann, R., Haines, J., Fried, D., et al., 2000. The 1999 DARPA off-line intrusion detection evaluation. Computer Networks, Vol. 34, pp. 579-595.

Lippmann, R., Haines, J., Fried, D., et al., 2000. Analysis and results of the 1999 DARPA off-line intrusion detection evaluation. In Proc. of the Third International Workshop on Recent Advances in Intrusion Detection, Toulouse, France, pp. 162-182.

McHugh, J., 2001. Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information and System Security, Vol. 3, pp. 262-294.

Leland, W., Taqqu, M., Willinger, W., et al., 1994. On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on Networking, Vol. 2, pp. 1-15.

Paxson, V., Floyd, S., 1995. Wide-area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking, Vol. 3, pp. 226-244.

Beran, J., Sherman, R., Taqqu, M., et al., 1995. Long-range dependence in variable-bit-rate video traffic. IEEE Transactions on Communications, Vol. 43, pp. 1566-1579.

MIT Lincoln Laboratory, 2003. Intrusion detection evaluation web site. http://www.ll.mit.edu/IST/ideval

Rose, O., 1996. Estimation of the Hurst parameter of long-range dependent time series. Technical Report No. 137, Institute of Computer Science, University of Würzburg.

Park, K., Kim, G., Crovella, M., 1996. On the relationship between file sizes, transport protocols, and self-similar network traffic. In Proc. of the 4th International Conference on Network Protocols, pp. 171-180.

Allen, W.H., Marin, G.A., 2003. On the self-similarity of synthetic traffic for the evaluation of intrusion detection. In Proc. of the 2003 Symposium on Applications and the Internet, pp. 242-248.
