SUPPORTING THE CYBERCRIME INVESTIGATION PROCESS:
EFFECTIVE DISCRIMINATION OF SOURCE CODE AUTHORS
BASED ON BYTE-LEVEL INFORMATION
Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis
Laboratory of Information and Communication Systems Security, Aegean University
Department of Information and Communication Systems Engineering, Karlovasi, Samos, 83200, Greece
Keywords: Source Code Authorship Analysis, Software Forensics, Security.
Abstract: Source code authorship analysis is the particular field that attempts to identify the author of a computer
program by treating each program as a linguistically analyzable entity. This is usually based on other
undisputed program samples from the same author. There are several cases where the application of such a
method could be of a major benefit, such as tracing the source of code left in the system after a cyber attack,
authorship disputes, proof of authorship in court, etc. In this paper, we present our approach which is based
on byte-level n-gram profiles and is an extension of a method that has been successfully applied to natural
language text authorship attribution. We propose a simplified profile and a new similarity measure which is
less complicated than the algorithm followed in text authorship attribution and seems more suitable for
source code author identification, since it is better able to deal with very small training sets. Experiments were
performed on two different data sets, one with programs written in C++ and the second with programs
written in Java. Unlike the traditional language-dependent metrics used by previous studies, our approach
can be applied to any programming language with no additional cost. The presented accuracy rates are much
better than the best reported results for the same data sets.
1 INTRODUCTION
In a wide variety of cases it is important to identify
the author of a piece of code. Such situations include
cyber attacks in the form of viruses, trojan horses,
logic bombs, fraud, and credit card cloning, as well
as authorship disputes or proof of authorship in
court. But why do we believe it is possible to identify
the author of a computer program? Humans are
creatures of habit and habits tend to persist. That is
why, for example, we have a handwriting style that
is consistent during periods of our life, although the
style may vary, as we grow older. Does the same
apply to programming? Although source code is
much more formal and restrictive than spoken or
written languages, there is still a large degree of
flexibility when writing a program (Krsul and
Spafford, 1996).
Source code authorship analysis could be applied
to the following application areas (Frantzeskou et
al., 2004):
1. Author identification. The aim here is to
decide whether some piece of code was written by a
certain author. This goal is accomplished by
comparing this piece of code against other program
samples written by that author. This application area
has many similarities with the corresponding task in
the literary domain, where the aim is to determine
whether a piece of work was written by a certain
author.
2. Author characterisation. This application area
determines some characteristics of the author of a
piece of code, such as cultural and educational
background and language familiarity, based on their
programming style.
3. Plagiarism detection. This method attempts to
find similarities among multiple sets of source code
files. It is used to detect plagiarism, which can be
defined as the use of another person’s work without
proper acknowledgement.
4. Author discrimination. This task is the
opposite of the above and involves deciding whether
some pieces of code were written by a single author
or by some number of authors. An example of this
would be showing that a program was probably
written by three different authors, without actually
identifying the authors in question.
5. Author intent determination. In some cases we
need to know whether a piece of code, which caused
a malfunction, was written having this as its goal or
was the result of an accidental error. In many cases,
an error during the software development process
can cause serious problems.
The traditional methodology that has been
followed in this area of research is divided into two
main steps (Krsul and Spafford, 1995; MacDonell et
al., 2001; Ding, 2004). The first step is the extraction of
software metrics and the second step is using these
metrics to develop models that are capable of
discriminating between several authors, using a
machine learning algorithm. In general, the software
metrics used are programming-language dependent.
Moreover, the metrics selection process is a
non-trivial task.
In this paper we present a new approach, which
is an extension of a method that has been applied to
natural language text authorship identification
(Keselj et al., 2003). In our method, byte-level
n-grams are utilised together with author profiles. We
propose a new simplified profile and a new
similarity measure which enables us to achieve a
high degree of accuracy for authors for whom we
have a very small training set. Our methodology is
programming-language independent, since it is
based on low-level information, and is tested on data
sets from two different programming languages. The
simplified profile and the new similarity measure we
introduce provide a less complicated algorithm than
the method used in text authorship attribution and in
many cases they achieve higher prediction accuracy.
Special attention is paid to the evaluation
methodology. Disjoint training and test sets of equal
size were used in all the experiments in order to
ensure the reliability of the presented results. Note
that in many previous studies the evaluation of the
proposed methodologies was performed on the
training set. Our approach is able to deal effectively
with cases where there are just a few available
programs per author. Moreover, the accuracy results
are high even for cases where the available programs
are of restricted length.
The rest of this paper is organized as follows.
Section 2 contains a review of past research efforts
in the area of source code authorship analysis.
Section 3 describes our approach and section 4
includes the experiments we have performed.
Finally, section 5 contains conclusions and future
work.
2 RELATED WORK
The most extensive and comprehensive application
of authorship analysis is in literature. One famous
authorship analysis study is related to Shakespeare's
works and dates back several centuries. Elliot and
Valenza (1991) compared the poems of Shakespeare
with those of Edward de Vere, 7th Earl of Oxford, in
an attempt to show that Shakespeare was a hoax and
that the real author was Edward de Vere, the Earl of
Oxford. Recently, a
number of authorship attribution approaches have
been presented (Stamatatos et al., 2000; Keselj et
al., 2003; Peng et al., 2004), proving that the author of
a natural language text can be reliably identified.
Although source code is much more formal and
restrictive than spoken or written languages, there is
still a large degree of flexibility when writing a
program (Krsul and Spafford, 1996). Spafford and
Weeber (1993) suggested that it might be feasible to
analyze the remnants of software after a computer
attack, such as viruses, worms or trojan horses, and
identify its author. This technique, called software
forensics, could be used to examine software in any
form to obtain evidence about the factors involved.
They investigated two different cases where code
remnants might be analyzed: executable code and
source code. Executable code, even if optimized,
still contains many features that may be considered
in the analysis such as data structures and
algorithms, compiler and system information,
programming skill and system knowledge, choice of
system calls, errors, etc. Source code features
include programming language, use of language
features, comment style, variable names, spelling
and grammar, etc.
Oman and Cook (1989) used “markers” based on
typographic characteristics to test authorship on
Pascal programs. The experiment was performed on
18 programs written by six authors. Each program
was an implementation of a simple algorithm and it
was obtained from computer science textbooks.
They claimed that the results were surprisingly
accurate.
Longstaff and Schultz (1993) studied the WANK
and OILZ worms which in 1989 attacked NASA and
DOE systems. They manually analyzed code
structures and features and concluded that three
distinct authors worked on the worms. In addition,
they were able to infer certain characteristics of the
authors, such as their educational backgrounds and
programming levels.
Sallis et al. (1996) expanded the work of Spafford
and Weeber by suggesting some additional features,
such as cyclomatic complexity of the control flow
and the use of layout conventions.
An automated approach was taken by Krsul and
Spafford (1995) to identify the author of a program
written in C. The study relied on the use of software
metrics, collected from a variety of sources. They
were divided into three categories: layout, style and
structure metrics. These features were extracted
using a software analyzer program from 88
programs belonging to 29 authors. A tool was
developed to visualize the metrics collected and help
select those metrics that exhibited little within-
author variation, but large between-author variation.
A statistical approach called discriminant analysis
(SAS) was applied on the chosen subset of metrics
to classify the programs by author. The experiment
achieved 73% overall accuracy.
Other research groups have examined the
authorship of computer programs written in C++
(Kilgour et al., 1997; MacDonell et al., 2001). A
dictionary-based system called IDENTIFIED
(integrated dictionary-based extraction of
non-language-dependent token information for
forensic identification, examination, and
discrimination) was developed to extract source code
metrics for authorship analysis (Gray et al., 1998).
Satisfactory results were obtained for C++ programs
using case-based reasoning, a feed-forward neural
network, and multiple discriminant analysis
(MacDonell et al., 2001). The best prediction
accuracy, 88% for 7 different authors, was achieved
by case-based reasoning.
Ding (2004) investigated the extraction of a set
of software metrics from given Java source code
that could be used as a fingerprint to identify the
author of the Java code. The contributions of the
selected metrics to authorship identification were
measured by a statistical process, namely canonical
discriminant analysis, using the statistical software
package SAS. A set of 56 metrics of Java programs
was proposed for authorship analysis. Forty-six
groups of programs were diversely collected.
Classification accuracies were 62.7% and 67.2%
when the metrics were selected manually, while they
were 62.6% and 66.6% when the metrics were
chosen by stepwise discriminant analysis (SDA).
The main focus of the previous approaches was
the definition of the most appropriate measures for
representing the style of an author. Quantitative and
qualitative measurements, referred to as metrics, are
collected from a set of programs. Ideally, such
metrics should have low within-author variability
and high between-author variability (Krsul and
Spafford, 1996; Kilgour et al., 1997). Such metrics
include:
- Programming layout metrics: include those
metrics that deal with the layout of the program, for
example, metrics that measure indentation,
placement of comments, placement of braces, etc. (a
toy sketch of such metrics is given after this list).
- Programming style metrics: Such metrics
include character preferences, construct preferences,
statistical distribution of variable lengths and
function name lengths etc.
- Programming structure metrics: include
metrics that we hypothesize are dependent on the
programming experience and ability of the author,
for example, the statistical distribution of lines of
code per function, the ratio of keywords per lines of
code, etc.
- Fuzzy logic metrics: include variables that
allow the capture of concepts that authors can
identify with, such as deliberate versus non-deliberate
spelling errors, the degree to which code and
comments match, and whether the identifiers used are
meaningful.
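To make the flavour of such language-dependent
metrics concrete, the following Python sketch (ours,
purely illustrative; it reproduces no specific tool from
the studies cited above) computes two toy layout
metrics for a C-like source file:

    def simple_layout_metrics(source: str) -> dict:
        # Two toy layout metrics for a C-like language: how often
        # non-blank lines are indented, and how often they carry
        # a '//' line comment.
        lines = [ln for ln in source.splitlines() if ln.strip()]
        total = max(len(lines), 1)
        indented = sum(1 for ln in lines if ln[:1] in (" ", "\t"))
        commented = sum(1 for ln in lines if "//" in ln)
        return {"indentation_ratio": indented / total,
                "comment_ratio": commented / total}

Note that even this toy example is language-dependent:
the comment marker '//' would have to change for Pascal
or Lisp, which is precisely the disadvantage discussed
next.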
However, there are some disadvantages in this
traditional approach. The first is that the software
metrics used are programming-language dependent.
For example, metrics used for Java cannot be used
for C or Pascal. The second is that metrics selection
is not a trivial process and usually involves setting
thresholds to eliminate those metrics that contribute
little to the classification model. As a result, the
focus in many of the previous research efforts, such
as (Ding, 2004) and (Krsul and Spafford, 1995), was
on the metrics selection process rather than on
improving the effectiveness and the efficiency of the
proposed models.
3 OUR APPROACH
In this paper, we present our approach, which is an
extension of a method that has been successfully
applied to text authorship identification (Keselj, et al
2003). It is based on byte level n-grams and the
utilization of two different similarity measures used
to classify a program to an author. Therefore, this
method does not use any language-dependent
information.
An n-gram is a contiguous sequence of n items and
can be defined at the byte, character, or word level.
Byte, character and word n-grams have been used in
a variety of applications such as text authorship
attribution, speech recognition, language modelling,
context-sensitive spelling correction, optical
character recognition, etc. In our approach, the Perl
package Text::N-grams (Keselj 2003) has been used
to produce n-gram tables for each file or set of files
that is required. An example of such a table is given
in Table 1. The first column contains the n-grams
found in a source code file and the second column
the corresponding frequency of occurrence.
Table 1: n-gram frequencies extracted from a source code
file.
3-gram Frequency
sio 28
_th 28
f_( 20
_=_ 17
usi 16
_ms 16
out 15
ine 15
\n/* 15
on_ 14
_in 14
fp_ 14
the 14
sg_ 14
_i_ 14
in_ 14
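For a concrete picture of what such a table contains, a
minimal Python equivalent of the byte-level counting is
sketched below (the function name and the sample file
name are our own; Table 1 additionally renders spaces as
underscores for readability):

    from collections import Counter

    def byte_ngram_table(data: bytes, n: int = 3) -> Counter:
        # Count all overlapping byte-level n-grams in the file
        # content; newlines and spaces count as ordinary bytes.
        return Counter(data[i:i + n] for i in range(len(data) - n + 1))

    # Usage: the most frequent 3-grams of one source file, as in Table 1.
    with open("sample.cpp", "rb") as f:  # hypothetical file name
        table = byte_ngram_table(f.read(), n=3)
    for gram, freq in table.most_common(10):
        print(gram, freq)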
The algorithm computes n-gram based profiles
that represent each author category. First, for each
author the available training source code samples are
concatenated to form a single large file. Then, the set
of the L most frequent n-grams of this file is
extracted. The profile of an author is, then, the
ordered set of pairs {(x_1, f_1), (x_2, f_2), ..., (x_L, f_L)} of
the L most frequent n-grams x_i and their normalized
frequencies f_i. Similarly, a profile is constructed for
each test case (a single source code file). In order to
classify a test case to an author, the profile of the
test file is compared with the profiles of all the
candidate authors based on a similarity measure. The
most likely author corresponds to the least dissimilar
profile (in essence, a nearest-neighbour
classification model).
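A minimal sketch of this profile construction, reusing
byte_ngram_table from the earlier snippet (normalizing by
the total n-gram count is our assumption; the paper does
not spell out the normalization):

    def author_profile(paths: list, n: int, L: int) -> dict:
        # Concatenate all training samples of one author and keep
        # the L most frequent n-grams with normalized frequencies.
        data = bytearray()
        for p in paths:
            with open(p, "rb") as f:
                data.extend(f.read())
        counts = byte_ngram_table(bytes(data), n)
        total = sum(counts.values())  # assumed normalization constant
        return {g: c / total for g, c in counts.most_common(L)}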
The original similarity measure (more precisely, a
dissimilarity measure) used by Keselj et al. (2003) in
text authorship attribution is a form of relative
distance:

RD(f_1, f_2) = \sum_{n \in \mathrm{profile}_1 \cup \mathrm{profile}_2} \left( \frac{2\,(f_1(n) - f_2(n))}{f_1(n) + f_2(n)} \right)^2    (1)

where f_1(n) and f_2(n) are the normalized frequencies
of an n-gram n in the author and the program profile,
respectively, or 0 if the n-gram does not exist in the
profile. A program is classified to the author whose
profile has the minimal distance from the program
profile, using this measure. Hereafter, this distance
measure will be called Relative Distance (RD).
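A direct transcription of (1) into Python (ours,
illustrative; our actual tooling consists of Perl scripts,
as noted below):

    def relative_distance(author: dict, program: dict) -> float:
        # Dissimilarity (1): missing n-grams contribute frequency 0,
        # and the sum runs over the union of both profiles, so the
        # denominator f1 + f2 is never zero.
        rd = 0.0
        for g in set(author) | set(program):
            f1, f2 = author.get(g, 0.0), program.get(g, 0.0)
            rd += (2 * (f1 - f2) / (f1 + f2)) ** 2
        return rd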
One of the inherent advantages of this approach
is that it is language independent since it is based on
low-level information. As a result, it can be applied
with no additional cost to data sets where programs
are written in C++, Java, Perl etc. Moreover, it does
not require multiple training examples from each
author, since it is based on one profile per author.
The more source code programs available for each
author, the more reliable the author profile. On the
other hand, this similarity measure is not suitable for
cases where only a limited training set is available
for each author. In that case, for low values of n, the
possible profile length for some authors is also
limited, and as a consequence, these authors have an
advantage over the others. Note that this is
especially the case in many source code author
identification problems, where only a few short
source code samples are available for each author.
In order to handle this situation, we propose a
new similarity measure that does not use the
normalized frequencies f_i of the n-grams. Hence the
profile we propose is a Simplified Profile (SP) and is
the set of the L most frequent n-grams {x_1, x_2, ..., x_L}.
If SP_A and SP_P are the author and program simplified
profiles, respectively, then the similarity is given by
the size of the intersection of the two profiles:

|SP_A \cap SP_P|    (2)

where |X| is the size of X. In other words, the
similarity measure we propose is just the number of
n-grams common to the profiles of the test case and
the author.
The program is classified to the author
with whom the largest intersection is achieved.
Hereafter, this similarity measure will be called
Simplified Profile Intersection (SPI). We have
developed a number of Perl scripts in order to create
the sets of n-gram tables for the different values of n
(i.e., n-gram length) and L (i.e., profile length) and
for the classification of each program file to the most
similar author.
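Purely for illustration (our scripts are in Perl), the SPI
decision rule can be sketched in Python as follows, building
on the profile sketches above; a simplified profile is just
the key set of a profile, e.g. set(author_profile(paths, n, L)):

    def spi(author_sp: set, program_sp: set) -> int:
        # Similarity (2): number of n-grams shared by the two
        # simplified profiles.
        return len(author_sp & program_sp)

    def classify(program_sp: set, author_sps: dict) -> str:
        # Nearest-neighbour rule: assign the program to the author
        # with the largest intersection.
        return max(author_sps, key=lambda a: spi(author_sps[a], program_sp))

For RD, the same loop applies with min and relative_distance
in place of max and spi.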
4 EXPERIMENTS
4.1 Comparison with a Previous
Approach
Our purpose during this phase was to check that the
presented approach works at least as well as
previous methodologies for source code author
identification. For this reason, we ran this
experiment with a data set that was initially used by
MacDonell et al. (2001) for evaluating a system for
automatic discrimination of source code authors
based on more complicated, language-dependent
measures. All programs were written in C++. The
source code for the first three authors was taken
from programming books, while the last three
authors were expert professional programmers. The
data set was split (as equally as possible) into a
training set (50%, 134 programs) and a test set
(50%, 133 programs). The best result reported by
MacDonell et al. (2001) on the test set was 88%,
using the case-based reasoning (that is, a
memory-based learning) algorithm. Detailed
information for the
C++ data set is given in Table 2. Moreover, the
distribution of the programs per author is given in
Table 3.
Table 2. The data sets used in this study. ‘Programs per
author’ is expressed by the minimum and maximum
number of programs per author in the data set. Program
length is expressed by means of Lines Of Code (LOC).
Data Set
C++ Java
Number of authors 6 8
Programs per author 5-114 5-8
Total number of programs 268 54
Training set programs 134 28
Testing set programs 133 26
Size of smallest program (LOC) 19 36
Size of biggest program (LOC) 1449 258
Mean LOC per program 210 129
Mean LOC in training set 206.4 131.7
Mean LOC in testing set 213 127.2
Table 3. Program distribution per author for the C++ data
set.
Training Set Test Set
Author 1 34 34
Author 2 57 57
Author 3 13 13
Author 4 6 6
Author 5 3 2
Author 6 21 21
We used byte-level n-grams extracted from the
programs in order to create the author and program
profiles as well as the author and program simplified
profiles. Table 4 includes the classification accuracy
results for various combinations of n (n-gram size)
and L (profile size). In many cases, classification
accuracy reaches 100%, much better than the best
reported accuracy for this data set (88% on the test
set; MacDonell et al., 2001). This proves that the
presented methodology can cope effectively with the
source code author identification problem. For n<4
and L<1000, accuracy drops. The same holds
(although to a lesser extent) for n>6.
More importantly, RD performs much worse
than SPI in all cases where at least one author profile
is shorter than L. For example, for L=1000 and n=2,
L is greater than the size of the profile of Author 5
(the maximum profile length for Author 5 is 769)
and the accuracy rate declines to 51%. This occurs
because the RD similarity measure (1) is affected by
the size of the author profile. When the size of an
author profile is lower than L, some programs are
wrongly classified to that author. In summary, we
can conclude that the RD similarity measure is not
as accurate for those n, L combinations where L
exceeds the size of even one author profile in the
data set. In all cases, the accuracy using the SPI
similarity measure is better than (or equal to) that of
RD. This proves that the new and simpler similarity
measure is not affected by cases where L is greater
than the smallest author profile.
4.2 Application to a Different
Programming Language
The next experiment was performed on a
different data set from a different programming
language. In more detail, the new data set consists of
student programs (assignments from a programming
language course) written in Java. Detailed
information for this data set is given in Table 2. We
used 8 authors; from each author 5-8 programs were
chosen. Table 5 shows the distribution of programs
per author. The size of programs was between 36
and 258 lines of code. The data set was split into
training and test sets of approximately equal size.
This data set has been chosen in order to evaluate
our approach when the available training data per
author are limited (a few short programs per author).
Table 4. Classification accuracy (%) on the C++ data set for different values of n-gram size and profile size using two
similarity measures: Relative Distance and Simplified Profile Intersection.
Profile
Size L
n-gram Size
2 3 4 5 6 7 8
RD SPI RD SPI RD SPI RD SPI RD SPI RD SPI RD SPI
200 98.4 98.4 97.7 97.7 97 97 95.5 95.5 94.7 95.5 92.5 92.5 92.5 94.7
500 100 100 100 100 100 100 99.2 100 98.4 98.4 97.7 97.7 97.7 97.7
1000 51 99.2 100 100 100 100 100 100 100 100 100 100 99.2 99.2
1500 5.3 98.4 100 100 100 100 100 100 100 100 99.2 99.2 99.2 100
2000 1.5 97.7 98.4 100 100 100 100 100 100 100 100 100 100 100
2500 1.5 95.5 99.2 100 100 100 100 100 100 100 100 100 100 100
3000 1.5 95.5 55.6 100 100 100 100 100 100 100 100 100 100 100
Note that programs written by students usually
have no comments, their programming style is
influenced by the instructor, and their code can be
plagiarised; these circumstances create some extra
difficulties in the analysis.
Table 5. Program distribution per author of the Java data
set.
Training Set Test Set
Author 1 3 3
Author 2 4 4
Author 3 3 2
Author 4 3 3
Author 5 4 4
Author 6 3 3
Author 7 4 3
Author 8 4 4
The results of the proposed method on this data
set are given in Table 6. The best accuracy rate
achieved with the similarity measure RD was 84.6%.
Again, when the profile size of at least one author is
shorter than the selected profile size L, the accuracy
of RD drops significantly. Using the similarity
measure SPI, the best result was 88.5%. In general,
SPI performed better than RD. Moreover, it seems
that 4<n<7 and 1000<L<3000 provide the best
accuracy results.
4.3 The Significance of Training Set
Size
The purpose of this experiment was to examine the
degree in which the training set size affects the
classification accuracy. For this reason we used the
C++ data set for which we reached classification
accuracy of 100% for many n, L combinations with
both similarity measures. This result has been
achieved by using a training set of 134 programs in
total. For the purposes of this experiment we used
the same test set as in the experiment of section 4.1
but now we used training sets of different, smaller
size. The smallest training set was comprised by
only one program from each author and the biggest
by 5 programs from each one (with the exception of
one author for whom the available training programs
were only 3). The presented source code author
identification approach was applied to these new
training sets using n=6 and L=1500 and similarity
measure SPI. Note that the training size of authors
was smaller than L in many of these experiments
and as already explained, in such cases the
classification accuracy decreases dramatically when
using the similarity measure RD.
The accuracy results achieved are shown in Table 7.
As can be seen, even with just one program per
author available in the training set, considerable
classification accuracy was achieved. By adding a
second program per author the accuracy increased
significantly, above 96%. Note that the second
programs added to the training set were on average
longer than the first programs (see the second
column of Table 7). We reached 100% accuracy for
the training set based on five programs per author.
This is a strong indication that our approach is quite
effective even when a training set of very limited
size is available, a condition usually met in source
code author identification problems.
Table 6. Classification accuracy (%) on the Java data set for different values of n-gram size and profile size using two
similarity measures: Relative Distance and Simplified Profile Intersection.
Profile
Size L
n-gram Size
3 4 5 6 7 8
RD SPI RD SPI RD SPI RD SPI RD SPI RD SPI
1000 80.8 80.8 84.6 84.6 84.6 84.6 80.8 80.8 80.8 80.8 84.6 84.6
1500 84.6 84.6 76.9 76.9 80.8 80.8 84.6 84.6 80.8 80.8 80.8 80.8
2000 53.8 80.8 65.4 80.8 76.9 80.8 84.6 88.5 84.6 84.6 84.6 84.6
2500 53.8 73.1 53.8 76.9 53.8 80.8 84.6 88.5 84.6 88.5 84.6 84.6
3000 53.8 73.1 53.8 80.8 50 76.9 53.8 84.6 69.2 84.6 84.6 84.6
Table 7. Classification Accuracy (%) on the C++ data set
using different training set size (in programs per author).
Training
Set Size
Mean LOC
in Training Set
Accuracy
(%)
1 52 63.9
2 212 96.2
3 171 97
4 170 99.2
5 197 100
5 CONCLUSIONS
In this paper, an approach to source code authorship
analysis has been presented. It is based on byte-level
n-gram profiles, a technique successfully applied to
natural language author identification problems. The
accuracies achieved for the two data sets, which
come from different programming languages, were
88.5% and 100% on test sets disjoint from the
corresponding training sets, improving the best
reported results for this task so far. Moreover, the
proposed method is able to deal with very limited
training data, a condition usually met in source code
authorship analysis problems (e.g., cyber attacks,
source code authorship disputes, etc.), with no
significant compromise in performance.
We introduced a new simplified profile and a
new similarity measure. The advantage of the new
measure over the original similarity measure is that
it is not dramatically affected in cases where there is
extremely limited training data for some authors.
Moreover, the proposed method is less complicated
than the original approach followed in text
authorship attribution.
More experiments have to be performed on
various data sets in order to define the most
appropriate combination of n-gram size and profile
size for a given problem. The role of comments also
has to be examined. In addition, cases where all the
available source code programs deal with the same
task should be tested as well. Another useful
direction would be the discrimination of different
programming styles in collaborative projects.
REFERENCES
Ding, H., and Samadzadeh, M. H., 2004. Extraction of
Java program fingerprints for software authorship
identification. The Journal of Systems and Software,
72(1): 49-57.
Elliot, W., and Valenza, R., 1991. Was the Earl of Oxford
the true Shakespeare? Notes and Queries, 38: 501-506.
Gray, A., Sallis, P., and MacDonell, S., 1998. IDENTIFIED
(integrated dictionary-based extraction of
non-language-dependent token information for forensic
identification, examination, and discrimination): A
dictionary-based system for extracting source code
metrics for software forensics. In Proceedings of
SE:E&P'98 (Software Engineering: Education and
Practice Conference), IEEE Computer Society Press,
pages 252-259.
Gray, A., Sallis, P., and MacDonell, S., 1997. Software
forensics: Extending authorship analysis techniques to
computer programs. In Proc. 3rd Biannual Conf. Int.
Assoc. of Forensic Linguists (IAFL'97), pages 1-8.
Frantzeskou, G., Gritzalis, S., and MacDonell, S., 2004.
Source code authorship analysis for supporting the
cybercrime investigation process. In Proc. 1st
International Conference on e-Business and
Telecommunications Networks (ICETE'04), Vol. 2,
pages 85-92.
Keselj, V., Peng, F., Cercone, N., and Thomas, C., 2003.
N-gram based author profiles for authorship
attribution. In Proc. Pacific Association for
Computational Linguistics.
Keselj, V., 2003. Perl package Text::N-grams.
http://www.cs.dal.ca/~vlado/srcperl/N-grams or
http://search.cpan.org/author/VLADO/Text-N-grams-
0.03/N-grams.pm
Kilgour, R. I., Gray, A. R., Sallis, P. J., and MacDonell,
S. G., 1997. A fuzzy logic approach to computer
software source code authorship analysis. In The
Fourth International Conference on Neural
Information Processing -- The Annual Conference of
the Asian Pacific Neural Network Assembly
(ICONIP'97), Dunedin, New Zealand.
Krsul, I., and Spafford, E. H., 1995. Authorship analysis:
Identifying the author of a program. In Proc. 8th
National Information Systems Security Conference,
pages 514-524, National Institute of Standards and
Technology.
Krsul, I., and Spafford, E. H., 1996. Authorship analysis:
Identifying the author of a program. Technical Report
TR-96-052.
Longstaff, T. A., and Schultz, E. E., 1993. Beyond
preliminary analysis of the WANK and OILZ worms:
A case study of malicious code. Computers and
Security, 12: 61-77.
MacDonell, S. G., and Gray, A. R., 2001. Software
forensics applied to the task of discriminating between
program authors. Journal of Systems Research and
Information Systems, 10: 113-127.
Oman, P., and Cook, C., 1989. Programming style
authorship analysis. In Seventeenth Annual ACM
Computer Science Conference Proceedings, pages
320-326. ACM.
Peng, F., Schuurmans, D., and Wang, S., 2004.
Augmenting naive Bayes classifiers with statistical
language models. Information Retrieval Journal, 7(1):
317-345.
Sallis, P., Aakjaer, A., and MacDonell, S., 1996. Software
forensics: Old methods for a new science. In
Proceedings of SE:E&P'96 (Software Engineering:
Education and Practice), Dunedin, New Zealand,
IEEE Computer Society Press, pages 367-371.
Spafford, E. H., 1989. The Internet worm program: An
analysis. Computer Communications Review, 19(1):
17-49.
Spafford, E. H., and Weeber, S. A., 1993. Software
forensics: Tracking code to its authors. Computers and
Security, 12: 585-595.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G., 2000.
Automatic text categorisation in terms of genre and
author. Computational Linguistics, 26(4): 471-495.