A Weighted Maximum Entropy Language Model for Text Classification

Kostas Fragos 1, Yannis Maistros 2, Christos Skourlas 3

1 Department of Computer Engineering, National Technical University of Athens, Iroon Polytexneiou 9, 15780 Zografou, Athens, Greece
2 Department of Computer Engineering, National Technical University of Athens, Iroon Polytexneiou 9, 15780 Zografou, Athens, Greece
3 Department of Computer Science, Technical Educational Institute of Athens, Ag. Spyridonos, 12210 Aigaleo, Athens, Greece
Abstract. The Maximum Entropy (ME) approach has been extensively used in various Natural Language Processing tasks, such as language modeling, part-of-speech tagging, text classification and text segmentation. Previous work in text classification was conducted using maximum entropy modeling with binary-valued features or counts of feature words. In this work, we present a different way of applying Maximum Entropy modeling to text classification. Weights are used both to select the features of the model and to estimate the contribution of each extracted feature to the classification task. Using the X square test to assess the importance of each candidate feature, we rank the candidates, and the most highly ranked ones are used as the features of the model. Hence, instead of applying Maximum Entropy modeling in the classical way, we use the X square values to assign weights to the features of the model. Our method was evaluated on the Reuters-21578 dataset for text classification tasks, giving promising results and performing comparably with some “state of the art” classification schemes.
1 Introduction
Manual categorization of electronic digital documents is time-consuming and expensive, and its applicability is limited, especially for very large document collections. Consequently, text classification has increased in importance and economic value, as it is related to key technologies for classifying new electronic documents, extracting interesting information on the Web and guiding users’ search through hypertext.
In early approaches to text classification a document representation model is employed based on term-based vectors. Such vectors are elements of some high dimensional Euclidean space where each dimension corresponds to a term. A classification algorithm trained with supervised learning is then usually applied.
A great number of text categorization and classification techniques have been proposed in the literature, including Bayesian techniques [1],[2],[3], k-Nearest Neighbors (k-NN) classification methods [4],[5],[6], the Rocchio algorithm for Information Retrieval [7],[8], Artificial Neural Network (ANN) techniques [9],[10],[11],[12], the Support Vector Machines (SVM) learning method [13],[14],[15], Hidden Markov Models (HMM) [19],[20], and Decision Tree (DT) classification methods [17],[18],[9],[1]. In most of these methods, the aim is to estimate the parameters of the joint distribution of the object X that we want to classify and a class category C, and to assign the object X to the category with the greatest probability. Unfortunately, the complexity of the problem in real-world applications implies that the estimation of the joint distribution is a difficult task. In general such an estimation involves a potentially infinite set of calculations over all possible combinations of X and elements of C. Using the Bayes formula the problem can be decomposed into the estimation of two components P(X|C) and P(C), known as the conditional class distribution and the prior distribution, respectively.
Maximum Entropy (ME) modeling could be seen as an intuitive way for estimating
a probability and has been successfully applied in various Natural Language Process-
ing (NLP) tasks such as language modeling, part-of-speech tagging and text segmen-
tation [23],[24],[25],[26],[28],[29]. The main principle underlying ME is that the
estimated conditional probability should be as uniform as possible (have the “maxi-
mum entropy”). The main advantage of ME modeling for the classification task is
that it offers a framework for specifying any potentially relevant information. Such
information could be expressed in the form of feature functions, the mathematical
expectations (constraints) of which are estimated based on labeled training data and
characterize the class-specific expectations for the distribution. The main principle of
ME could also be seen in the following way: “Among all the allowed probability
distributions, which conform to the constraints of the training data, the one with the
maximum entropy (the most uniform) is chosen”. It can be proved that there is a unique solution to this problem. However, the uniformity of the solution found, a condition known as the “lack of smoothing”, may be undesirable in some cases. For example, if we have a feature that always predicts a certain class, then this feature may be assigned an excessively high weight. Another potential shortcoming of ME modeling is that the algorithm used to find the solution can be computationally expensive due to the complexity of the problem.
In this work, we try to address these undesirable situations. As is well known, the X square statistic has been widely used in NLP tasks. The X square test for independence can be applied to problems where data is divided into mutually exclusive categories, and it has the advantage that it does not assume normally distributed probabilities. The “essence” of the test is to assess the assumption (null hypothesis) that an object X is independent of a category. If the difference between the observed and expected frequencies is large, we can reject this assumption of independence. Every word term w in a document d is seen as a candidate feature, and the X square statistic is used to test the independence of the word w from each element of the class categories c. This test is conducted by counting the (observed) frequencies of the word in each class category of the training set. The resulting value of the test is then used to select the most representative features for the maximum entropy model, as well as to assign weights to the features, giving each of them a different importance for the classification task.
In section 2 the application of the X square test for feature extraction based on a
sample of data and the related weighting scheme are discussed. In section 3 the maxi-
mum entropy modeling and the improved iterative scaling (IIS) algorithm are pre-
sented. In section 4 we discuss the way of using maximum entropy modeling for text
classification. In section 5 the experimental results are presented and briefly discussed
and in section 6 the conclusions and future activities are given.
2 X Square Test for Feature Selection
Among the most challenging tasks in the classification process, we can distinguish the selection of suitable features to represent the instances of a particular class. Additionally, choosing the best candidate features can be a real burden for the selection algorithm, in terms of effort and time consumption [22].
As it has been mentioned above, each document is represented as a vector of
words, as is typically done in the Information Retrieval approach. Although in most
text retrieval applications the entries (constituents) of a vector are weighted to “re-
flect” the importance of the term, in text classification simpler binary feature values
(i.e., a term either occurs or does not occur in a document) are often used. Usually,
text collections contain millions of unique terms and for reasons of computational
efficiency and efficacy, feature selection is an essential step when applying machine
learning methods for text categorization. In this work, the X square test is used to
reduce the dimensionality of data and is also related to the maximum entropy model-
ing.
In 1900, Karl Pearson developed a statistic that compares all the observed and ex-
pected values (numbers) when the possible outcomes are divided into mutually exclu-
sive categories. The chi-square statistic is expressed by the following equation:

X^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}    (1)

where O_i and E_i denote the observed and expected frequencies, respectively, and the summation runs over the categories of all possible outcomes.
The observed and expected values can be explained in the context of hypothesis
testing. If data is divided into mutually exclusive categories and we can form a null
hypothesis about the sample of the data, then the expected value is the value of each
category if the null hypothesis is true. The observed value for each category is the
value that we observe from the sample data.
The chi-square test is a reliable way of gauging the significance of how closely the
data agree with the detailed implications of a null hypothesis.
To clarify things, let us see an example based on data from Reuters-21578. Suppose that we have two distinct class categories c_1 = ’Acq’ and c_2 ≠ ’Acq’ extracted from the Reuters-21578 ‘ModApte’ split training data set. We are interested in assessing the independence of the word ‘usa’ from the elements of the class categories c_1 and c_2.
From the training data set we remove all the numbers and the words that exist in the stopword list. Counting the frequencies of the word ‘usa’ in the training dataset, we find that the word ‘usa’ appears 1,238 times in the class Acq (c_1 = ’Acq’) and 4,464 times in the other categories (classes), that is, in the class c_2 ≠ ’Acq’. The class ‘Acq’ contains a total of 125,907 word terms, while the other classes contain a total of 664,241. A total of N = 790,148 word terms is contained in the Reuters-21578 training dataset. It is useful to classify the data in contingency Table 1.
Table 1. Contingency table of frequencies for the word ‘usa’ and the class ‘Acq’ (calculation based on the Reuters-21578 ‘ModApte’ split training dataset)

              c_1 = ‘Acq’   c_2 ≠ ‘Acq’    Total
w = ‘usa’           1,238         4,464          5,702
w ≠ ‘usa’         124,669       659,777        784,446
Total             125,907       664,241    N = 790,148
The assumption about the independence (null hypothesis) is that occurrences of the
word ‘usa’ and the class label ‘Acq’ are independent.
We compute now the expected number of observations (frequencies) in each cell
of the table if the null hypothesis is true. These frequencies can be easily determined
by multiplying the appropriate row and column totals and then dividing by the total
number of observations.
Expected frequencies:

w = ’usa’ and c_1 = ’Acq’:  E_11 = (5,702 x 125,907) / 790,148 = 908.59
w = ’usa’ and c_2 ≠ ’Acq’:  E_12 = (5,702 x 664,241) / 790,148 = 4,793.4
w ≠ ’usa’ and c_1 = ’Acq’:  E_21 = (784,446 x 125,907) / 790,148 = 124,998.4
w ≠ ’usa’ and c_2 ≠ ’Acq’:  E_22 = (784,446 x 664,241) / 790,148 = 659,447.6

Using equation 1 we calculate the X^2 value:

X^2 = (1,238 - 908.59)^2 / 908.59 + (4,464 - 4,793.4)^2 / 4,793.4 + (124,669 - 124,998.4)^2 / 124,998.4 + (659,777 - 659,447.6)^2 / 659,447.6 = 143.096
We then compare the calculated X^2 value with a critical value for a significance level α (usually α = 0.05) and one degree of freedom (the statistic has one degree of freedom for a 2x2 contingency table). If the calculated value is greater than the critical value, we can reject the null hypothesis that the word ‘usa’ and the class label ‘Acq’ occur independently. So, a large calculated X^2 value gives strong evidence of dependence for the pair (‘usa’, ‘Acq’). Hence, the word ‘usa’ is a good feature for classification into the category ‘Acq’.
To keep things simple, we are only interested in large X^2 values. Our aim is to choose the most representative features among the large number of candidates and to perform classification in a lower dimensional space.
For a 2-by-2 contingency table the X square value can be calculated by the following formula:

X^2 = \frac{N (a_{11} a_{22} - a_{12} a_{21})^2}{(a_{11} + a_{12})(a_{21} + a_{22})(a_{11} + a_{21})(a_{12} + a_{22})}    (2)

where a_{ij} are the entries of the 2-by-2 contingency table A and N is the total sum of these entries.
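As a concrete illustration, the X^2 score of the (‘usa’, ‘Acq’) example above can be reproduced with a minimal Python sketch of equation 2 (an illustrative sketch only; the function name is of our choosing):

def chi_square_2x2(a11, a12, a21, a22):
    # X square score of a 2-by-2 contingency table (equation 2); equivalent
    # to equation 1 with expected counts derived from row and column totals.
    n = a11 + a12 + a21 + a22
    numerator = n * (a11 * a22 - a12 * a21) ** 2
    denominator = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
    return numerator / denominator

# Counts of Table 1 for the word 'usa' and the class 'Acq'.
print(round(chi_square_2x2(1238, 4464, 124669, 659777), 3))   # approx. 143.1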
The chi-square test has been used in the past for feature selection in the text classification field. Yang and Pedersen [34] compared five measures for term selection and found that chi-square and information gain gave the best performance.
3 Maximum Entropy Approach
3.1 Maximum Entropy Modeling
The origin of entropy can be traced back to Shannon [27], when this concept was used to estimate how much data could be compressed before transmission over a communication channel. The entropy H measures the average uncertainty of a single random variable X:

H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)    (3)

where p(x) is the probability mass function of the random variable X. Equation 3 also gives the average number of bits we need to transfer all the information, and it is used in communication theory to save the bandwidth of a communication channel. We prefer a model of X with less entropy so that we can use fewer bits to encode the uncertainty (information) inside X. However, in NLP tasks we want to find a model that maximizes the entropy. This sounds as though we are violating the basic principle of entropy. Actually, we try to avoid “bias” when the certainty cannot be identified from the empirical evidence.
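For concreteness, equation 3 can be computed with a few lines of Python (an illustrative sketch only):

import math

def entropy(probs):
    # Entropy H(X) in bits of a probability mass function (equation 3).
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: maximal uncertainty over two outcomes
print(entropy([0.99, 0.01]))  # about 0.08 bits: a nearly deterministic variable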
Many problems in NLP could be reformulated as statistical classification problems. The text classification task could be seen as a random process Y which takes as input a document d and produces as output a class label c. The output of the random process Y may be affected by some contextual information X. The domain of X contains all the possible textual information existing in the document d. Our aim is to specify a model p(y|x) which denotes the probability that the model assigns to y ∈ Y when the contextual information is x ∈ X. The notation of this section follows that of Adam Berger [28],[29],[35].
In the first step, we observe the behavior of the random process in a training sample set, collecting a large number of samples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). We can summarize the training sample by defining a joint empirical distribution over x and y from these samples:

\tilde{p}(x, y) = \frac{1}{N} \times \text{number of times } (x, y) \text{ occurs in the sample}    (4)
One way to represent contextual evidence is to encode useful facts as features and to impose constraints on the values of those feature expectations. This is done in the following way. We introduce the indicator function [28],[35]

f(x, y) = \begin{cases} 1 & \text{if } y = \text{‘some\_value\_1’ and } x = \text{‘some\_value\_2’} \\ 0 & \text{otherwise} \end{cases}    (5)
For example, in our classification problem an indicator function may be f(x, y) = 1 if y = ‘c_1’ and x contains the word ‘money’, and f(x, y) = 0 otherwise, where ‘c_1’ is a particular value from the class labels and x is the context (the document) within which the word ‘money’ occurs. Such an indicator function f is called a feature function, or feature for short. Its expected value with respect to the empirical distribution \tilde{p}(x, y) is

\sum_{x,y} \tilde{p}(x, y) f(x, y)    (6)
We can ensure the importance of this statistic by requiring that the expected value the model assigns to the corresponding feature function is in accordance with the empirical expectation of equation 6:

\sum_{x,y} \tilde{p}(x) p(y|x) f(x, y) = \sum_{x,y} \tilde{p}(x, y) f(x, y)    (7)

where \tilde{p}(x) is the empirical distribution of x in the training sample. We call the requirement of equation 7 a constraint equation or simply a constraint [28],[35].
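A minimal sketch of how the two sides of such a constraint could be estimated from a list of labeled samples follows (an illustrative sketch; the feature and helper names are of our choosing, and p_model(y, x) stands for any conditional model p(y|x)):

# Illustrative feature in the spirit of equation 5: it fires when the document
# contains the word 'money' and the label is 'c1' (both values are examples).
def f_money(x, y):
    return 1.0 if ('money' in x and y == 'c1') else 0.0

def empirical_expectation(samples, f):
    # Empirical expectation of equation 6: sum over (x, y) of p~(x, y) f(x, y).
    return sum(f(x, y) for x, y in samples) / len(samples)

def model_expectation(samples, f, p_model, labels):
    # Model-side expectation of equation 7: sum over x of p~(x), sum over y of p(y|x) f(x, y).
    return sum(p_model(y, x) * f(x, y)
               for x, _ in samples for y in labels) / len(samples)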
When the constraints are estimated, there are various conditional probability models that satisfy them. Among all these models there is always a unique distribution that has the maximum entropy, and it can be shown [30] that this distribution has an exponential form:

p_\lambda(y|x) = \frac{1}{Z_\lambda(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)    (8)

where Z_\lambda(x) is a normalizing factor, ensuring a proper probability distribution, given by

Z_\lambda(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)    (9)

and \lambda_i is a parameter to be estimated, associated with the constraint f_i.
The solution corresponding to the maximum entropy model, given by equation 8, is also the solution to a dual maximum likelihood problem for models of the same exponential form. This means that the likelihood surface is convex, having a single global maximum and no local maxima. There is an algorithm that finds the solution by performing Hill Climbing in likelihood space.
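The model of equations 8 and 9 can be written down directly. The following sketch (an illustrative sketch only, with names of our choosing) computes p_\lambda(y|x) for given feature functions f_i and parameters \lambda_i:

import math

def p_lambda(y, x, features, lambdas, labels):
    # Conditional probability of equation 8, with the normalizer Z(x) of equation 9.
    def score(label):
        return math.exp(sum(lam * f(x, label)
                            for f, lam in zip(features, lambdas)))
    z = sum(score(label) for label in labels)   # Z_lambda(x)
    return score(y) / z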
3.2 Improved Iterative Scaling
We now describe a basic outline of the improved iterative scaling (IIS) algorithm, a Hill Climbing algorithm for estimating the parameters \lambda_i of the maximum entropy model, specially adjusted for text classification. The notation of this section follows that of Nigam et al. [31], with x representing a document d and y a class label c. Given a training dataset D, which consists of pairs (d, c(d)), where d is the document and c(d) the class label to which the document belongs, we can calculate the log-likelihood of the model of equation 8:

L(\Lambda|D) = \log \prod_{d \in D} p_\Lambda(c(d)|d) = \sum_{d \in D} \sum_i \lambda_i f_i(d, c(d)) - \sum_{d \in D} \log \sum_c \exp\left( \sum_i \lambda_i f_i(d, c) \right)    (10)
The algorithm is applicable whenever the feature functions f_i(d, c(d)) are non-negative.
To find the global maximum of the likelihood surface, the algorithm must start from an initial exponential distribution of the correct form (that is, it must “guess” a starting point) and then perform Hill Climbing in likelihood space. So, we start from an initial value for the parameters \lambda_i, say \lambda_i = 0 for i = 1:K (where K is the total number of features), and at each step we improve them by setting them equal to \lambda_i + \delta_i, where \delta_i is an increment quantity. It can be shown that at each step the best \delta_i can be found by solving the equation:

\sum_{d \in D} f_i(d, c(d)) = \sum_{d \in D} \sum_c p_\Lambda(c|d) f_i(d, c) \exp\left( \delta_i f^{\#}(d, c) \right)    (11)

where f^{\#}(d, c) is the sum of all the features in the training instance d.
Equation 11 can be solved in a “closed” form if f^{\#}(d, c) is constant, say M, for all d, c [28],[31]:

\delta_i = \frac{1}{M} \log \frac{\sum_{d \in D} f_i(d, c(d))}{\sum_{d \in D} \sum_c p_\Lambda(c|d) f_i(d, c)}    (12)

where p_\Lambda(c|d) is the distribution of the exponential model of equation 8.
If this is not true, then equation 11 can be solved using a numeric root-finding procedure, such as Newton’s method.
However, we can still obtain a “closed” form solution by adding an extra feature that guarantees that f^{\#}(d, c) is constant for all d, c, in the following way:
We define M as the greatest possible feature sum:

M = \max_{d, c} \sum_{i=1}^{K} f_i(d, c)    (13)

and add an extra feature, that is defined as follows:

f_{K+1}(d, c) = M - \sum_{i=1}^{K} f_i(d, c)    (14)
Now we can present the improved iterative scaling (IIS) algorithm:

Begin
    Add an extra feature f_{K+1} following equations 13, 14
    Initialize \lambda_i = 0 for i = 1:K+1
    Repeat
        Calculate the expected class label probabilities p_\Lambda(c|d)
            for each document with the current parameters, using equation 8
        Calculate \delta_i from equation 12
        Set \lambda_i = \lambda_i + \delta_i
    Until convergence
    Output: optimal parameters \lambda_i, optimal model p_\Lambda
End
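Under the constant feature-sum condition established by equations 13 and 14, a single IIS increment can be computed in closed form, as sketched below (an illustrative sketch that assumes the p_lambda helper sketched in section 3.1 and non-negative feature functions):

import math

def iis_delta(i, training, features, lambdas, labels, M):
    # One closed-form IIS increment delta_i (equation 12).
    # training : list of (document, gold_class) pairs
    # features : list of feature functions f_j(d, c)
    # M        : the constant feature sum guaranteed by the extra feature f_{K+1}
    f_i = features[i]
    empirical = sum(f_i(d, c) for d, c in training)
    expected = sum(p_lambda(c, d, features, lambdas, labels) * f_i(d, c)
                   for d, _ in training for c in labels)
    return (1.0 / M) * math.log(empirical / expected)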
4 Maximum Entropy Modeling for Text Classification
The basic shortcoming of the IIS algorithm is that it may be computationally expensive due to the complexity of the classification problem. Moreover, the uniformity of the solution found (lack of smoothing) can also cause problems. For example, if we have a feature that always predicts a certain class, then this feature may be assigned an excessively high weight. The innovative point in this work is to use the X square test to rank all the candidate feature words, that is, all the word terms that appear in the training set, and then to select the most highly ranked of them for use in the maximum entropy model.
If we decide to select the K most highly ranked word terms w_1, w_2, ..., w_K, we instantiate the features as follows:

f_i(d, c) = \begin{cases} xsquare(i) & \text{if word } w_i \text{ occurs in } d \text{ and the pair } (d, c) \text{ appears in the training set} \\ 0 & \text{otherwise} \end{cases}    (15)
where xsquare(i) denotes the X square score of the word w_i obtained during the feature selection phase.
This way of instantiating features has two advantages: first, it gives a weight to each feature and, second, it creates a separate list of features for each class label. The features in each list can differ from class to class. These features are activated only in the presence of the particular class label and are strong indicators of it. Of course, some features are common to more than one class. These lists of features are used by the resulting binary text classifier (the optimal model of the IIS algorithm) to calculate the expected class label probabilities for a document d, using equation 8, and then to assign the document d to the class with the highest probability.
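A minimal sketch of this weighted feature instantiation and of the final classification step follows (an illustrative sketch; the identifiers are of our choosing, chi_square_scores[c] is assumed to map each selected word of class c to its X square value, and p_lambda is the helper sketched in section 3.1):

def make_weighted_features(chi_square_scores):
    # Build the class-specific weighted features of equation 15.
    features = []
    for c, word_scores in chi_square_scores.items():
        for w, score in word_scores.items():
            def f(d, label, w=w, c=c, score=score):
                return score if (label == c and w in d) else 0.0
            features.append(f)
    return features

def classify(d, features, lambdas, labels):
    # Assign document d to the class with the highest p_lambda(c|d).
    return max(labels, key=lambda c: p_lambda(c, d, features, lambdas, labels))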
5 Experimental Results
We evaluated our method using the “ModApte” split of the Reuters-21578 dataset
compiled by David Lewis. The “ModApte” split leads to a corpus of 9,603 training
documents and 3,299 test documents. We chose to evaluate only the ten (10) categories (out of the 135 potential topic categories) for which there is a sufficient number of training and test document examples. We want to build a binary classifier, so we split the documents into two groups: the ‘Yes’ group, where the document belongs to the category, and the ‘No’ group, where the document does not belong to the category. The 10 categories with the number of documents for the training and test phases are shown in Table 2.
Table 2. The 10 categories from the “ModApte” split of the Reuters-21578 dataset and the number of documents for the training and the test phase of a binary classifier.

Category     Train Yes   Train No   Test Yes   Test No
Acq              1615       7988        719       2580
Corn              175       9428         56       3243
Crude             383       9220        189       3110
Earn             2817       6786       1087       2212
Grain             422       9181        149       3150
Interest          343       9260        131       3168
Money-fx          518       9085        179       3120
Ship              187       9416         89       3210
Trade             356       9247        117       3182
Wheat             206       9397         71       3228
In the training phase 9,603 documents were parsed. We avoided stemming the words and simply removed all the numbers and the words contained in a stopword list. This preprocessing phase yielded 32,412 distinct terms out of a total of 790,148 word terms. The same preprocessing was conducted in the test phase.
We applied the X square test to this set of candidate features (see section 2) and then selected, for the maximum entropy model, the 2,000 most highly ranked word terms for each category. Table 3 presents, for each category, the 10 top ranked word terms calculated by the X square test.
Table 3. The 10 top ranked words calculated by the X square test for the 10 categories (the ModApte Reuters-21578 training dataset)

Acq          Corn        Crude        Earn          Grain
bgas         values      Crude        earn          Filing
annou        july        Comment      usa           Prevailing
ameritech    egypt       Spoke        convertible   Outlined
calny        agreed      stabilizing  moody         Brian
adebayo      shipment    cancel       produce       Marginal
echoes       belgium     shipowners   former        Winds
affandi      oilseeds    foresee      borrowings    Proceedings
f8846        finding     sites        caesars       neutral
faded        february    techniques   widespread    requiring
faultered    permitted   stayed       honduras      bangladesh

Interest       Money-fx      Ship          Trade            Wheat
money          flexible      acq           trade            rumors
fx             conn          deficit       brazil           monetary
discontinued   proposals     buy           agreement        eastern
africa         soon          officials     chirac           policy
signals        requirement   price         communications   cbt
anz            slow          attempt       growth           storage
exploration    soybeans      mitsubishi    restraint        proposal
program        robert        mths          ran              reuter
tuesday        calculating   troubled      slowly           usually
counterparty   speculators   departments   conclusion       moisture
Table 4. Micro-average breakeven performance for 5 different learning algorithms explored by Dumais et al., compared with the WMEM algorithm

Category    FindSim   NBayes   BayesNets   Trees   LinearSVM   WMEM
earn          92.9%    95.9%      95.8%    97.8%      98.2%    97.98%
acq           64.7%    87.8%      88.3%    89.7%      92.7%    87.93%
money-fx      46.7%    56.6%      58.8%    66.2%      73.9%    75.09%
grain         67.5%    78.8%      81.4%    85.0%      94.2%    83.37%
crude         70.1%    79.5%      79.6%    85.0%      88.3%    72.20%
trade         65.1%    63.9%      69.0%    72.5%      73.5%    48.16%
interest      63.4%    64.9%      71.3%    67.1%      75.8%    64.21%
ship          49.2%    85.4%      84.4%    74.2%      78.0%    50.22%
wheat         68.9%    69.7%      82.7%    92.5%      89.7%    69.88%
corn          48.2%    65.3%      76.4%    91.8%      91.1%    57.36%
Using the 2,000 most highly ranked word terms for each category, we instantiate the features of the maximum entropy model (see section 4). Using 200 iterations in the training phase of the classifier, the IIS algorithm outputs the optimal \lambda_i’s, that is, the optimal model p_\Lambda(c|d). We call this method Weighted Maximum Entropy Modeling (WMEM) to emphasize the fact that we use selected features and assign weights to them.
To evaluate the classification performance of the binary classifiers we use the so-called precision/recall breakeven point, which is the standard measure of performance in text classification and is defined as the value for which precision and recall are equal. Precision is the proportion of items placed in the category that really belong to the category, and recall is the proportion of items in the category that are actually placed in the category. Table 4 summarizes the breakeven point performance for 5 different learning algorithms, based on research conducted by Dumais et al. [32], and for our Weighted Maximum Entropy Model over the 10 most frequent Reuters categories. Figure 1 is a graphical representation of the systems’ performance to help compare the approaches more easily.
[Figure 1: a bar chart of the breakeven point performance per category for the 5 learning algorithms (FindSim, NBayes, BayesNets, Trees, LinearSVM) and for the WMEM algorithm.]

Fig. 1. Breakeven performance calculated over the top ten (10) categories of the Reuters-21578 dataset. The Weighted Maximum Entropy Model (WMEM, the last bar) is compared to the five (5) learning algorithms explored by Dumais et al.
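The breakeven point reported above can be approximated by sweeping a decision threshold over the classifier scores of a category and looking for the point where precision and recall (approximately) coincide; a rough sketch under this assumption (not the exact procedure we used) follows:

def breakeven_point(scores, gold):
    # Approximate precision/recall breakeven for one binary category.
    # scores : classifier score p(c|d) for every test document
    # gold   : True if the document really belongs to the category
    if not any(gold):
        return 0.0
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and g for p, g in zip(predicted, gold))
        if tp == 0:
            continue
        precision = tp / sum(predicted)
        recall = tp / sum(gold)
        if abs(precision - recall) < 0.01:   # precision and recall (nearly) equal
            best = max(best, (precision + recall) / 2)
    return best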
The results shown in Table 4 and Fig. 1 indicate that our method gives promising results, especially in the case of the larger categories. It performs better than the other classifiers in the ‘money-fx’ category and outperforms most of the other classifiers in some of the categories with the largest test sets, such as ‘earn’, ‘acq’ and ‘grain’.
6 Discussion and Future Work
There are three other works using maximum entropy for text classification. The work of Ratnaparkhi [26] is a preliminary experiment that uses binary features. The work of Mikheev [33] examines the performance of maximum entropy modeling and conducts feature selection for text classification on the RAPRA corpus, a corpus of technical abstracts; in this work binary features were also used. Nigam et al. [31] use counts of occurrences instead of binary features and show that maximum entropy is competitive with, and sometimes better than, the naïve Bayes classifier.
In this work, we have extended the previous research results using a feature selection strategy and assigning weights to the features, calculated by the X square test. The results of the evaluation are very promising. However, the experiments will be continued in two directions. We shall conduct new experiments changing the number of the selected features and/or the selection strategy, as well as the number of iterations in the training phase. Additional experiments using alternative datasets, such as the WebKB dataset, the Newsgroups dataset, etc., will be conducted in order to estimate the performance of the proposed method more accurately.
Acknowledgements
This work was co-funded 75% by the E.U. and 25% by the Greek Government under the framework of the Education and Initial Vocational Training Program “Archimedes”.
References
1. Lewis, D. and Ringuette, M., A comparison of two learning algorithms for text categoriza-
tion. In The Third Annual Symposium on Document Analysis and Information Retrieval
pp.81-93, 1994
2. Makoto, I. and Takenobu, T., Cluster-based text categorization: a comparison of category
search strategies, In ACM SIGIR'95, pp.273-280, 1995
3. McCallum, A. and Nigam, K., A comparison of event models for naïve Bayes text classifi-
cation, In AAAI-98 Workshop on Learning for Text Categorization, pp.41-48, 1998
4. Masand, B., Lino, G. and Waltz, D., Classifying news stories using memory based reason-
ing, In ACM SIGIR'92, pp.59-65, 1992
5. Yang, Y. and Liu, X., A re-examination of text categorization methods, In ACM SIGIR’99,
pp.42-49, 1999
6. Yang, Y., Expert network: Effective and efficient learning from human decisions in text
categorization and retrieval, In ACM SIGIR'94, pp.13-22, 1994
7. Buckley, C., Salton, G. and Allan, J., The effect of adding relevance information in a rele-
vance feedback environment, In ACM SIGIR’94, pp.292-300, 1994
8. Joachims, T., A probabilistic analysis of the rocchio algorithm with TFIDF for text catego-
rization, In ICML’97, pp.143-151, 1997
9. Guo, H. and Gelfand S. B., Classification trees with neural network feature extraction, In
IEEE Trans. on Neural Networks, Vol. 3, No. 6, pp.923-933, Nov., 1992
10. Liu, J. M. and Chua, T. S., Building semantic perception net for topic spotting, In ACL’01,
pp.370-377, 2001
11. Ruiz, M. E. and Srinivasan, P., Hierarchical neural networks for text categorization, In
ACM SIGIR’99, pp.81-82, 1999
12. Schutze, H., Hull, D. A. and Pedersen, J. O., A comparison of classifier and document
representations for the routing problem, In ACM SIGIR’95, pp.229-237, 1995
13. Cortes, C. and Vapnik, V., Support vector networks, In Machine Learning, Vol.20, pp.273-
297, 1995
14. Joachims, T., Learning to classify text using Support Vector Machines, Kluwer Academic
Publishers, 2002
15. Joachims, T., Text categorization with Support Vector Machines: learning with many
relevant features, In ECML’98, pp.137-142, 1998
16. Schapire, R. and Singer, Y., BoosTexter: A boosting-based system for text categorization,
In Machine Learning, Vol.39, No.2-3, pp.135-168, 2000
17. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C.J., Classification and Regression
Trees, Wadsworth Int. 1984
18. Brodley, C. E. and Utgoff, P. E., Multivariate decision trees, In Machine Learning, Vol.19,
No.1, pp.45-77, 1995
19. Denoyer, L., Zaragoza, H. and Gallinari, P., HMM-based passage models for document
classification and ranking, In ECIR’01, 2001
20. Miller, D. R. H., Leek, T. and Schwartz, R. M., A Hidden Markov model information
retrieval system, In ACM SIGIR’99, pp.214-221, 1999
21. Kira, K. and Rendell, L., A practical approach to feature selection, In Proc. 9th International Workshop on Machine Learning, pp.249-256, 1992
22. Gilad-Bachrach, R., Navot, A. and Tishby, N., Margin based feature selection - theory and algorithms, In Proc. of ICML 2004
23. Chen, S. F. and Rosenfeld, R., A Gaussian prior for smoothing maximum entropy models, Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999
24. Rosenfeld, R., Adaptive statistical language modelling: A maximum entropy approach, PhD thesis, Carnegie Mellon University, 1994
25. Ratnaparkhi, A., Reynar, J. and Roukos, S., A maximum entropy model for prepositional phrase attachment, In Proceedings of the ARPA Human Language Technology Workshop, pp.250-255, 1994
26. Ratnaparkhi, A., A maximum entropy model for part-of-speech tagging, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1996
27. Shannon C.E. 1948. A mathematical theory of communication. Bell System Technical
Journal 27:379 – 423, 623 – 656
28. Berger, A., A Brief Maxent Tutorial, http://www-2.cs.cmu.edu/~aberger/maxent.html
29. Berger, A., The improved iterative scaling algorithm: a gentle introduction, 1997, http://www-2.cs.cmu.edu/~aberger/maxent.html
30. Della Pietra, S., Della Pietra, V. and Lafferty, J., Inducing features of random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1997
31. Nigam K., J. Lafferty, A. McCallum. Using maximum entropy for text classification, 1999
32. Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M, Inductive learning algorithms and
representations for text categorization. Submitted for publication, 1998
http://research.microsoft.com/~sdumais/cikm98.doc
33. Mikheev, A., Feature lattices and maximum entropy models, In Machine Learning, McGraw-Hill, New York, 1999
34. Yang, Y. and Pedersen J., A comparative study on feature selection in text categorization.
Fourteenth International Conference on Machine Learning (ICML’97) pp 412-420, 1997
35. Berger A., Della Pietra S., Della Pietra V., A maximum entropy approach to natural
language processing, Computational Linguistics, 22 (1), pp 39-71, 1996