PEDESTRIAN RECOGNITION FOR INTELLIGENT TRANSPORTATION SYSTEMS

D. Fernández, I. Parra, M. A. Sotelo, L. M. Bergasa, P. Revenga, J. Nuevo, M. Ocaña

Department of Electronics, University of Alcalá

Alcalá de Henares, Madrid, Spain

Keywords:

Pedestrian Recognition, Support Vector Machines, Stereovision, Intelligent Transportation Systems.

Abstract:

This paper describes a binocular vision-based pedestrian recognition system. The basic components of pedestrians are first located in the image and then combined with an SVM-based classifier. This poses the problem of pedestrian detection and recognition in real, cluttered road images. Candidate pedestrians are located using a subtractive clustering attention mechanism. A distributed learning approach is proposed in order to better deal with pedestrian variability, illumination conditions, partial occlusions and rotations. The performance of the pedestrian recognition system is enhanced by a multiframe validation process, which largely increases the detection rate. A database containing hundreds of pedestrian examples extracted from real traffic images has been created for learning purposes. We present and discuss the results achieved to date.

1 INTRODUCTION

This paper describes a binocular vision-based pedestrian recognition system in the framework of Intelligent Transportation Systems (ITS) technologies. In our approach, the basic components of pedestrians are first located in the image and then combined with an SVM-based classifier. The challenge is to use a pair of FireWire digital cameras as input, in order to achieve a low-cost final solution that meets the requirements of serial production. The digital cameras provide range measurements by means of stereo vision. Some previous works use sensing methods such as laser scanners (Fuerstenberg et al., 2002), stereovision (Gavrila et al., 2004) (Grubb et al., 2004), or a combination of both (Labayrade et al., 2003). Only a few works deal with the problem of monocular pedestrian recognition using pattern recognition techniques (Shashua et al., 2004). Pedestrian recognition is a challenging problem in real, cluttered traffic environments. It is a complex problem because it requires that the object class exhibit high interclass and low intraclass variability. In addition, pedestrian recognition should perform robustly under variable illumination conditions, variable rotated positions, and even when some of the pedestrian parts or limbs are partially occluded.

Object recognition techniques can be classified into three major categories, as described in (Mohan et al., 2001). The first category is represented by model-based systems, in which a model is defined for the object of interest and the system attempts to match the model to different parts of the image in order to find a fit. Unfortunately, pedestrians form such a variable class that it is impossible to define a model that represents the class in an accurate, general way. In consequence, model-based systems are of little use for pedestrian recognition purposes. The second category comprises image invariance methods, which perform a matching based on a set of image pattern features that, supposedly, uniquely determine the object being searched for. Pedestrians do not exhibit any deterministic image pattern relationships because of their large variability (size, pose and so forth). For this reason, image invariance methods are not a viable option for solving the pedestrian recognition problem. The third category of object detection techniques is characterised by example-based learning algorithms. The salient features of a class are learnt by the system from a set of examples. This type of technique can provide a solution to the pedestrian recognition problem as long as the following conditions are met.

• A sufficiently large number of pedestrian examples is contained in the database.

• The examples are representative of the pedestrian class in terms of variability, illumination conditions, position and size in the image.

Fernández D., Parra I., Sotelo M. A., Bergasa L. M., Revenga P., Nuevo J. and Ocaña M. (2005). PEDESTRIAN RECOGNITION FOR INTELLIGENT TRANSPORTATION SYSTEMS. In Proceedings of the Second International Conference on Informatics in Control, Automation and Robotics (ICINCO 2005) - Robotics and Automation, pages 292-297. DOI: 10.5220/0001179702920297. SciTePress.

Example-based techniques have been previously used in natural, cluttered environments for pedestrian detection (Shashua et al., 2004) (Gavrila et al., 2004). In general, these techniques are easy to use with objects composed of distinct identifiable parts arranged in a well-defined configuration. A distributed learning approach based on components (Mohan et al., 2001) is more efficient for object recognition in real cluttered environments than holistic approaches (Papageorgiou and Poggio, 2000). Distributed learning techniques can deal with partial occlusions and are less sensitive to object rotations. However, in spite of their ability to detect objects in real images, we propose to reduce the pedestrian search space in an intelligent manner, based on the road image, so as to increase the performance of the detection module. Accordingly, road lane markings are detected and used as the guidelines that drive the pedestrian search process. The area bounded by the lane limits determines the zone of the real 3D scene where pedestrians are searched for. The objects found in the search area are passed on to the pedestrian recognition module. This helps reduce the rate of false positive detections. If no lane markings are detected, a basic area of interest covering the region ahead of the ego-vehicle is used instead. The lane marking detection system is described in (Sotelo et al., 2005). The rest of the paper is organised as follows: Section 2 provides a description of the candidate selection mechanism. Section 3 describes the pedestrian recognition system. The results achieved to date are presented in Section 4. Finally, Section 5 summarizes the conclusions and future work.

2 CANDIDATE SELECTION

We have developed a calibrated stereo platform and calculated the intrinsic parameters of each camera, and the extrinsic parameters between them, in order to obtain the fundamental matrix that defines the epipolar geometry of the system. In this way, a perfect physical alignment of the cameras, which underlies the assumption of parallel epipolar lines, is not necessary, because the stereo calibration process mathematically defines the geometric relationship between the cameras (Xu and Zhang, 1996).

The first task is image preprocessing, which has two steps: normalizing intensity values, to correct for differences between the two images, and eliminating radial and tangential distortion. We then apply the Canny algorithm for feature extraction on the left image. The Canny image provides a good representation of the discriminating features of pedestrians, as depicted in Figure 1. Features such as heads, arms and legs are visible and distinguishable, and are not affected by colour or intensity. This gives some indication of the discriminating zones for the pedestrian recognition system.

Figure 1: Examples of Canny images. Upper row: pedestrian examples. Bottom row: non-pedestrian examples.

In order to extract 3D scene information, some authors use disparity map techniques combined with v-disparity segmentation (Grubb et al., 2004) (Labayrade et al., 2003). This option was discarded because of the disadvantages associated with disparity computation algorithms: prior to disparity map generation, the image pair has to be rectified to ensure good correspondence matching. In addition, the information used for generic obstacle detection is reduced to a vertical line in the v-disparity image. This implies managing very little information to detect obstacles, which works for large objects such as vehicles but may not be enough for smaller objects such as pedestrians. Instead, our approach solves the correspondence problem and creates a map of 3D points whose origin is placed at the left camera. For each point detected by the Canny algorithm, we use the fundamental matrix to search for the corresponding point in the other image along its epipolar line, fixing the maximum distance between corresponding points in order to reduce the cost of matching.
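The epipolar search described above can be illustrated with a minimal sketch. It relies on the standard relation l' = F x, which maps a left-image point x (in homogeneous coordinates) to its epipolar line l' in the right image. The function names, the tolerance `line_tol`, the bound `max_shift` and the toy matrix `F_TOY` are assumptions made for illustration; they are not values or code from the system described in this paper.

```python
import math

def epipolar_line(F, point):
    """Coefficients (a, b, c) of the line a*u + b*v + c = 0 given by l' = F x."""
    x = (point[0], point[1], 1.0)  # homogeneous coordinates of the left-image point
    return tuple(sum(F[r][k] * x[k] for k in range(3)) for r in range(3))

def distance_to_line(line, point):
    """Perpendicular distance, in pixels, from an image point to a line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / math.hypot(a, b)

def epipolar_candidates(F, left_point, right_points, line_tol=1.0, max_shift=64.0):
    """Right-image edge points that lie close to the epipolar line of left_point
    and within a bounded distance of it (both thresholds are hypothetical)."""
    line = epipolar_line(F, left_point)
    kept = []
    for p in right_points:
        shift = math.hypot(p[0] - left_point[0], p[1] - left_point[1])
        if distance_to_line(line, p) <= line_tol and shift <= max_shift:
            kept.append(p)
    return kept

# Toy fundamental matrix of an ideal, purely horizontal stereo pair (illustrative only).
F_TOY = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
```

With `F_TOY`, the epipolar line of the left point (120, 80) is the image row v = 80, so only right-image points on that row and within `max_shift` pixels remain as candidates. A calibrated, non-aligned rig would simply supply a different matrix F.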

The correspondence problem can be solved using a wide spectrum of matching techniques, but most recent successes have been with area-based algorithms. Specifically, the Zero Mean Normalized Cross Correlation (ZMNCC) has performed most robustly (Boufama, 1994). For a given point in the left image, this algorithm seeks the point of the right image with the largest correlation response, a result that depends on the window size. As the window size decreases, the discriminatory power of the area-based criterion decreases and several local maxima of the ZMNCC may be found in the search region. Conversely, continually increasing the window size degrades performance because of occlusion regions and the smoothing of disparity values across depth boundaries. In consequence, the correspondences are often not correct.
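As an illustration of the area-based criterion, the following sketch computes the ZMNCC between two square windows and selects the best-scoring candidate in the right image. It uses the usual definition of the measure (mean-subtracted windows, normalized by the product of their norms). The helper names `window` and `best_match`, and the convention of returning 0 for a textureless window, are assumptions of this sketch rather than details of the implementation described here.

```python
import math

def zmncc(patch_a, patch_b):
    """Zero Mean Normalized Cross Correlation of two equally sized patches, in [-1, 1]."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and of equal size")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0  # textureless window: no usable response

def window(img, row, col, half):
    """Square window of side 2*half+1 centred at (row, col), or None if it leaves the image."""
    if row - half < 0 or col - half < 0 or row + half >= len(img) or col + half >= len(img[0]):
        return None
    return [r[col - half:col + half + 1] for r in img[row - half:row + half + 1]]

def best_match(left, right, left_pt, candidates, half):
    """Return (position, score) of the right-image candidate with the largest ZMNCC."""
    ref = window(left, left_pt[0], left_pt[1], half)
    best_pos, best_score = None, -2.0
    if ref is None:
        return best_pos, best_score
    for row, col in candidates:
        cand = window(right, row, col, half)
        if cand is None:
            continue  # window falls outside the right image
        score = zmncc(ref, cand)
        if score > best_score:
            best_pos, best_score = (row, col), score
    return best_pos, best_score
```

Here `candidates` would be the right-image positions sampled along the epipolar line, and `half` sets the window size whose trade-off is discussed above.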

Given these limitations, filtering criteria are needed in order to reject outliers. We create an XZ map (bird's eye map) and apply three steps. First, we extract the 3D points within the pedestrian search area (provided by the road lane marking detection system). Second, road surface points (road markings) and high points, above 2 m, are removed. Finally, we filter the XZ map according to a neighbourhood criterion. Figure 2 depicts the filtering criteria.

Figure 2: Filtering criteria and XZ maps.
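The three filtering steps can be sketched as follows. Only the 2 m height limit comes from the text. The rectangular search area, the ground tolerance `min_height`, the neighbourhood `radius`, the `min_neighbours` count, and the convention that the y coordinate is the height above the road in metres are all assumptions made for this example.

```python
import math

def filter_points(points, x_range, z_range, min_height=0.2, max_height=2.0,
                  radius=0.3, min_neighbours=2):
    """Filter 3D points (x, y, z) in metres; y is assumed to be height above the road."""
    # Step 1: keep points inside the search area (here a hypothetical XZ rectangle).
    in_area = [p for p in points
               if x_range[0] <= p[0] <= x_range[1] and z_range[0] <= p[2] <= z_range[1]]
    # Step 2: drop road-surface points and points above the height limit.
    off_road = [p for p in in_area if min_height < p[1] <= max_height]
    # Step 3: neighbourhood criterion on the XZ (bird's eye) map.
    kept = []
    for i, p in enumerate(off_road):
        neighbours = sum(1 for j, q in enumerate(off_road)
                         if j != i and math.hypot(p[0] - q[0], p[2] - q[2]) <= radius)
        if neighbours >= min_neighbours:
            kept.append(p)
    return kept
```

The neighbour count is quadratic in the number of points, which is acceptable for a sketch. A grid over the XZ map would be a natural replacement if speed mattered.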

As we can see in Figure 3, the appearance of pedestrians in 3D space is represented by a uniformly distributed set of points. Data clustering techniques are concerned with partitioning a data set into several groups such that the similarity within a group is larger than that among groups. The common approach of clustering techniques is to find cluster centres that represent each cluster, and normally the number of clusters is known beforehand. This is the case of K-means based algorithms. In our case the number of clusters is unknown, outlier effects have to be reduced or completely eliminated, and it is necessary to define specific space characteristics in order to group different pedestrians in the scene. For these reasons we use Subtractive Clustering (Chiu, 1994), which is applied in fuzzy model identification systems and is based on a measure of the density of data points. The idea is to find regions in the feature space with high densities of data points. The point with the highest number of neighbours is selected as the centre of a cluster. The data points within a prespecified neighbourhood radius are then removed (subtracted), and the algorithm looks for a new point with the highest number of neighbours.

Figure 3: Left: 2D points in the left image. Right: location of the 3D points.

We carry out this algorithm using a 3-dimensional neighbourhood radius $r_a$, specified component-wise. Since each data point is a candidate for a cluster centre, a density measure at data point $\mu_i = (x_i, y_i, z_i)$ is defined as

$$D_i = \sum_{j=1}^{n} \exp\left(-\frac{\|\mu_i - \mu_j\|^2}{(r_a/2)^2}\right) \qquad (1)$$

where $n$ is the number of data points. Let $\mu_{c_1}$ be the point with highest density and $D_{c_1}$ its density measure. Next, the density measure for each point $\mu_i$ is revised by the formula

$$D_i' = D_i - D_{c_1} \exp\left(-\frac{\|\mu_i - \mu_{c_1}\|^2}{(r_b/2)^2}\right) \qquad (2)$$

where $r_b$ defines a neighbourhood to be reduced in density measure. It is normally larger than $r_a$ to prevent closely spaced cluster centres, typically $r_b = 1.5\,r_a$, applied component-wise. After the density measure for each point is revised, the next cluster centre is selected and all the density measures are revised again. The process is repeated until a sufficient number of cluster centres has been generated. After applying subtractive clustering to a set of input data, each cluster represents a candidate. Pedestrian classification is then performed in 2D, in the ROI defined by the image projection of the 3D candidate regions. Figure 4 depicts the multiple candidate regions of interest generated by the clustering mechanism in a sequence of images. The number of candidates nonetheless varies with traffic conditions.
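The density-based selection of cluster centres can be sketched in a few lines. The sketch follows the two density formulas above, with two assumptions: the per-axis radius is applied by scaling each coordinate difference by half the radius of its axis, and the loop stops when the remaining peak density falls below a fraction `stop_ratio` of the first peak. The parameter names `squash`, `stop_ratio` and `max_centres` are hypothetical. The text only states that the process ends once enough centres have been generated.

```python
import math

def subtractive_clustering(points, ra, squash=1.5, stop_ratio=0.15, max_centres=10):
    """Return cluster centres chosen among `points` (tuples of equal length).

    ra holds one neighbourhood radius per axis; rb = squash * ra component-wise.
    """
    if not points:
        return []
    rb = [squash * r for r in ra]

    def kernel(p, q, radius):
        # exp(-||p - q||^2 / (r/2)^2), with the radius applied axis by axis.
        d2 = sum(((a - b) / (r / 2.0)) ** 2 for a, b, r in zip(p, q, radius))
        return math.exp(-d2)

    # Density of every point: sum of the kernel over all data points.
    density = [sum(kernel(p, q, ra) for q in points) for p in points]

    centres, first_peak = [], None
    while len(centres) < max_centres:
        best = max(range(len(points)), key=density.__getitem__)
        peak = density[best]
        if first_peak is None:
            first_peak = peak
        elif peak < stop_ratio * first_peak:
            break  # remaining density too low to justify another centre
        centre = points[best]
        centres.append(centre)
        # Revise every density by subtracting the influence of the new centre.
        density = [d - peak * kernel(p, centre, rb) for p, d in zip(points, density)]
    return centres
```

For example, two well-separated groups of 3D points yield two centres, one per group, without the number of clusters being supplied in advance.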

3 PEDESTRIAN RECOGNITION

The appearance of pedestrians in the scene presents a wide variability (moving longitudinally, moving laterally, stationary, etc.). In consequence, it makes sense to use a distributed learning approach in which each pedestrian body part is independently learnt by a specialized classifier in a first learning stage. The local body parts are then integrated by another classifier in a second learning stage. The proposed approach can be regarded as a hierarchical one. By using independent classifiers in a distributed manner the learning process is simplified, since each classifier only has to learn individual features of local regions in certain conditions. It would otherwise be difficult to attain an acceptable result using a holistic approach. We have considered a total of 6 different sub-regions for each candidate region of interest, which is resized to 24×72 pixels. The first sub-region is located at the head. The arms and legs are covered by the second to fifth sub-regions. In addition, we define a sub-region located between the legs, covering an area with relevant information depending on the pedestrian pose. The locations of the six sub-regions have been chosen in an attempt to detect coherent pedestrian features, as depicted in Figure 5.

Figure 4: Generation of candidate regions of interest in a sequence of images.

Figure 5: Left: decomposition of a candidate region of interest into 6 sub-regions. Right: examples in a sequence of images.

A set of features must be extracted from each sub-region and fed to the classifier. These features are expected to be invariant to local shifts of the candidate region of interest caused by changes of pose and by the articulation of the pedestrian's arms and legs. Several feature extractors have been tested: co-occurrence matrices over the Canny edge image and over the image normalized to 32 levels, a normalized orientation histogram over the image normalized to 128 levels, image gradient magnitudes and orientations, and finally the texture unit number (NTU). The co-occurrence matrices over the Canny edge image are computed by accumulating the occurrences of the 4 possible bit combinations in the image (00, 01, 10, 11), which yields one 2 × 2 matrix for each direction. As we have chosen 4 orientations (90°, 45°, 0°, 315°), we get 4 matrices per sub-region and therefore a 16-element vector. When the co-occurrence is computed over the normalized image instead of the Canny edge image, the image is first normalized to 32 levels so that the co-occurrence matrices are not too large. This yields four 32 × 32 matrices per sub-region, which is a much more manageable size. The normalized orientation histogram accumulates the magnitude of the difference between pixels in the 4 orientations, delivering four vectors of length 128 per sub-region. Image gradient magnitudes and orientations are fed directly to the classifier, and their size depends on that of the sub-region. Finally, NTU extracts the local texture information of a neighbourhood of pixels (Wang, 1990), and the size of its vectors also depends on the sub-region size.
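The co-occurrence feature over the Canny edge image lends itself to a short sketch. Each pixel of a binary edge image is paired with its neighbour in a given direction, and the four bit combinations are counted in a 2 × 2 matrix. The mapping of the four orientations to pixel offsets at a distance of one pixel, and the function names, are assumptions of this example.

```python
# (row offset, column offset) for each orientation, with rows growing downwards.
# The unit pixel distance is an assumption; the text only lists the angles.
OFFSETS = {90: (-1, 0), 45: (-1, 1), 0: (0, 1), 315: (1, 1)}

def cooccurrence_2x2(edges, offset):
    """Count the bit pairs (00, 01, 10, 11) between each pixel and its offset neighbour."""
    d_row, d_col = offset
    rows, cols = len(edges), len(edges[0])
    matrix = [[0, 0], [0, 0]]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[edges[r][c]][edges[r2][c2]] += 1
    return matrix

def cooccurrence_features(edges):
    """Concatenate the four 2x2 matrices into a 16-element feature vector."""
    features = []
    for angle in (90, 45, 0, 315):
        matrix = cooccurrence_2x2(edges, OFFSETS[angle])
        features.extend(matrix[0] + matrix[1])
    return features
```

For a 3 × 3 sub-region containing a single vertical edge, `[[0, 1, 0], [0, 1, 0], [0, 1, 0]]`, the 90° matrix is diagonal because every pixel matches the one above it, whereas the other three directions only produce mixed pairs (01 and 10).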

Support Vector Machines (SVM), proposed by (Vapnik, 1999), have yielded excellent results in various data classification tasks, including people detection (Papageorgiou and Poggio, 2000). The SVM algorithm uses structural risk minimization to find the hyperplane that optimally separates two classes of objects. We use it to classify each candidate as either pedestrian or non-pedestrian. The global training strategy is carried out in two stages. In the first stage, separate SVM-based classifiers are trained using individual training sets, each representing a subset of a sub-region. Each SVM classifier produces an output between -1 (non-pedestrian) and +1 (pedestrian). Accordingly, this stage provides a classification of the individual parts of the candidate sub-regions. In the second stage, the outputs of all classifiers are merged by a simple classifier that makes a decision based on a majority criterion in order to provide the final classification result. Each candidate classified as pedestrian is then dynamically tracked by a Kalman filter, which decreases the false negative rate.
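The second-stage decision can be illustrated with a minimal majority vote over the first-stage outputs. The decision threshold of 0 and the handling of ties (a tie is treated as non-pedestrian) are assumptions of this sketch. The text only specifies that a majority criterion is used.

```python
def majority_vote(outputs, threshold=0.0):
    """Return True (pedestrian) if more than half of the SVM outputs exceed the threshold.

    outputs: first-stage scores in [-1, +1], one per sub-region classifier.
    """
    votes = sum(1 for score in outputs if score > threshold)
    return votes > len(outputs) / 2
```

With six sub-region classifiers, at least four positive outputs are therefore required under these assumptions. For instance, `majority_vote([0.8, 0.4, -0.2, 0.9, 0.1, -0.7])` accepts the candidate, while three positive and three negative outputs reject it.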

4 RESULTS

The system was implemented on a Pentium IV at 2.4 GHz running the Knoppix Linux operating system. With an image resolution of 320×240 pixels, the complete algorithm runs at an average rate of 20 frames/s, depending on the number of pedestrians being tracked and their position. Specifically, the average rate depends strongly on the number of correlated points because of the computational cost of the correlation, which consumes 80% of the whole processing time.

Table 1: SVM classification results.

| Feature extractor | Distributed SVM: detection rate | Distributed SVM: false positive rate | Distributed SVM: false negative rate | Holistic SVM: detection rate | Holistic SVM: false positive rate | Holistic SVM: false negative rate |
|---|---|---|---|---|---|---|
| Co-occurrence over normalized image | 0.7437 | 0 | 0.2563 | 0.7789 | 0 | 0.2211 |
| Co-occurrence over Canny image | 0.8643 | 0 | 0.1357 | 0.8593 | 0.0653 | 0.0754 |
| Canny image | 0.7940 | 0 | 0.2060 | 0.7236 | 0 | 0.2764 |
| Magnitude orientation | 0.7236 | 0 | 0.2764 | 0.7136 | 0 | 0.2864 |
| Normalized Orientation Histogram | 0.9246 | 0 | 0.0754 | 0.8894 | 0.0402 | 0.0704 |
| Texture Unit Number | 0.8593 | 0 | 0.1407 | 0.7136 | 0 | 0.2864 |

The candidate selection system has proved to be robust under various illumination conditions, in different scenes and at distances up to 25 m, achieving a false-negative rate of practically 0% after Kalman filtering. Once the selection of pedestrians as candidates is guaranteed, the false-positive rate is expected to be corrected by the SVM classifier.

We created a database containing 1000 samples of pedestrians and non-pedestrians in different situations. The number of pedestrian samples in the training sets was chosen to be similar to the number of non-pedestrian samples. The samples were extracted from recorded images acquired in real experiments onboard a road vehicle under real traffic conditions. All training sets were created in daytime conditions using the TSetBuilder tool (Nuevo, 2005), specifically developed in this project for this purpose. With the TSetBuilder tool, different candidate regions are manually selected in the image on a frame-by-frame basis. Special attention was given to the selection of non-pedestrian samples. If we select simple non-pedestrian examples (for instance, road regions), the system learns very quickly but does not develop enough discriminating capability in practice, as the attention mechanism can select a region of the image that is very similar to a pedestrian without actually being one. The training of all SVM classifiers was performed using the free-licence LibTorch libraries for Linux. We obtained different detection rates depending on the feature extractor, as shown in Table 1, on a test set containing 500 images. The performance of the single-frame recognition process is largely increased by using multiframe validation. The probability of a candidate region being classified as pedestrian is modelled as a Bayesian random variable. Accordingly, its value is recomputed at each frame as a function of the outputs provided by the single-frame classifier and by the Kalman filter used for pedestrian tracking. Figure 6 shows an example of pedestrian detection and tracking.

5 CONCLUSION

We have developed a binocular multi-frame pedestrian classification system based on Support Vector Machines (SVM). The learning process has been simplified by decomposing the candidate regions into 6 local sub-regions that are easily learned by individual SVM classifiers. The complete classifier can be regarded as a hierarchical one. Over the same data set, the distributed approach has yielded a higher detection rate than the holistic version of the classifier for five of the six feature extractors (Table 1). The results achieved to date with a set of 1000 samples are encouraging. Nevertheless, they still need to be improved before the system can be safely used as a driver assistance system onboard road vehicles in real traffic conditions. For this purpose, the content of the training sets will be largely increased by including new and more complex samples that will boost the classifier performance, in particular when dealing with difficult cases. We aim at enhancing the classifier's ability to discriminate those cases by incorporating thousands of them in the database. In addition, the attention mechanism will be refined in order to provide more candidates around the original candidate region. This will reduce the number of candidate regions that only contain a part of a pedestrian, i.e., those cases in which the pedestrian is not completely visible in the candidate region due to a misdetection of the attention mechanism. Finally, a gait recognition process will be introduced in order to enhance the shape-based pedestrian detection algorithm.


Figure 6: Pedestrian detection and tracking in a sequence of images.

ACKNOWLEDGMENT

This work has been funded by Research Projects

CICYT DPI2002-04064-05-04 and FOM2002-002

(Ministerio de Fomento, Spain).

REFERENCES

Boufama, B. (1994). Reconstruction tridimensionnelle en vision par ordinateur: Cas des cameras non etalonnees. PhD thesis, Institut National Polytechnique de Grenoble, France.

Chiu, S. (1994). Fuzzy model identification based on cluster estimation. J. of Intelligent and Fuzzy Systems, vol. 2, no. 3, pp. 267-278.

Fuerstenberg, K. C., Dietmayer, K. J., and Willhoeft, V. (2002). Pedestrian recognition in urban traffic using a vehicle based multilayer laserscanner. In Proc. IEEE Intelligent Vehicles Symposium, Versailles, France, June 2002.

Gavrila, D. M., Giebel, J., and Munder, S. (2004). Vision-based pedestrian detection: The protector system. In Proc. IEEE Intelligent Vehicles Symposium, pp. 13-18, Parma, Italy, June 14-17.

Grubb, G., Zelinsky, A., Nilsson, L., and Rilbe, M. (2004). 3D vision sensing for improved pedestrian safety. In Proc. IEEE Intelligent Vehicles Symposium, pp. 19-24, Parma, Italy.

Labayrade, R., Royere, C., Gruyer, D., and Aubert (2003). Cooperative fusion for multi-obstacles detection with use of stereovision and laser scanner. In International Conference on Advanced Robotics, pp. 1538-1543.

Mohan, A., Papageorgiou, C., and Poggio, T. (2001). Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 4.

Nuevo, J. (2005). Testbuilder tutorial. Technical report, 2005. ftp://www.depeca.uah.es/pub/vision/SVM/manual.pdf.

Papageorgiou, C. and Poggio, T. (2000). A trainable system for object detection. Intl J. Computer Vision, Vol. 38, No. 1, pp. 15-33.

Shashua, A., Gdalyahu, Y., and Hayun, G. (2004). Pedestrian detection for driving assistance systems: single-frame classification and system level performance. In Proc. IEEE Intelligent Vehicles Symposium, pp. 1-6, Parma, Italy.

Sotelo, M. A., Nuevo, J., Bergasa, L. M., and Ocaña, M. (2005). Road vehicle recognition in monocular images. Submitted to ISIE 2005, Dubrovnik, Croatia, June 2005.

Vapnik, V. (1999). The nature of statistical learning theory. Springer Verlag.

Wang, L. (1990). Texture unit, texture spectrum and texture analysis. IEEE Transactions on Geosciences and Remote Sensing, Vol. 28, No. 4, pp. 509-512 (90-19).

Xu, G. and Zhang, Z. (1996). Epipolar Geometry in Stereo, Motion and Object Recognition: A Unified Approach. Kluwer Academic Publishers, Dordrecht, Boston, London, 1st edition.
