
 
Then the agents are updated with their target. For an agent placed at the bottom of the environment, the initial target is the top row. In the CUDA program each agent is treated as a single thread, so the total number of threads launched is equal to the total number of agents present in the environment. First, it must be ensured that enough threads can be launched to cover all the agents in the environment; the threads are then arranged into grids and blocks. In each simulation step a kernel function is launched which carries out the agent movement.
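A possible launch configuration is sketched below; the block size of 256 threads and the names (moveAgentsKernel, d_grid, d_agents) are illustrative assumptions, not taken from the original implementation.

    // One thread per agent; round the grid size up so every agent is covered.
    const int BLOCK_SIZE = 256;                       // assumed block size
    int numBlocks = (numAgents + BLOCK_SIZE - 1) / BLOCK_SIZE;

    for (int step = 0; step < numSteps; ++step) {
        // One kernel launch per simulation step carries out the agent movement.
        moveAgentsKernel<<<numBlocks, BLOCK_SIZE>>>(d_grid, d_agents,
                                                    numAgents, width, height);
        cudaDeviceSynchronize();
    }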
Inside the kernel function, before carrying out the LEM algorithm, an agent checks whether it is placed in the target column. If so, it verifies whether the immediate cell in the forward direction is empty. If that cell is empty, no further calculation is performed and the agent moves forward; otherwise, a cell is selected using the LEM. The pseudo code is shown below:
if (target_col == present_col && forward_cell == 0)
    move one cell forward
else
    calculate LEM
No movement is observed when all neighbouring 
cells are occupied. 
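The same check can be written as a hedged CUDA sketch of the per-agent kernel. The Agent structure, the encoding of empty cells as 0 and every identifier below are assumptions made for illustration; tryMove() and selectCellWithLEM() are hypothetical helpers (the first is sketched under the discussion of atomic operations, the second stands in for the LEM calculation).

    struct Agent { int row, col, targetCol, direction; };   // assumed layout

    __device__ bool tryMove(int *grid, Agent *agents, int id,
                            int destRow, int destCol, int width);
    __device__ void selectCellWithLEM(int *grid, Agent *agents, int id,
                                      int width, int height);

    __global__ void moveAgentsKernel(int *grid, Agent *agents, int numAgents,
                                     int width, int height)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id >= numAgents) return;                  // guard surplus threads

        Agent a = agents[id];
        int forwardRow = a.row + a.direction;         // one cell towards the target row
        if (forwardRow < 0 || forwardRow >= height) return;

        if (a.col == a.targetCol && grid[forwardRow * width + a.col] == 0) {
            // In the target column with an empty forward cell: skip the LEM.
            tryMove(grid, agents, id, forwardRow, a.col, width);
        } else {
            // Otherwise select a neighbouring cell using the LEM.
            selectCellWithLEM(grid, agents, id, width, height);
        }
    }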
4.1.1  Challenges in Implementation 
The primary motivation for implementing the model on the GPU was to speed up the simulation while emulating a situation in which the pedestrians are capable of decision making. The decisions of each agent are considered independent of the others. The biggest challenge in implementing the algorithm was to keep the total number of agents the same, without any loss. As mentioned, each agent is launched as a single thread and all of them execute in parallel, so a situation could arise in which two agents try to access the same environment position at the same time and an agent could be lost. To avoid this kind of scenario, the atomic features of CUDA are used.
In an atomic operation, when one thread is operating on a particular memory location, residing either in global or shared memory, it cannot be interfered with by the operations of other threads. An atomic function performs a read-modify-write operation on a 32- or 64-bit word residing in global or shared memory. Inside the kernel function, the agent first finds the neighbouring empty cells, calculates the distance of each empty cell from the target and arranges them in ascending order. A random number is then generated which decides the cell to be selected. Once the cell is chosen, the movement of the agent is carried out by calling the atomicExch() function. In this way the number of agents is kept intact in every simulation step and no agent (memory) gets overwritten during the operation.
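The exact exchange pattern is not given, only that atomicExch() is used for the move. The sketch below shows one way such a move can be made loss-free: it uses atomicCAS() to claim the destination cell only if it is still empty and atomicExch() to release the source cell. The grid encoding (0 for empty, agent index + 1 for occupied) and the Agent structure are the same assumptions as in the earlier sketch; this is not the authors' exact code.

    // Hypothetical cell-claiming helper.
    __device__ bool tryMove(int *grid, Agent *agents, int id,
                            int destRow, int destCol, int width)
    {
        int dest = destRow * width + destCol;
        // Write (id + 1) into the destination only if it still holds 0 (empty).
        int old = atomicCAS(&grid[dest], 0, id + 1);
        if (old != 0)
            return false;                             // another agent claimed it first

        int src = agents[id].row * width + agents[id].col;
        atomicExch(&grid[src], 0);                    // release the old cell
        agents[id].row = destRow;                     // update the agent's position
        agents[id].col = destCol;
        return true;
    }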
4.1.2  Generation of Random Numbers 
Inside the kernel function, after the agent calculates the distances of the available neighbouring empty cells to the target and arranges them in ascending order, a random number is generated. Random numbers are drawn from a normal distribution, and a separate random number is obtained for each agent. This is achieved using the cuRAND library, which comes with the CUDA SDK. Before the start of the simulation, a setup kernel is launched once for the total number of pedestrians present in the environment. Inside this setup kernel the cuRAND application programming interface (API) curand_init() is called, which initialises a seeded generator state for each thread that is later used by the main random number generator. The random number from the normal distribution is generated using the curand_normal() API, which returns a value with a mean of 0 and a standard deviation of 1; the result is multiplied by 3 to obtain the desired standard deviation.
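A minimal sketch of this two-kernel pattern is given below, assuming a per-agent array of curandState objects and otherwise illustrative names; the seed handling and the scaling factor of 3 follow the description above.

    #include <curand_kernel.h>

    // Setup kernel: launched once before the simulation, one state per pedestrian.
    __global__ void setupRandomStates(curandState *states,
                                      unsigned long long seed, int numAgents)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < numAgents)
            // Same seed, different sequence per thread gives independent streams.
            curand_init(seed, id, 0, &states[id]);
    }

    // Called from the movement kernel: each agent draws its own random number.
    __device__ float drawScaledNormal(curandState *states, int id)
    {
        // curand_normal() has mean 0 and standard deviation 1;
        // multiplying by 3 gives the desired standard deviation.
        return 3.0f * curand_normal(&states[id]);
    }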
5 SPEEDUP 
Simulation of the LEM for a large number of agents is computationally intensive. The primary motivation for using the GPU to implement the LEM algorithm is to accelerate the simulation process. The simulation time is measured for both the CPU and GPU implementations; for the GPU, only the simulation is timed and memory transfers are excluded. As mentioned, the GPU used is an NVIDIA® GEFORCE™ 560ti and the CPU used is an Intel Xeon E3-1280 (server grade).
For the GPU, the simulation time is measured using the cudaEvent functions. The computation times for both the CPU and GPU are given in Figure 4a and the speedup for the simulation is provided in Figure 4b. In Figure 4a the time in seconds is plotted along the Y-axis, with the number of agents along the X-axis. This speedup is measured using two groups of agents only (bi-directional flow).
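The timing pattern with CUDA events can be sketched as follows; the kernel call and loop are carried over from the earlier illustrative sketches, and memory transfers are deliberately left outside the timed region.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                           // begin timing the simulation only
    for (int step = 0; step < numSteps; ++step)
        moveAgentsKernel<<<numBlocks, BLOCK_SIZE>>>(d_grid, d_agents,
                                                    numAgents, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                       // wait for the recorded work to finish

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);    // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);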