Tuesday, January 3, 2012

Multi-Threaded Simulation in Marss

Since we started working on Marss (back in Jan '10), we have always wanted to take advantage of multi-core platforms for simulating those same multi-core platforms!! The frustrating part is that keeping the simulation cycle-accurate doesn't leave much room for parallelising execution, as we have to wait for every core to execute one cycle before any core can move on to the next.

Marss typically runs at around 200KIPS, so in one second it simulates 200,000 instructions. Assuming that Marss is running on a 2GHz machine, 1 second has approx. 2e9 (2 billion) clock ticks. Doing the simple math, it takes roughly 2e9/2e5 = 1e4 (10,000) host cycles to simulate 1 instruction. Some would say that's a lot, but consider that in each simulated cycle the simulator executes a 7-stage pipeline, collects a myriad of statistics and iterates through hundreds of objects. So if we execute each simulated core in a separate pthread, we need to synchronize the pthreads roughly every 10,000 host cycles, which ends up adding too much synchronization overhead.
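The back-of-the-envelope math above can be written down as a tiny helper. This is only an illustrative sketch; the function name `host_cycles_per_sim_instruction` is made up for this post and is not part of Marss.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Marss): host clock cycles spent per
// simulated instruction.
//   host_hz: host clock frequency in Hz (2e9 for a 2GHz machine)
//   sim_ips: simulated instructions per second (2e5 for 200KIPS)
std::uint64_t host_cycles_per_sim_instruction(std::uint64_t host_hz,
                                              std::uint64_t sim_ips) {
    assert(sim_ips > 0);  // avoid dividing by zero
    return host_hz / sim_ips;
}
```

Calling it with 2,000,000,000 and 200,000 gives the 10,000 host cycles quoted above, which is roughly the budget any per-cycle synchronization has to fit into.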

When we run a higher number of simulated cores (more than 8), the simulation slows down significantly because it executes one cycle of each core in serial fashion. That gave us the opportunity to use pthreads to spread the execution of many cores across different threads.

Overview of multi-threaded simulation in Marss

We have implemented a very basic multi-threaded simulation framework in Marss that allocates a fixed number of cores to each pthread and executes them in parallel to improve performance. The block diagram shows that we synchronize all the threads at each cycle to keep the simulation cycle-accurate. Because of that, the synchronization overhead was very high when we used the standard pthread barrier. To minimize this overhead we have implemented a custom barrier that takes advantage of hardware cache coherence. Using this custom barrier showed a significant improvement in performance. As shown in the block diagram, the master thread executes the memory hierarchy and IO events, so each pthread is only responsible for simulating 'core' models. There is still an opportunity to execute private caches (for example private L1 caches) in the pthread allocated to that core, but we haven't had time to implement it. Also, if we drop the 'cycle-accurate' constraint, we can optimize this further by allowing cores to be out-of-sync for a fixed number of cycles. There will be many more optimizations in the future, but for now, with minimal changes to Marss, we are able to run simulations in multi-threaded mode.
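To give a feel for what a barrier that leans on cache coherence can look like, here is a minimal sketch of a spinning barrier. It is an assumption about the general technique, not the code in the 'mt-sim' branch: the class name `SpinBarrier` and its members are hypothetical. Waiting threads spin on a shared generation counter, so they only read from their own cache until the last thread to arrive writes to it.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical spinning barrier (not the Marss implementation).
class SpinBarrier {
public:
    explicit SpinBarrier(int num_threads)
        : num_threads_(num_threads), waiting_(0), generation_(0) {
        assert(num_threads > 0);
    }

    // Spin until all num_threads_ threads have called wait() for this round.
    void wait() {
        // Remember which round we are in before announcing our arrival.
        const unsigned gen = generation_.load();
        if (waiting_.fetch_add(1) + 1 == num_threads_) {
            // Last thread to arrive: reset the counter for the next round,
            // then bump the generation to release the spinning threads.
            waiting_.store(0);
            generation_.fetch_add(1);
        } else {
            // Busy-wait on a read-only load; the cache line is only
            // invalidated when the last thread writes the new generation.
            while (generation_.load() == gen) {
            }
        }
    }

    // Number of rounds completed so far.
    unsigned generation() const { return generation_.load(); }

private:
    const int num_threads_;
    std::atomic<int> waiting_;
    std::atomic<unsigned> generation_;
};
```

In a setup like the one described above, each pthread would simulate one cycle for its cores and then call `wait()` before starting the next cycle. Spinning burns a host core per thread, so this trade-off only makes sense when there are at least as many host cores as simulation threads.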

Show me the Numbers!!
To test the performance improvement in multi-threaded mode we used the following setup:
  • Benchmarks: bodytrack, canneal, ferret, and raytrace from Parsec
  • Machine: Octal core HT Intel Xeon E5520 running @ 2.27GHz
  • Simulation Configuration: 8, 12, 16, 20 and 24 Out-of-Order cores, where each core has a private L2 cache and MESI based cache coherence
  • Simulation duration: Stop after 1 billion instructions

All the speedup numbers are relative to single-threaded simulation mode. For 8 and 12 cores we ran simulations with 2, 3, and 4 threads, including the master thread. As shown in the graph below, the 8 core configuration with 3 threads shows a performance improvement of around 1.8x.

For 12 cores, using 4 pthreads doesn't show much performance improvement except for 'bodytrack'. The reason for this behavior is the ratio of idle cores to active cores. In 'bodytrack' all the cores are executing a benchmark thread, whereas in the other benchmarks not all the cores are executing the benchmark all the time; some of them are in the 'cpu_idle' loop.

The graph shows that for 16 cores, going up to 5 threads (3 cores on each of 4 pthreads and 4 cores on 1 pthread) gives around 2x performance improvement.
Going to 6 pthreads in the 20 core configuration reduces performance compared to 5 threads in 'ferret' and 'raytrace'. We haven't had enough time to find out what's causing this behavior.
For 24 cores, 'bodytrack' and 'raytrace' show a 2.5x performance improvement with 6 pthreads!! For 'canneal' and 'ferret', the 6 and 7 pthread configurations show a performance hit compared to the 5 pthread configuration, but they still show around 2x speedup compared to single-threaded runs.
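The 16 core, 5 thread split mentioned above (3 cores on four pthreads and 4 cores on one) is what you get from spreading the cores as evenly as possible. A minimal sketch of that arithmetic follows; the function `cores_per_thread` is hypothetical, and the actual allocation policy in Marss may differ (for example in which thread receives the extra core).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Marss): spread num_cores simulated cores
// over num_threads threads as evenly as possible. The first
// (num_cores % num_threads) threads each take one extra core.
std::vector<int> cores_per_thread(int num_cores, int num_threads) {
    assert(num_cores >= 0 && num_threads > 0);
    const int base = num_cores / num_threads;   // cores every thread gets
    const int extra = num_cores % num_threads;  // leftover cores to hand out
    std::vector<int> counts(num_threads, base);
    for (int i = 0; i < extra; ++i) {
        counts[i] += 1;
    }
    return counts;
}
```

For 16 cores and 5 threads this yields one thread with 4 cores and four threads with 3 cores each.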

Get the source code
The source code is available in the 'mt-sim' branch of the github repo. Check it out using:
$ git fetch origin
$ git checkout -t origin/mt-sim

To enable multi-threaded simulation use the following simconfig option:
-threaded-sim N # Where N is the number of pthreads
Please remember that this is alpha-level code, so be ready to hit bugs, and always report them!!
