Friday, December 16, 2011

Synchronizing multiple simulation instances

A couple of months ago we introduced a new feature, 'syncing multiple simulation instances' (currently available in the 'features' branch). This feature lets users run multiple instances of Marss that synchronize every given number of clock ticks. This blog post explains why we implemented it.

Super-Duper fast Server Issue

We were running multiple instances of Marss on the same machine, one running the server and the other running the clients. Marss simulation speed varies a lot with the type of load, and we noticed that when we ran the server and client together, their simulated clocks drifted apart, as shown in the graph below.
As the graph shows, after 500 seconds of simulation the 'server' has executed around 80 million cycles whereas the 'client' has executed only 40 million cycles, a gap of 40+ million cycles. This gap keeps increasing as the simulations keep running. Because of this large difference, the 'server' appears to run at more than twice the speed of the 'client', and the request timings measured by the 'client' are unrealistic: some requests received their response within half a nanosecond. In real life a difference between two machines does exist, but not this large: high-end servers mostly run at around 3 to 3.2 GHz and normal clients (laptops/desktop machines) run at around 2 to 2.8 GHz. This behavior made our study of server and client benchmarks flawed, so we needed to fix the relative speed of each simulation instance.
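The size of the mismatch can be checked with a few lines of arithmetic. This is only a sketch using the rounded figures quoted above; the helper name `clock_ratio` is made up for illustration, and the "realistic" ratio pairs the fastest server clock with the slowest client clock mentioned in the text.

```python
def clock_ratio(fast, slow):
    """Return how many times faster the 'fast' side is than the 'slow' side."""
    return fast / slow

# Rounded cycle counts read off the graph after 500 seconds of simulation.
server_cycles = 80_000_000
client_cycles = 40_000_000

drift = server_cycles - client_cycles            # gap in simulated cycles
observed = clock_ratio(server_cycles, client_cycles)

# Largest ratio suggested by the real-world clock ranges in the text:
# a 3.2 GHz server against a 2.0 GHz client.
realistic_max = clock_ratio(3.2, 2.0)

print(f"drift: {drift} cycles")
print(f"observed ratio: {observed:.2f}x, realistic upper bound: {realistic_max:.2f}x")
```

Even with the most pessimistic pairing of real machines, the simulated 'server' is running further ahead of the 'client' than real hardware would.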

Synchronizing Simulations

Once we realized the issue with this setup, we looked into several options for keeping the instances in sync so that the clocks don't drift too far apart when we run simulations for a long time. The first thing I tried was limiting the maximum number of cycles executed in a given time frame. The problem with this technique was that it throttled every instance even when it was capable of running faster, and our total simulation time increased by more than 2x.
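To see why a fixed cap is costly, here is a rough model. It is not the code that was used in Marss: the function name and the cap value are hypothetical, and the native rate is simply the rounded 'server' figure from earlier (about 80 million cycles in 500 seconds).

```python
def run_time_seconds(total_cycles, native_rate, cap_rate=None):
    """Wall-clock seconds needed to simulate total_cycles.

    native_rate: cycles per second the instance can reach on its own.
    cap_rate:    optional upper limit on cycles per second (the throttle).
    """
    rate = native_rate if cap_rate is None else min(native_rate, cap_rate)
    return total_cycles / rate

TOTAL_CYCLES = 80_000_000
SERVER_RATE = 160_000   # ~80M cycles / 500 s, rounded from the graph
CAP_RATE = 70_000       # hypothetical cap, chosen below the slower instance's rate

unthrottled = run_time_seconds(TOTAL_CYCLES, SERVER_RATE)
throttled = run_time_seconds(TOTAL_CYCLES, SERVER_RATE, CAP_RATE)
slowdown = throttled / unthrottled

print(f"unthrottled: {unthrottled:.0f} s, throttled: {throttled:.0f} s, "
      f"slowdown: {slowdown:.2f}x")
```

A cap has to be low enough that the slowest instance can always keep up, so a fast instance spends most of its time idle even during phases where nothing is lagging behind.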

So I decided to give SysV semaphores a try, using them as a barrier to sync multiple processes as described in a previous blog post. The implementation was very simple: each instance is allowed to execute a fixed number of cycles between barriers. We ran some simulations with different interval sizes and found that a barrier every 200K cycles was good enough to reduce the clock drift between instances while minimizing the effect on simulation speed.
Here is the resulting graph after running simulations with the sync feature. As the image shows, both the 'server' and the 'clients' now execute at the same frequency, so the request times measured by the 'clients' are realistic.
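The barrier idea can be sketched in a few lines. Note the assumptions: this is not the Marss implementation, it uses threads and Python's `threading.Barrier` in place of separate processes and SysV semaphores, and the function name, step sizes, and the small interval are all made up so the example runs quickly.

```python
import threading

def run_synced(steps_per_instance, sync_interval, total_cycles):
    """Run one thread per instance, meeting at a barrier every sync_interval cycles.

    steps_per_instance maps an instance name to the number of cycles it
    advances per loop iteration (a stand-in for its simulation speed).
    Returns the final clocks and the largest drift seen between instances.
    """
    barrier = threading.Barrier(len(steps_per_instance))
    clocks = {name: 0 for name in steps_per_instance}
    stats = {"max_drift": 0}
    lock = threading.Lock()

    def instance(name, step):
        while clocks[name] < total_cycles:
            target = clocks[name] + sync_interval
            while clocks[name] < target:
                with lock:
                    # Advance this instance's clock, never past the next barrier.
                    clocks[name] = min(target, clocks[name] + step)
                    drift = max(clocks.values()) - min(clocks.values())
                    stats["max_drift"] = max(stats["max_drift"], drift)
            # Wait here until every instance has finished the interval.
            barrier.wait()

    threads = [threading.Thread(target=instance, args=item)
               for item in steps_per_instance.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, stats["max_drift"]

# Hypothetical speeds: the 'server' advances twice as fast as the 'client'.
clocks, max_drift = run_synced({"server": 8, "client": 4},
                               sync_interval=200, total_cycles=2000)
print(clocks, "max drift:", max_drift)
```

Because no instance can pass a barrier until all of them have reached it, the drift can never exceed one sync interval, while a fast instance still runs at full speed inside each interval. That is the trade-off behind picking the interval size: smaller intervals mean tighter clocks but more time spent waiting.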

To use this feature, pass the '-sync N' simconfig option to each simulation instance that you want to run in synchronization, where N is the number of cycles to execute between syncs.
