Experiences with Marss and Beyond

Tuesday, May 1, 2012

Using R to Analyze, Manipulate and Generate graphs from Periodic Statistics

As most of you might know that MARSS provides a simple interface to generate periodic dump of selected statistics counters. You can easily input this periodic stats file into Excel or similar application and generate graphs or process the data according to your requirements. But many times, due to long simulation runs, the size of this data grows significantly large and most of the spreadsheet applications crashes or become too slow to handle it. Recently I ran into similar issue while generating some results for our research paper where I needed to plot the data from CSV file and also perform some simple equations on selected columns. Then I discovered 'The R-Project' which provides a great environment to plot and manipulate large amount of data. This blog will show you how to use R to generate periodic graphs and perform simple mathematics operations on one or more columns.

Periodic graph of L2 Cache Miss Ratio generated using R

Generating Periodic Stats file

In MARSS code call 'enable_periodic_dump()' on the StatObject, then compile and give -time-stats-logfile TIME_STATS_FILENAME and -time-stats-period 100K options in your simconfig file to dump value of your stats counter for every 100K cycle interval. The output file is in plain CSV format and contains sim_cycle count as first column and one stats counter per column.

Reading a file and Plotting a simple graph

To start with R, install the core packages using apt-get or yum and invoke the interactive shell by typing 'R' in your terminal. To read in the CSV file using 'readfile.csv()' function:

> data <- read.csv(YOUR_CSV_FILENAME)

In 'R' assigning a value to variable is done using '<-' symbol. Now 'data' variable contains all our columns and can be accessed using following two ways:

> data[["sim_cycle"]]

> data$sim_cycle

Above two command will print the "sim_cycle" column from our periodic statistics file. Now lets assume that your periodic stats file has columns for L2 cache read-hit. You can easily print the L2 cache read-hit column using following command:

> data[["base_machine.L2_0.cpurequest.count.hit.read.hit"]]

Printing this data on commandline is not so useful (!) so lets plot it:

> plot(data[["base_machine.L2_0.cpurequest.count.hit.read.hit"]])

This will show a plot with default plotting options printing one 'dot' for each datapoint. The x-axis of this plot is the row number for each data-point, but for analysis we want to see L2 cache read-hit with sim_cycle counter. Use following command to plot L2 cache read-hit count with sim_cycle on x-axis:

> plot(data[["sim_cycle"]], data[["base_machine.L2_0.cpurequest.count.hit.read.hit"]])

Simple Operations on Multiple Columns

Once you understand how to access various data columns in R, its pretty simple to perform some simple mathematics operations on one or more column. For example to generate L2 cache miss rate plot:

> l2_read_hit <- data[["base_machine.L2_0.cpurequest.count.hit.read.hit"]]

> l2_write_hit <- data[["base_machine.L2_0.cpurequest.count.hit.write.hit"]]

> l2_read_miss <- data[["base_machine.L2_0.cpurequest.miss.read"]]

> l2_write_miss <- data[["base_machine.L2_0.cpurequest.miss.write"]]


# Calculate total hit/miss and total access

> l2_hit <- l2_read_hit + l2_write_hit

> l2_miss <- l2_read_miss + l2_write_miss

> l2_access <- l2_hit + l2_miss


# Now generate miss ratio and plot it

> l2_miss_ratio <- l2_miss / l2_access

> plot(data[["sim_cycle"]], l2_miss_ratio, type="l")

# type="l" will print lines instead of dots

Above example shows how easy it is to generate periodic graph of L2 miss ratio for your simulation results using periodic stats and R.

EDIT : Here is a sample R script that generates IPC, Cache miss rate and Cache miss ratio graphs from periodic dump files.

References for R

The R-Project
Detailed introduction to R environment and programming

Monday, March 12, 2012

Fast Forwarding and Simpoint support in MARSS

One of the most requested feature in MARSS has been fast-forwarding N instructions before starting simulation. Recently we have developed some hooks into QEMU's translation logic to count number of instructions emulated. With this logic we are now able to support fast-forwarding and simpoint in MARSS. Both of these features are available in features branch right now.

Fast-Forwarding
Following new simconfig options are added for fast-forwarding support in MARSS.

-fast-fwd-insns N: Fast-forward N number of total instructions. This includes user level and kernel level instruction across all emulated CPUs. After specified limit is reached it will switch to simulation mode.
-fast-fwd-user-insns N: Fast-forward N number of user level instructions. This mode will emulate kernel level instructions but doesn't count them. After specified user level instructions are executed it will switch to simulation mode and it will simulate both user and kernel level instructions.
-fast-fwd-checkpoint CHK_NAME: Create a checkpoint named 'CHK_NAME' after fast-forwarding specified amount of instructions. It will kill simulation instance after a checkpoint is created.

As mentioned above the default behavior is to switch to simulation mode after specified instructions are emulated. Because of non-deterministic behavior of kernel level execution in emulation mode we also support creating a checkpoint after fast-forwarding so users can use fast-forwarding only once and then rely on checkpoint to start simulation at specific RIP everytime.

For multicore emulation/simulation, fast-forwarding specified amount of instruction is little complicated because many times only few of CPUs are executing any code and others are in idle mode. At first MARSS equally divides total number of instructions to emulate and allocate them to all CPUs in machine. When part of those CPUs are in idle mode then MARSS will re-allocate portion of remaining instruction count from idle CPUs to non-idle CPUs to reach to specified instructions limit.

Simpoint
Simpoints has been one of the most used and reliable method in computer-architecture research to simulate part of applications that is representative of full application run. We decided to implement support for Simpoints to evaluate performance of different applications with real-hardware, but more on that later. MARSS uses simpoint file to create checkpoints after specified instructions are emulated. To create checkpoints based on simpoints and how to use weights file and mstats.py to calculate weighted IPC please refer to wiki page on Simpoints.

Counting Emulated Instructions
QEMU's emulation engine - TCG is designed to be fast which first convert instruction into list of micro-instructions and then convert these micro-instructions into binary buffer that can be executed and re-executed without any change. This makes little bit difficult to directly count the number of instructions emulated.

Recently we developed support for Simpoints in which we need to count number of instructions emulated and create a checkpoint after specific number of instructions. During this development we figured out a way to add hooks into each translated binary block to count number of instructions emulated at run-time. We added a counter to each CPU context which is set to number of instructions a CPU context is allowed to execute. In each translated block we added a hook that decrements CPU context's instruction counter by number of instructions in the block. When the counter reaches to 0 we switch to simulation mode or create a checkpoint based on user's configuration options.

Thursday, February 2, 2012

Major Bug fixe and -Wall flag

Recently we have pushed many changes related to bug fixes and code cleanup. Normally I don't blog about releasing bug-fixes patch on any branch but these ones deserve a special post.

The first big fix is related to CPU ticks and clock value when simulation is started. When we switch to simulation mode we slow down the cpu ticks and clock based on simulation speed and user configured machine speed (using -corefreq parameter). During the switching we store the current cpu ticks and clock offset and based on sim_cycle (number of simulated cycles) we report the CPU ticks when 'rdtsc' is executed. But by default MARSS was not setting the cpu ticks offset correctly and due to this when a CPU context is restored from a checkpoint sometimes clock value is much lower or too higher than previous ticks and it crashes the VM kernel. After the fix now we restore correct cpu ticks and clock offset to prevent such kernel crashes. You can find the patch here.

Second big fix is not related to any specific bug but its more related to code cleanup. Now we have enabled -Wall compilation flag by default in simulator code. After enabling this flag gcc produced thousands of lines of warning in simulator code. We have removed all the warnings and clean up the code to be more reliable. After the cleanup I ran some simulations that were crashing earlier in VM mode, and they are now running without any issue!! By eliminating compiler warnings we have removed bugs that were really difficult to hunt (and we never knew they existed). This fix is a series of patches applied on 'features', 'qemu' and 'mt-sim' branches. Please pull the changes from any of this branch and merge with your local repository.

We encourage you to test out these changes and report any issues. We are running sanity check on 'features' branch and once we are confidence about these changes we will release 0.3 version soon.

PS: We have also added support for two new instructions INS and OUTS.

Tuesday, January 3, 2012

Multi-Threaded Simulation in Marss

Since we started working on Marss (back in Jan '10) we always wanted to take advantage of multi-core platforms for simulating the same multi-core platforms!! It was so frustrating that if we want to keep cycle-accurate simulations features it doesn't leave us too much room for parallelising execution as we have to wait for every core to execute one cycle.

Marss typically runs around 200KIPS so in one second it executes/simulates 200,000 instructions. Assuming that Marss is running on 2GHz machine, 1 second has approx. 2e9 (2 billion) clock ticks. Doing the simple math it takes roughly 2e9/2e5 = 1e4 (10,000) cycles to simulate 1 instruction. Someone would say that's a lot but consider that in each cycle simulator will execute 7 stage pipeline, collect myriad of statistics and iterate through hundreds of objects. So if we execute each simulated core in a separate pthread then we need to synchronize each pthread at every 10,000 cycles which ends up having too much synchronization overhead.

When we run higher number of simulated cores (more than 8) then simulation shows significant slowdown as it execute one cycle in each core in serial fashion. So we got the opportunity of using pthreads to parallelise the execution of many cores into different threads.

Overview of multi-threaded simulation in Marss

We have implemented a very basic multi-threaded simulation framework in Marss that allocates fix number of cores to each pthread and execute them in parallel to improve performance. The blog diagram shows that we synchronize all the threads at each cycle to maintain the cycle-accurate feature. Because of that the synchronization overhead was much higher when we used standard pthread-barrier. To minimize this overhead we have implemented a custom barrier that takes advantage of hardware cache-coherence. Using this custom barriers showed significant improvement in performance. As shown in the block diagram the master thread executes memory hierarchy and IO events so each pthread is only responsible to simulate 'core' models. There is still an opportunity to execute private caches (for example private L1 caches) in allocated pthread of that core but we haven't got much time to implement it. Also if we get rid of 'cycle-accurate' constrain then we can optimize this further to allow cores to be out-of-sync for fix number of cycles. There will be many more optimization done in future but for now with minimum changes to Marss, we are now able to run simulations in multi-threaded mode.

Show me the Numbers!!
To test the performance improvement in multi-threaded mode we have used following setup:

Benchmarks: bodytrack, canneal, ferret, and raytrace from Parsec
Machine: Octal core HT Intel Xeon E5520 running @ 2.27GHz
Simulation Configuration: Out-of-Order cores ranging from 8, 12, 16, 20 and 24 where each core has a private L2 cache and MESI based cached coherence
Simulation duration: Stop after 1 billion instructions

All the speedup numbers are compared to single threaded simulation mode. For 8 and 12 core we ran simulations with 2, 3, and 4 threads including master thread. As shown in the graph below for 8 core using 3 threads shows performance improvement of around 1.8x.

For 12 core using 4 pthreads doesn't show much performance improvement except 'bodytrack'. The reason for this behavior is due to ratio of ideal cores to active cores. In 'bodytrack' all the cores are executing benchmark thread where as in other benchmarks not all the cores are executing benchmark all the time, some of them are in 'cpu_ideal' loop.

The graph shows that for 16 cores going up-to 5 threads (3 cores for 4 pthreads and 4 cores for 1 pthread) gives around 2x performance improvement.

Going to 6 pthreads in 20 core configuration reduces the performance compared to 5 threads in 'ferret' and 'raytrace'. We haven't got enough time to find whats causing this behavior.

For 24 cores 'bodytrack' and 'raytrace' shows 2.5x performance improvement for 6 pthreads!! For 'canneal' and 'ferret' 6 and 7 pthread configuration shows performance hit compared to 5 pthread configuration, but it still shows around 2x speedup compared to single thread runs.

Get the source code
Source code is available in 'mt-sim' branch of github repo. Checkout using:

 $ git fetch origin
$ git checkout -t origin/mt-sim'

To enable multi-threaded simulation use following simconfig option:

-threaded-sim N # Where N is the number of pthread

Please remember that this is alpha level code so be ready to hit the bugs and always report them!!

Friday, December 16, 2011

Synchronizing multiple simulation instances

Couple of months ago we introduced a new feature 'syncing multiple simulations instances' (available in 'features' branch right now). This feature allows users to run multiple instances of Marss that synchronize at given number of clock ticks. This blog post talks about what was the reason to implement such a feature.

Super-Duper fast Server Issue

We were running multiple instances of Marss on same machine, one running server and other running clients. Marss simulation speed varies a lot based on type of load so we noticed that when we run server and client together they drift apart in terms of simulated clocks as shown in the graph below.

As shown in the graph after 500 seconds of simulation the 'server' has executed around 80mil cycles where as 'client' has executed only 40mil cycles, a gap of 40+ million of cycles. This gaps keeps increasing as we keep running the simulations. Due to this high difference between two simulations it seems that 'server' is running with more than twice the speed as 'client' and all the request timing measured by 'client' are unrealistic as some of them posted request response within half nano-second. In real life the difference between two machines do exists but not this much - high end servers mostly run around 3 to 3.2 GHz and normal clients (laptops/desktop machines) run around 2 to 2.8 GHz speed. So due to this behavior our study of server and client benchmarks was flawed and we needed to fix the relative speed of each simulation instance.

Synchronizing Simulations

Once we realized the issue with this setup we looked into multiple options to keep the multiple instances in sync so the clock don't drift apart too much as we run simulations for long time. First thing I tried was to limit the maximum number of cycles to execute in given time frame. The issue with this technique was when each instance is capable of running faster even then we were limiting the speed and our total simulation time was increased by more than 2x.

So I decided to give SysV Semaphores a try as described in previous blog post to sync multiple processes using semaphores as barrier. The implementation was very simple, each instance is allowed to execute fix number of cycles between each barrier. So we ran some simulations with different interval size and found out that 200K cycles barrier was good enough to reduce the clock drift between each instance while minimizing the effect on simulation speed.

Here is the resulting graph after running simulations with sync feature. As shown in the image now both the 'server' and 'clients' execute with same frequency so request time measured by 'clients' are now realistic.

To use this feature provide '-sync N' simconfig option to each simulation instance that you want to run in synchronization. Here the N is number of cycles to execute between each sync.

Tuesday, October 4, 2011

Introducing 'features' and 'qemu' branches

After 0.2 release, now we are focusing on next stage of Marss developments. While working on 0.2, we did pushed couple of branches on github (core-models, newstats, qemu-0.14 etc.) and it made things confusing for lot of users. To make things more standardized I have pushed two new branches 'features' and 'qemu' on github.

Features Branch
All new features that are not tested extensively will be first published on 'featured' branch so everyone will have early access to whats coming in next release. At the time of writing this blog, I have pushed new feature 'sync' that will synchronize between multiple instances of Marss to make sure their clocks are not too much out-of-phase. Currently we don't have any To-Do list for next release but we are excited about what users want from Marss. So please leave a comment or send email to let us know which feature you want in next release.

Qemu Branch
One of the good thing about QEMU is the ongoing development. More and more people are contributing to QEMU project. To make things more efficient in integrating work from QEMU to Marss, we are announcing this new 'qemu' branch. This branch will merge changes from latest QEMU version. Currently it has updated QEMU 0.15 release. Changes/updates to this branch will be first merged into 'features' branch and will merge to 'master' when we do a new release.

Do a 'git pull' and checkout these both branches for latest and greatest of Marss. Don't forget to submit bug reports!!

Tuesday, September 27, 2011

Creating a Barrier between Processes with SysV Semaphore

Running deterministic simulations gets too much complicated when you are dealing with more than one processes. In each simulation run you need to make sure that processes are assigned to same CPUs all the time and kernel scheduler does not introduce any randomness. For single process with multiple threads its easy to synchronize them using thread level synchronizations like mutex, barriers etc.. But with multiple processes things get little complicated without proper allocation of resources to each process and its threads. Also the point when we create a checkpoint is important because we need to make sure that all resource allocation by kernel is done and benchmarks are ready to start their Region-Of-Interest (ROI).

In this post I'll explain how to use System V IPC mechanism to create a barrier between multiple processes. The general idea is to wait on a barrier before each thread starts executing ROI and once all threads reach this point we create a checkpoint and run our simulations from this checkpoint only. The trick is to use SysV semaphores as a barrier across processes. It provides a semaphore operation mode where each process/thread wait for semaphore value to become 0. In short we initialize a semaphore to total number of threads and each thread will decrement the semaphore and wait till its value reaches to 0.

Create/Get a Semaphore
First we create a semaphore using semget. As described in man page, if we pass the key parameter to a specific value it will create a process shared semaphore which can be accessed by other processes using semget.

static int SEM_ID = 786;

int get_semaphore ()
{
    int sem_id;

    sem_id = semget(SEM_ID, 1, IPC_CREAT | 0666);

    if (sem_id == -1) {
        perror("get_semaphore: semget");
        exit(1);
    }

    return sem_id;
}

As highlighted in line 7 we use a common SEM_ID to get the semaphore. get_semaphore function returns the semaphore id which will be used in later function to set semaphore value and perform operations on it.

Set Semaphore's Value
Once we get the semaphore id then we call semctl to set the value of the semaphore. Remember that we must call this function only once.

int set_semaphore (int sem_id, int val)
{
    return semctl(sem_id, 0, SETVAL, val);
}

Decrement Semaphore
Now when a thread/process reaches the start of ROI region they should decrement the semaphore value and wait till its value becomes zero. To decrement the semaphore we use semop function. semop uses a pointer to a sembuf structure which specifies the operation to perform on semaphore.

void decrement_semaphore (int sem_id)
{
    struct sembuf sem_op;

    sem_op.sem_num  = 0;
    sem_op.sem_op   = -1; /* <-- Specify decrement operation */
    sem_op.sem_flg = 0;

    semop(sem_id, &sem_op, 1);
}

As highlighted in line 6 we set sem_op variable to -1 which will tell the kernel to perform atomic decrement operation on the semaphore.

Wait for other Threads/Processes
Now we just have to wait till all the processes reach to our barrier. For that we again use semop function. This time we set sem_op variable's value to 0 to tell the kernel that wake-up this thread/process once semaphore value is 0.

void wait_semaphore (int sem_id)
{
    struct sembuf sem_op;

    sem_op.sem_num  = 0;
    sem_op.sem_op   = 0;
    sem_op.sem_flg = 0;

    semop(sem_id, &sem_op, 1);
}

This function call will block until all the threads have decremented the semaphore and its value is 0.

I have put all these function into a single file: sem_helper.h.

Putting it all Together
Now we create a small application (set_semaphore.cpp) that will create and initialize the semaphore to specific value.

#include 
#include 
#include 
#include 

#include "sem_helper.h"
#include "ptlcalls.h"

using namespace std;

int main (int argc, char** argv)
{
    int sem_id;
    int sem_val;
    int rc;

    if (argc < 2) {
        cout << "Please specify the initial semaphore value.\n";
        return 1;
    }

    /* Get semaphore */
    sem_id = get_semaphore();

    /* Retrive semaphore value from command line arg */
    sem_val = atoi(argv[1]);

    /* Set semaphore value */
    rc = set_semaphore(sem_id, sem_val);
    if (rc == -1) {
        perror("set_semaphore: semctl");
        return 1;
    }

    /* Now wait for semaphore to reach to 0 */
    wait_semaphore (sem_id);

    /* All threads will be in ROI so either create a checkpoint or
     * switch to simulation mode. */
    ptlcall_checkpoint_and_shutdown("checkpoint_1");

    return 0;
}

Compile this code into an application and run it as shown below. (We run it in background mode because once all threads reach to barrier we call the 'ptlcall' to create a checkpoint.)

$ ./set_semaphore NUM_THREADS &

In next step we modify our benchmarks to use this semaphore functions to wait at the barrier. Just before ROI begins write the following code.

#include "sem_helper.h"

void sync_all_processes ()
{
    int sem_id;
    int rc;

    /* Get semaphore */
    sem_id = get_semaphore();

    /* Decrement semaphore value */
    decrement_semaphore (sem_id);

    /* Now wait till semaphore value reaches 0 */
    wait_semaphore (sem_id);
}

Tip: For Parsec Benchmarks you can modify the hook to call the sync_all_processes.