Page cache perils

I am sharing my experience benchmarking the Voldemort server, which is effectively a Java key-value storage server application on Linux (referred to as the 'server' from here on). The application handles put(k,v) and get(k) like a standard HashMap; the only differences are that these calls go over the network and the entries need to be persistent. What this post talks about is generic and could apply to most Java server applications.

The goal here was to run a workload against the server such that requests are served off disk, so that we exercise the worst-case path. As easy as that may sound, certain things get in our way:

  • OS page cache - Linux caches the files the server reads and writes (unless you do direct I/O and build your own cache on top of it).
  • JVM memory management - the JVM heap shares the same physical memory as the page cache, and you don't have very fine control over how much memory it will take.
  • This interplay is tricky, as we will see below, when it comes to actually controlling the page cache size (I know you are not supposed to) so that our test parameters are met.
  • In my case, we have a BDB JE cache on top of the JVM, which makes the page cache pointless, since we are caching the same data twice. But that's a separate issue; let's leave it out for now.
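To see how much memory the page cache is taking during a run, one option is to read /proc/meminfo. The sketch below is only an illustration: the helper names parse_meminfo and page_cache_bytes are made up here, and treating Cached + Buffers as "the page cache" is an approximation (Cached also counts shared memory and tmpfs pages).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Values carrying a "kB" unit are converted to bytes; fields without a
    unit (such as HugePages_Total) are kept as raw counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def page_cache_bytes(info):
    """Approximate page cache size as Cached + Buffers (an assumption)."""
    return info.get("Cached", 0) + info.get("Buffers", 0)
```

On a Linux box you would feed it the contents of /proc/meminfo, for example page_cache_bytes(parse_meminfo(open("/proc/meminfo").read())), and sample that before and during the workload.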


Let's use some notation for the different knobs/parameters that play together here.

RAMSIZE - actual physical memory on the machine
JVMMAX - maximum size of the JVM heap, configured using -Xmx
JVMSTART - starting size of the JVM heap, configured using -Xms
JVMUSED - actual memory that the JVM uses

Let's explore our options.



Generate lots of data



This is a practical option, since if you generate several times more data than RAMSIZE, a large portion of your workload will come off disk. The problem is that on modern servers with close to 100GB of RAM (plus slow SAS disks), this process takes a long time. For example, with roughly uniform access across keys, you need to generate about 1TB of data to make sure that only 10% of the workload comes off cache on average.
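The 1TB figure follows from simple arithmetic, sketched below. This assumes (my assumption, not something measured) that keys are accessed uniformly at random and that all of RAM ends up acting as cache, so the expected cache hit fraction is roughly cache size divided by data size. The function name required_data_bytes is hypothetical.

```python
def required_data_bytes(cache_bytes, max_cache_hit_fraction):
    """Data size needed so that at most the given fraction of requests
    is served from cache, assuming uniform random access over the data."""
    if not 0 < max_cache_hit_fraction <= 1:
        raise ValueError("max_cache_hit_fraction must be in (0, 1]")
    return cache_bytes / max_cache_hit_fraction


# 100GB of RAM acting as cache, at most 10% of requests served from it:
ram_size = 100 * 10**9
data_size = required_data_bytes(ram_size, 0.10)  # about 1TB
```

A skewed workload (a few hot keys) would hit the cache far more often than this estimate, so treat the result as a lower bound on how much data to generate.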


Use a memory hogger (see mlock()) to clamp down RAMSIZE - JVMMAX
The idea is to lock up a portion of the RAM, virtually shrinking it, so that Linux feels the shortage of memory and does not populate the page cache as much.
But the JVM does not consume JVMMAX bytes right away, so JVMMAX - JVMUSED remains available for Linux to use as it pleases.
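A memory hogger along these lines can be sketched as follows. This is a minimal illustration, not the tool I used: hog_memory is a made-up name, and the mlock() call goes through ctypes and will fail (returning False here) if the process's RLIMIT_MEMLOCK is too low or it lacks the privilege to lock that much memory.

```python
import ctypes
import ctypes.util
import mmap


def hog_memory(num_bytes, lock=True):
    """Map num_bytes of anonymous memory, touch every page, and optionally
    mlock() it so the kernel cannot swap it out.

    Returns (region, locked). Keep a reference to region for as long as
    the memory should stay hogged.
    """
    region = mmap.mmap(-1, num_bytes,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for offset in range(0, num_bytes, mmap.PAGESIZE):
        region[offset] = 1  # touching a page forces it to be backed by RAM
    locked = False
    if lock:
        # find_library may return None; CDLL(None) still exposes libc on Linux
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        libc.mlock.restype = ctypes.c_int
        addr = ctypes.addressof(ctypes.c_char.from_buffer(region))
        locked = libc.mlock(addr, num_bytes) == 0
    return region, locked
```

You would call it with the number of bytes to take away from Linux (RAMSIZE - JVMMAX in this option) and then keep the process alive, for example by sleeping, for the duration of the benchmark.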

Set JVMSTART=RAMSIZE
The idea here is to force the JVM heap to be as large as the RAM, leaving little space for the page cache.
But, as we saw before, the JVM does not actually occupy that much physical memory, even if you explicitly set the start size, since heap pages are only backed by RAM once they are touched.

Hog up RAMSIZE - JVMUSED
The idea is to measure the actual amount of memory the server uses for the workload with a dry run, and then lock up the rest using a hogger.
But the JVM thinks that it is crunched for memory (let's not go deep into GC tuning here) and starts GCing a lot, potentially affecting the experiment.
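For the dry run, one rough way to estimate JVMUSED is the resident set size (VmRSS) that Linux reports in /proc/<pid>/status for the server process. The sketch below assumes that VmRSS is a good enough proxy (it covers more than the heap, and includes any file-backed pages the process has mapped); rss_bytes, hog_bytes, and the headroom parameter are hypothetical names introduced for illustration.

```python
def rss_bytes(status_text):
    """Extract VmRSS (resident set size) in bytes from the text of
    /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) * 1024  # the value is reported in kB
    raise ValueError("no VmRSS line found")


def hog_bytes(ram_size, jvm_used, headroom=0):
    """Bytes to hand to the hogger: RAMSIZE - JVMUSED, minus optional
    headroom left for the OS and other processes. Never negative."""
    return max(0, ram_size - jvm_used - headroom)
```

Sampling rss_bytes a few times during the dry run and taking the maximum gives a more conservative JVMUSED than a single reading.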

Tune GC so you won't GC
The idea (a bit far-fetched) is to tune the server's GC settings so that it won't GC in the case above.
But it's quite hard to do in practice, and what if you want to run different workloads and your GC settings are workload specific?

Disable page cache altogether
The idea is to periodically run sync; echo 1 > /proc/sys/vm/drop_caches, which drops the clean pages in the page cache (the sync first writes dirty pages out so that they can be dropped too).
But this takes up some CPU, and if the server is CPU bound, you might be altering the experiment.
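A periodic dropper can be sketched like this. It needs root to write to /proc/sys/vm/drop_caches, and the function names, interval, and round count are placeholders of mine. The kernel accepts 1 (page cache), 2 (dentries and inodes), or 3 (both) for this file.

```python
import os
import time

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_page_cache(path=DROP_CACHES_PATH, mode=1):
    """Flush dirty pages to disk, then ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # dirty pages cannot be dropped until they are written out
    with open(path, "w") as f:
        f.write(f"{mode}\n")


def drop_periodically(interval_seconds, rounds, path=DROP_CACHES_PATH):
    """Drop the page cache `rounds` times, sleeping between drops."""
    for _ in range(rounds):
        drop_page_cache(path)
        time.sleep(interval_seconds)
```

The shorter the interval, the less data survives in the page cache between drops, but the more CPU and I/O the dropper itself steals from the server.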



So, what's the verdict, you ask? :) Attempt one of the above if, in your particular scenario, the "buts" somehow do not apply; otherwise, simply generate lots of data.









