
Page cache perils

I am sharing my experience benchmarking the Voldemort server, effectively a Java key-value storage server application on Linux (referred to as the 'server' from here on). The application handles put(k,v) and get(k), like a standard HashMap; the only differences are that these are calls over the network and the entries need to be persistent. What this post talks about is generic and could apply to most Java server applications.

The goal here was to run a workload against the server in such a manner that it comes off disk, so that we exercise the worst-case path. As easy as that may sound, certain things get in the way:

  • OS page cache - Linux caches the files the server writes/reads (unless you do direct I/O and build your own cache on top of it).
  • JVM memory management - the JVM heap shares the same physical memory as the page cache, and you don't have very fine control over how much memory it will take.
  • This interplay is tricky, as we will see below, when it comes to actually controlling the page cache size (I know you are not supposed to) so that our test parameters are met.
  • In my case, we have a BDB JE cache on top of the JVM, which makes the page cache pointless, since we are caching the same data twice. But that's a separate issue; let's leave it out for now.

Let's use some notation for the different knobs/parameters that play together here.

RAMSIZE - actual physical memory on the machine
JVMMAX - maximum size of the JVM heap, configured using -Xmx
JVMSTART - start size of the JVM heap, configured using -Xms
JVMUSED - actual memory that the JVM uses
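
To watch how these quantities interact during a run, it helps to see how much of RAMSIZE Linux is currently using as page cache. Here is a minimal sketch, assuming a Linux machine where /proc/meminfo exposes the MemTotal and Cached fields in kB; the function names are my own, not part of any tool mentioned in this post.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def page_cache_share(fields):
    """Fraction of physical memory (MemTotal) currently reported as Cached."""
    return fields["Cached"] / fields["MemTotal"]
```

Calling page_cache_share(read_meminfo()) before and during the benchmark gives a rough idea of whether the page cache is actually being kept small.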

Let's explore our options.

Generate lots of data

This is a practical option, since if you generate several times more data than RAMSIZE, a large portion of your workload will come off disk. The problem is that with modern servers (+ slow SAS disks) with close to 100GB of RAM, this process takes a long time, i.e. you need to generate 1TB of data to make sure that only 10% of the workload will come off cache on average.
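
The 1TB figure is simple arithmetic. The sketch below assumes keys are accessed uniformly at random and that all of RAMSIZE ends up acting as cache; a skewed workload would see a higher cache hit fraction. The function names are illustrative only.

```python
def expected_cache_fraction(ram_gb, data_gb):
    """Rough fraction of reads served from cache under uniform random access."""
    return min(1.0, ram_gb / data_gb)


def data_size_needed(ram_gb, cache_fraction):
    """Data set size (GB) so that about cache_fraction of reads come off cache."""
    if not 0 < cache_fraction <= 1:
        raise ValueError("cache_fraction must be in (0, 1]")
    return ram_gb / cache_fraction
```

For example, data_size_needed(100, 0.10) gives roughly 1000 (GB), which matches the 100GB RAM and 1TB data numbers above.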

Use a memory hogger (see mlock()) to clamp down RAMSIZE - JVMMAX
The idea is to clamp down a portion of the RAM, virtually shrinking it, so that Linux feels the shortage of memory and does not populate the page cache as much.
But the JVM does not consume JVMMAX bytes right away, hence JVMMAX - JVMUSED is available for Linux to use as it pleases.
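
A memory hogger can be sketched as follows. This is not the tool I used, just a minimal illustration that assumes Linux, calls the C library's mlock() through ctypes, and needs a large enough RLIMIT_MEMLOCK (or root) to pin anything sizeable. hog_size_bytes and lock_memory are hypothetical helper names.

```python
import ctypes
import mmap
import os


def hog_size_bytes(ram_bytes, jvm_max_bytes):
    """Bytes to pin so that roughly JVMMAX is left over: RAMSIZE - JVMMAX."""
    return max(0, ram_bytes - jvm_max_bytes)


def lock_memory(num_bytes):
    """Map num_bytes of anonymous memory and pin it in RAM with mlock().

    Returns the mmap object; the memory stays pinned for as long as the
    caller keeps this object alive and open.
    """
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process, incl. libc
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int

    region = mmap.mmap(-1, num_bytes)  # anonymous read/write mapping
    addr = ctypes.addressof(ctypes.c_char.from_buffer(region))
    if libc.mlock(addr, num_bytes) != 0:
        err = ctypes.get_errno()
        region.close()
        raise OSError(err, os.strerror(err))
    return region
```

A hogger process would call lock_memory(hog_size_bytes(ramsize, jvmmax)) and then sleep for the duration of the benchmark, holding on to the returned object.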

The idea here is to force the JVM heap to be as large as the RAM, leaving little space for the page cache.
But, as we saw before, the JVM does not actually allocate this much, even if you explicitly set the start size.

The idea is to measure the actual amount of memory the server uses for the workload (JVMUSED) with a dry run, and then clamp down the rest using a hogger.
But the JVM then thinks that it's crunched for memory (let's not go deep into GC tuning here) and starts GCing a lot, potentially affecting the experiment.
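
One way to get JVMUSED from the dry run is to sample the server process's resident set size. The sketch below assumes Linux, where /proc/<pid>/status carries a VmRSS line in kB; the helper names and the headroom parameter are my own illustration.

```python
def parse_vm_rss_kb(status_text):
    """Return the VmRSS value (kB) from /proc/<pid>/status-style text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_vm_rss_kb(pid):
    """Read the current resident set size (kB) of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())


def hog_size_kb(ram_kb, peak_rss_kb, headroom_kb=0):
    """Memory to clamp down: RAMSIZE - JVMUSED, minus optional headroom."""
    return max(0, ram_kb - peak_rss_kb - headroom_kb)
```

Sampling read_vm_rss_kb(server_pid) periodically during the dry run and taking the maximum gives a value to feed into hog_size_kb. Leaving some headroom may soften the memory pressure described above, though I have not measured by how much.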

Tune GC so you won't GC
The idea (a bit far-fetched) is to tune the server's GC settings so it won't GC in the case above.
But it's quite hard to do in practice, and what if you want to run different workloads and your GC settings are workload-specific?

Disable page cache altogether
The idea is to periodically run sync; echo 1 > /proc/sys/vm/drop_caches, which should flush all of the page cache.
But this takes up some CPU, and if the server is CPU-bound, you might be altering the experiment.
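
The periodic flush can be sketched like this. It assumes Linux and root privileges; per the kernel documentation, writing 1 to drop_caches frees the page cache, 2 frees dentries and inodes, and 3 frees both. The function names and the interval are placeholders.

```python
import os
import time


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages to disk, then ask the kernel to drop the clean page cache."""
    os.sync()  # drop_caches only discards clean pages, so write dirty ones out first
    with open(path, "w") as f:
        f.write("1\n")


def drop_periodically(interval_s, rounds, path="/proc/sys/vm/drop_caches"):
    """Drop the page cache `rounds` times, sleeping interval_s after each drop."""
    for _ in range(rounds):
        drop_page_cache(path)
        time.sleep(interval_s)
```

A shorter interval keeps the cache emptier but costs more CPU and I/O, which is exactly the trade-off noted above.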

So, what's the verdict, you ask? :) Attempt one of the above if, in your particular scenario, the "buts" somehow don't exist; otherwise, simply generate lots of data.

