Skip to main content

Memory allocation speed check


Traditionally, in high performance systems, repeatedly allocating and deallocating memory has been found to be costly. (i.e a malloc vs free cycle). Hence, people resorted to building their own memory pool on top of the OS, dealing with fragmentation/free list maintenance etc. One of the popular techniques to doing this being Slab allocators .

This post is about doing a reality check about the cost of explicitly doing an alloc() and free() cycle, given that most popular OS es, specifically Linux gotten better at memory allocation recently. Along the way, I will also compare the JVM memory allocations (which should be faster since we pay a premium for the freeing of memory via garbage collection).

So, all set here we go.  The following is a comparison of native c allocations, java jvm based allocation, java direct buffer allocation. For each of them we measure the following.

  • Allocation/free rate (rate/s): This gives you an upper bound on the single threaded throughput of your application. You cannot possibly do more operations than this, without building your own memory pools.
  • Cost of allocation in mircoseconds (cost-μs): This gives you a lower bound on the latency. You absolutely have to pay atleast this cost per operation without building your own memory pool.

I make a million allocations (except for 100kb case,where I do 100K allocations) varying the object size. The machine is a beefy linux server running Linux 2.6 kernel. For java, I use the following jvm options.

-server -Xms22g -Xmx22g -XX:+AlwaysPreTouch -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2

Swap is turned off and pretouch flag ensures jvm heap is warmed up.

Object size malloc()-rate/s malloc()-cost-μs free()-rate/s free()-cost-μs Java Heap alloc-rate/s Java Heap alloc-cost-μs Java Direct alloc-rate/s Java Direct alloc-cost-μs
1kb 2,451,064 0.40 8,685,833 0.11 3,992,191 0.25 268,517 3.72
10kb 616,889 1.62 2,928,069 0.34 743,676* 1.34 116,713 8.56
100kb 566,328 1.76 1,890,573 0.52 81,370* 12.28 20,201 49.5

The 10kb and 100kb runs for Java heap allocation had garbage collections, while direct allocation does not have any collections.
10kb run   : 5 ParNews, ~2ms
100kb run :  95 ParNews, 3ms avg, 30ms max
 Here are some observations.
  • In general, all allocation costs increase with object size. 
  • Small object allocations are faster in java as expected than malloc()
  • For some reason free() seems to be cheaper/faster than malloc(). Fragmentation tests will tell the real story.
  • Java heap allocations considerably slow down with larger objects, when GC kicks in. i.e under memory pressure. native c based allocation don't seem to degrade that drastically from 10kb to 100kb
  • Java heap allocations are always faster than direct allocations, which is expected.
  • Java direct allocations are supposed to be just a JNI malloc() call. But with 100kb, it seriously slows down and might become a real bottleneck. If you have smal objects mostly, you may be able to get away.

You can get the C program and java program used for this here : malloctest.c  AllocTest.java


Next, we will test access speed by basically copying memory (copy rate/s & copy cost-μs) and writing a specific index in the byte array (write rate/s & write cost-μs).
We will be using 100kb byte buffers and average across 1000 operations.

c-write rate/s 6590
c-write cost-μs 151.73
c-copy rate/s 19361
c-copy cost-μs 51.64
Java Heap-write rate/s 5932
Java Heap-write cost-μs 168.56
Java Heap-copy rate/s 30175
Java Heap-copy cost-μs 33.136
Java Direct-write rate/s 4513
Java Direct-write cost-μs 221.5
Java Direct-copy rate/s 30096
Java Direct-copy cost-μs 33.22

  • index updates are fastest with c and slowest in java direct buffer. 
  • No huge difference between accessing/copying direct buffers vs jvm heap byte[]
 Code used for these. accesstest.c AccessHeapTest.java AccessDirectTest.java

Next, we will test how the memory management schemes hold up against heavily fragmented workload. i.e rapid alloc/frees of randomly sized elements
<Coming Soon>




Comments

Popular posts from this blog

Thoughts On Adding Spatial Indexing to Voldemort

This weekend, I set out to explore something that has always been a daemon running at the back of my head. What would it mean to add Spatial Indexing support to Voldemort, given that Voldemort supports a pluggable storage layer.. Would it fit well with the existing Voldemort server architecture? Or would it create a frankenstein freak show where two systems essentially exist side by side under one codebase... Let's explore..

Basic Idea The 50000 ft blueprint goes like this.

Implement a new Storage Engine on top Postgres sql (Sorry innoDB, you don't have true spatial indexes yet and Postgres is kick ass)Implement a new smart partitioning layer that maps a given geolocation to a subset of servers in the cluster (There are a few ways to do this. But this needs to be done to get an efficient solution. I don't believe in naive spraying of results to all servers)Support "geolocation" as a new standard key serializer type in Voldemort. The values will still be  opaque b…

Learning Spark Streaming #1

I have been doing a lot of Spark in the past few months, and of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark streaming and even other stream processing systems in general, either presenting technical arguments/critiques, with any micro benchmarks as needed.

Some high level description of Spark Streaming (as of 1.4),  most of which you can find in the programming guide.  At a high level, Spark streaming is simply a spark job run on very small increments of input data (i.e micro batch), every 't' seconds, where t can be as low as 1 second.

As with any stream processing system, there are three big aspects to the framework itself.


Ingesting the data streams : This is accomplished via DStreams, which you can think of effectively as a thin wrapper around an input source such as Kafka/HDFS which knows how to read the next N entries from the input.The receiver based approach is a little complicated IMHO, and …

Setting up Hadoop/YARN/Spark/Hive on Mac OSX El Capitan

If you are like me, who loves to have everything you are developing against working locally in a mini-integration environment, read on

Here, we attempt to get some pretty heavy-weight stuff working locally on your mac, namely

Hadoop (Hadoop2/HDFS)YARN (So you can submit MR jobs)Spark (We will illustrate with Spark Shell, but should work on YARN mode as well)Hive (So we can create some tables and play with it) We will use the latest stable Cloudera distribution, and work off the jars. Most of the methodology is borrowed from here, we just link the four pieces together nicely in this blog. 
Download StuffFirst off all, make sure you have Java 7/8 installed, with JAVA_HOME variable setup to point to the correct location. You have to download the CDH tarballs for Hadoop, Zookeeper, Hive from the tarball page (CDH 5.4.x page) and untar them under a folder (refered to as CDH_HOME going forward) as hadoop, zookeeper


$ ls $HOME/bin/cdh/5.4.7 hadoop hadoop-2.6.0-cdh5.4.7.…