Traditionally, in high-performance systems, repeatedly allocating and deallocating memory (i.e., a malloc()/free() cycle) has been found to be costly. Hence, people resorted to building their own memory pools on top of the OS, dealing with fragmentation, free-list maintenance, etc. One of the popular techniques for doing this is the slab allocator.
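To make the idea concrete, here is a minimal sketch of a slab-style pool in Java (the class and method names are mine, purely for illustration):

```java
import java.util.ArrayDeque;

// Illustrative sketch only -- not from the test code used below.
// A fixed-size "slab" pool: hand out pre-allocated byte[] chunks and
// recycle them on release(), so the steady state does no allocation
// (and creates no garbage) at all.
final class SlabPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int slabSize;

    SlabPool(int slabSize, int slabCount) {
        this.slabSize = slabSize;
        for (int i = 0; i < slabCount; i++) {
            free.push(new byte[slabSize]);   // pay the allocation cost up front
        }
    }

    byte[] acquire() {
        byte[] slab = free.poll();
        return slab != null ? slab : new byte[slabSize]; // pool exhausted: fall back
    }

    void release(byte[] slab) {
        free.push(slab);                     // recycle instead of freeing
    }
}
```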
This post is a reality check on the cost of explicitly doing an alloc()/free() cycle, given that most popular OSes, specifically Linux, have gotten better at memory allocation recently. Along the way, I will also compare JVM memory allocations (which should be faster, since we pay the premium for freeing memory later via garbage collection).
So, all set, here we go. The following is a comparison of native C allocation, Java heap allocation, and Java direct buffer allocation. For each of them, we measure the following.
- Allocation/free rate (rate/s): This gives you an upper bound on the single-threaded throughput of your application. You cannot possibly do more operations than this without building your own memory pool.
- Cost of allocation in microseconds (cost-μs): This gives you a lower bound on the latency. You absolutely have to pay at least this cost per operation without building your own memory pool.
I make a million allocations (except for the 100kb case, where I do 100K allocations), varying the object size. The machine is a beefy server running a Linux 2.6 kernel. For Java, I use the following JVM options.
-server -Xms22g -Xmx22g -XX:+AlwaysPreTouch -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2
Swap is turned off, and the pretouch flag ensures the JVM heap is warmed up.
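For reference, the measurement loop is conceptually this simple. This is only a sketch of the approach; the actual AllocTest.java may differ in details:

```java
// Sketch of the measurement loop -- AllocTest.java may differ in details.
public class AllocSketch {
    public static void main(String[] args) {
        final int count = 1_000_000;       // 100K for the 100kb case
        final int size = 1024;             // object size under test
        byte[][] keep = new byte[count][]; // hold references so allocation isn't optimized away

        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            keep[i] = new byte[size];
        }
        long elapsedNs = System.nanoTime() - start;

        double costMicros = elapsedNs / 1000.0 / count;
        double ratePerSec = count / (elapsedNs / 1e9);
        System.out.printf("alloc cost: %.2f us, rate: %.0f/s%n", costMicros, ratePerSec);
    }
}
```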
Object size | malloc()-rate/s | malloc()-cost-μs | free()-rate/s | free()-cost-μs | Java Heap alloc-rate/s | Java Heap alloc-cost-μs | Java Direct alloc-rate/s | Java Direct alloc-cost-μs |
---|---|---|---|---|---|---|---|---|
1kb | 2,451,064 | 0.40 | 8,685,833 | 0.11 | 3,992,191 | 0.25 | 268,517 | 3.72 |
10kb | 616,889 | 1.62 | 2,928,069 | 0.34 | 743,676* | 1.34 | 116,713 | 8.56 |
100kb | 566,328 | 1.76 | 1,890,573 | 0.52 | 81,370* | 12.28 | 20,201 | 49.5 |
The 10kb and 100kb runs for Java heap allocation (marked with * above) had garbage collections, while direct allocation did not trigger any:

- 10kb run: 5 ParNews, ~2ms
- 100kb run: 95 ParNews, 3ms avg, 30ms max

Here are some observations.
- In general, all allocation costs increase with object size.
- Small-object allocations in Java are, as expected, faster than malloc().
- For some reason, free() seems to be cheaper/faster than malloc(). Fragmentation tests will tell the real story.
- Java heap allocations slow down considerably with larger objects once GC kicks in, i.e., under memory pressure. Native C allocations don't seem to degrade as drastically from 10kb to 100kb.
- Java heap allocations are always faster than direct allocations, which is expected.
- Java direct allocations are supposed to be just a JNI malloc() call. But at 100kb they slow down seriously and might become a real bottleneck. If you mostly have small objects, you may be able to get away with it (see the sketch below).
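To make the heap-vs-direct comparison concrete, here is a minimal sketch of the two allocation paths (class name, counts, and sizes are mine, not from AllocTest.java):

```java
import java.nio.ByteBuffer;

// Illustrative timing of the two Java allocation paths from the table.
// new byte[] stays on the GC-managed heap; ByteBuffer.allocateDirect()
// goes to native memory (roughly a malloc() plus bookkeeping under the
// hood), which is why the direct numbers above are so much worse.
public class DirectVsHeap {
    static final int COUNT = 10_000;
    static final int SIZE = 100 * 1024;

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < COUNT; i++) {
            byte[] b = new byte[SIZE];
            b[0] = 1;                                 // touch it so it isn't elided
        }
        long heapNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < COUNT; i++) {
            ByteBuffer b = ByteBuffer.allocateDirect(SIZE);
            b.put(0, (byte) 1);
        }
        long directNs = System.nanoTime() - t0;

        System.out.printf("heap:   %.2f us/alloc%n", heapNs / 1000.0 / COUNT);
        System.out.printf("direct: %.2f us/alloc%n", directNs / 1000.0 / COUNT);
    }
}
```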
You can get the C and Java programs used for this here: malloctest.c, AllocTest.java.
Next, we will test access speed by copying memory (copy rate/s & copy cost-μs) and by writing to a specific index in the byte array (write rate/s & write cost-μs). We will use 100kb byte buffers and average across 1000 operations.
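Here is a sketch of what these access tests look like. One assumption to flag: I read the "write" test as storing to each index of the 100kb buffer once per operation, which matches the ~150μs costs below; the class name and structure are illustrative:

```java
import java.nio.ByteBuffer;

// Sketch of the access tests (names and structure are illustrative).
// "write" stores to every index of a 100kb buffer per op; "copy" bulk-
// copies the whole 100kb buffer per op. Each is averaged over 1000 ops.
public class AccessSketch {
    static final int SIZE = 100 * 1024;
    static final int OPS = 1000;

    public static void main(String[] args) {
        byte[] heapSrc = new byte[SIZE];
        byte[] heapDst = new byte[SIZE];
        ByteBuffer direct = ByteBuffer.allocateDirect(SIZE);

        // "write": store to each index of the 100kb buffer once per op
        long t0 = System.nanoTime();
        for (int op = 0; op < OPS; op++)
            for (int i = 0; i < SIZE; i++) heapSrc[i] = (byte) op;
        System.out.printf("heap write:   %.2f us/op%n", (System.nanoTime() - t0) / 1e3 / OPS);

        t0 = System.nanoTime();
        for (int op = 0; op < OPS; op++)
            for (int i = 0; i < SIZE; i++) direct.put(i, (byte) op);
        System.out.printf("direct write: %.2f us/op%n", (System.nanoTime() - t0) / 1e3 / OPS);

        // "copy": bulk-copy the whole 100kb buffer once per op
        t0 = System.nanoTime();
        for (int op = 0; op < OPS; op++)
            System.arraycopy(heapSrc, 0, heapDst, 0, SIZE);
        System.out.printf("heap copy:    %.2f us/op%n", (System.nanoTime() - t0) / 1e3 / OPS);
    }
}
```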
Scheme | write rate/s | write cost-μs | copy rate/s | copy cost-μs |
---|---|---|---|---|
C | 6590 | 151.73 | 19361 | 51.64 |
Java Heap | 5932 | 168.56 | 30175 | 33.136 |
Java Direct | 4513 | 221.5 | 30096 | 33.22 |
- Index updates are fastest in C and slowest with Java direct buffers.
- There is no huge difference between accessing/copying direct buffers vs. JVM heap byte[] arrays.
Next, we will test how the memory management schemes hold up against a heavily fragmented workload, i.e., rapid allocs/frees of randomly sized elements.
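Pending those results, here is a rough sketch of the shape such a workload might take (purely illustrative; not the actual upcoming test):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Rough shape of a fragmentation workload: rapidly allocate and free
// randomly sized chunks so the allocator's free lists get churned.
// Purely illustrative; the actual upcoming test may differ.
public class FragmentationSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<byte[]> live = new ArrayList<>();

        for (int i = 0; i < 1_000_000; i++) {
            // allocate a randomly sized chunk between 64 bytes and ~100kb
            live.add(new byte[64 + rnd.nextInt(100 * 1024)]);

            // drop a random live chunk once the set is large enough,
            // leaving variable-size holes behind
            if (live.size() > 10_000) {
                int idx = rnd.nextInt(live.size());
                live.set(idx, live.get(live.size() - 1)); // swap-remove: O(1)
                live.remove(live.size() - 1);
            }
        }
        System.out.println("survivors: " + live.size());
    }
}
```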
<Coming Soon>