Skip to main content

Improving BDB JE Storage for Voldemort



I am writing this blogpost, mainly to share my experiences with improving the BDB storage engine for use in Voldemort and also throw light on how this relates our VaaS goal. As you drill into the details, I hope you get a clear idea of the efforts that have gone into making BDB JE work. I have tried to include pointers whenever possible.

Note : All of this is written in the context of SSDs. Hence, you will often find me ignoring IOPS completely, focusing on memory.



BDB GC Tuning on SSD 

The post on engineering blog delves to great details about GC issues we faced, when migrating to SSDs. The article mentions the pushing data off the heap, to play nicely with GC.  This is done by setting EVICT_LN cache mode for online traffic and EVICT_BIN cache mode for cursors.  However, to achieve this, we need a higher version of JE > 4.0.117.

We first evaluated BDB 5, which requires a non backwards compatible data conversion to be done on existing data. But we hit issues in the conversion script itself. Once, they were resolved, we hit memory issues and some runaway BDB cleaner issues.

This set us back to JE 4.1.17 since we were not comfortable with the stability of JE5. But we again suffered data bloat issues, due to changes related to how the RMW transaction in Voldemort put() plays with sorted duplicates. Effectively, a new DIN and DBIN (duplicate tree nodes you might recall from JE Architecture whitepaper) for every key, even if the key does not have duplicates. This again set us back.

Better Duplicate Support

Advice from JE folks was to redo the duplicate handling in Voldemort, which is what I am currently testing. After this, we should be able to push data off the JVM heap using the BDB cache for indexes only.

Not caching data might seem counterintuitive at first. But for VaaS to be a reality, we need an efficient way to house multiple stores with different data sizes on the same server without being choked by fragmentation. Applying the two flags greatly reduces promotion into old gen.

This is implemented and being tested.


Optimizing Scan Jobs 

Since we are redoing duplicates, we are also batching several more changes that plays well with Restore and Rebalance jobs. The approach we are taking is to prefix each key with a two byte partition id that could enable efficient range scans to be used for fetching specific partitions from a given server. (Power of the Btree!). This is coded up, perf tested and under review here

Donor based rebalancing in 0.96 already speeds up cluster expansion dramatically. With incorporation of this change into the rebalancing algorithm we expect further gains in this area.


The conversion scripts will be checked into master once we verified everything. The script is blazingly fast and does bulk loading into BDB.

JE Forums thread on large scans here.

 Non Intrusive DatacleanupJob

As noted in the GC blog, the DataCleanupJob perhaps deserves to be treated as a first class citizen. With the partition scan taking care of all other scan jobs, datacleanup remains the only scan where we loop across the entire dataset. We are looking into exploring ways to support the datacleanup using much efficient ways including BDB Secondary indexes. Obviously we need to trade off additional latency to maintain a secondary index vs how much trouble the cleanup actually causes with EVICT_BIN applied to the scan cursor (which seems to solve the problem in most cases)

It is worth mentioning that we did evaluate BDB5 DiskorderedScans as a viable alternative to implement the DataCleanupJob, since it provides a separate buffer for the scan without messing with the BDB cache. However, I hit two fundamental bugs- first was fixed and the second one is shelved. Hence, again BDB5 does not seem to be ready. Its too bad since perf tests show ~2x improvements in throughput and good reduction in 99th latency.

We are hoping the JE team will get around to fixing these issues and we can add DiskOrderedScan support for scan jobs.

Cache Partitioning & Capacity Model

As you might have noted, the points mentioned above tend to improve stability and predictability in the storage layer, which is absolutely necessary to move towards VaaS.


We have an lab verified capacity model that determines the amount of memory that needs to be allocated to a given store, to meet its SLA. Release 0.96 already contains the ability to reserve cache space for the stores you care most using the <memory-footprint> tag in stores.xml.

This model will be tuned as we rollout the different improvements mentioned above. Once we are reasonably comfortable, we will make it available to open source community. We will also be implementing similar resource allocation up the stack , say connections, threads.

Alternate Storage Engines :

High level, I find BDB JE to be a feature rich storage engine. But Voldemort probably has a simple k-v use case.
  • Hash based storage engines might be faster alternatives, while giving up the ability to efficiently scan partitions. We are contemplating writing a specialized storage engine for SSDs
  • My team at Linkedin is also doing a thorough investigation of different storage engines, to see if we can add another off-the-shelf option.
At present, we are working on this as a 20% project. I expect things to pick up steam once some cycles free up from the BDB improvement.

VaaS has a lot of hard problems to be solved and it could take significant amount of time to get them right. We are committed to open source and we will make these available as and when we are ready.

Comments

Popular posts from this blog

Thoughts On Adding Spatial Indexing to Voldemort

This weekend, I set out to explore something that has always been a daemon running at the back of my head. What would it mean to add Spatial Indexing support to Voldemort, given that Voldemort supports a pluggable storage layer.. Would it fit well with the existing Voldemort server architecture? Or would it create a frankenstein freak show where two systems essentially exist side by side under one codebase... Let's explore..

Basic Idea The 50000 ft blueprint goes like this.

Implement a new Storage Engine on top Postgres sql (Sorry innoDB, you don't have true spatial indexes yet and Postgres is kick ass)Implement a new smart partitioning layer that maps a given geolocation to a subset of servers in the cluster (There are a few ways to do this. But this needs to be done to get an efficient solution. I don't believe in naive spraying of results to all servers)Support "geolocation" as a new standard key serializer type in Voldemort. The values will still be  opaque b…

Setting up Hadoop/YARN/Spark/Hive on Mac OSX El Capitan

If you are like me, who loves to have everything you are developing against working locally in a mini-integration environment, read on

Here, we attempt to get some pretty heavy-weight stuff working locally on your mac, namely

Hadoop (Hadoop2/HDFS)YARN (So you can submit MR jobs)Spark (We will illustrate with Spark Shell, but should work on YARN mode as well)Hive (So we can create some tables and play with it) We will use the latest stable Cloudera distribution, and work off the jars. Most of the methodology is borrowed from here, we just link the four pieces together nicely in this blog. 
Download StuffFirst off all, make sure you have Java 7/8 installed, with JAVA_HOME variable setup to point to the correct location. You have to download the CDH tarballs for Hadoop, Zookeeper, Hive from the tarball page (CDH 5.4.x page) and untar them under a folder (refered to as CDH_HOME going forward) as hadoop, zookeeper


$ ls $HOME/bin/cdh/5.4.7 hadoop hadoop-2.6.0-cdh5.4.7.…

Learning Spark Streaming #1

I have been doing a lot of Spark in the past few months, and of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark streaming and even other stream processing systems in general, either presenting technical arguments/critiques, with any micro benchmarks as needed.

Some high level description of Spark Streaming (as of 1.4),  most of which you can find in the programming guide.  At a high level, Spark streaming is simply a spark job run on very small increments of input data (i.e micro batch), every 't' seconds, where t can be as low as 1 second.

As with any stream processing system, there are three big aspects to the framework itself.


Ingesting the data streams : This is accomplished via DStreams, which you can think of effectively as a thin wrapper around an input source such as Kafka/HDFS which knows how to read the next N entries from the input.The receiver based approach is a little complicated IMHO, and …