I am writing this blog post mainly to share my experiences improving the BDB storage engine for use in Voldemort, and also to throw light on how this relates to our VaaS goal. As you drill into the details, I hope you get a clear idea of the effort that has gone into making BDB JE work. I have tried to include pointers wherever possible.
Note: All of this is written in the context of SSDs. Hence, you will often find me ignoring IOPS completely and focusing on memory.
BDB GC Tuning on SSD
The post on our engineering blog goes into great detail about the GC issues we faced when migrating to SSDs. The article mentions pushing data off the heap to play nicely with the GC. This is done by setting the EVICT_LN cache mode for online traffic and the EVICT_BIN cache mode for cursors. However, to achieve this, we need a JE version higher than 4.0.117.
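To make that concrete, here is a minimal sketch of how those two cache modes are set with the JE API. The class and method names are illustrative, not actual Voldemort code, and the exact config surface should be checked against the JE version in use:

```java
import com.sleepycat.je.CacheMode;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;

// Illustrative configuration sketch, not actual Voldemort code.
public class CacheModeSketch {

    // Online traffic: cache only the Btree internal nodes; the data
    // record (LN) is evicted as soon as each operation completes.
    static Database openStore(Environment env, String name) {
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        dbConfig.setCacheMode(CacheMode.EVICT_LN);
        return env.openDatabase(null, name, dbConfig);
    }

    // Scans: evict the whole bottom internal node (BIN) once the cursor
    // moves past it, so a full scan cannot wipe out the hot working set.
    static Cursor openScanCursor(Database db) {
        Cursor cursor = db.openCursor(null, null);
        cursor.setCacheMode(CacheMode.EVICT_BIN);
        return cursor;
    }
}
```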
We first evaluated BDB 5, which requires a non-backwards-compatible conversion of existing data. But we hit issues in the conversion script itself. Once those were resolved, we ran into memory issues and some runaway BDB cleaner behavior.
This set us back to JE 4.1.17, since we were not comfortable with the stability of JE5. But we then suffered data bloat issues, due to changes in how the RMW transaction in Voldemort's put() interacts with sorted duplicates. Effectively, a new DIN and DBIN (the duplicate tree nodes you might recall from the JE Architecture whitepaper) were created for every key, even when the key had no duplicates. This set us back again.
Better Duplicate Support
The advice from the JE folks was to redo the duplicate handling in Voldemort, which is what I am currently testing. After this, we should be able to push data off the JVM heap, using the BDB cache for indexes only.
Not caching data might seem counterintuitive at first. But for VaaS to be a reality, we need an efficient way to house multiple stores with different data sizes on the same server without being choked by fragmentation. Applying the two cache-mode flags greatly reduces promotion into the old generation.
This is implemented and being tested.
Optimizing Scan Jobs
Since we are redoing duplicates, we are also batching in several other changes that play well with the Restore and Rebalance jobs. The approach we are taking is to prefix each key with a two-byte partition id, enabling efficient range scans for fetching specific partitions from a given server (the power of the Btree!). This is coded up, perf tested, and under review here
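To picture the idea, here is a small sketch of such key prefixing. Prepending a big-endian partition id makes all keys of one partition contiguous in Btree order, so a partition fetch becomes a single range scan. The method names are mine, not Voldemort's actual serialization code:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch of a two-byte partition-id key prefix.
public class PartitionPrefix {

    // Prepend the partition id (big-endian short) so Btree order
    // groups all of a partition's keys together.
    public static byte[] prefixKey(short partitionId, byte[] key) {
        return ByteBuffer.allocate(2 + key.length)
                         .putShort(partitionId)
                         .put(key)
                         .array();
    }

    // Recover the partition id from a stored key.
    public static short partitionOf(byte[] prefixedKey) {
        return ByteBuffer.wrap(prefixedKey).getShort();
    }

    // Recover the original key bytes.
    public static byte[] stripPrefix(byte[] prefixedKey) {
        return Arrays.copyOfRange(prefixedKey, 2, prefixedKey.length);
    }

    // Smallest possible key for a partition: where a range scan starts.
    public static byte[] partitionStartKey(short partitionId) {
        return ByteBuffer.allocate(2).putShort(partitionId).array();
    }
}
```

A partition fetch would then position a cursor at `partitionStartKey(p)` and iterate until the decoded partition id changes.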
Donor-based rebalancing in 0.96 already speeds up cluster expansion dramatically. With this change incorporated into the rebalancing algorithm, we expect further gains in this area.
The conversion scripts will be checked into master once we have verified everything. The script is blazingly fast and does bulk loading into BDB.
JE Forums thread on large scans here.
Non-Intrusive DataCleanupJob
As noted in the GC blog, the DataCleanupJob perhaps deserves to be treated as a first-class citizen. With the partition scan taking care of all other scan jobs, data cleanup remains the only scan that loops over the entire dataset. We are exploring ways to support it more efficiently, including BDB secondary indexes. Obviously, we need to weigh the extra latency of maintaining a secondary index against how much trouble the cleanup actually causes with EVICT_BIN applied to the scan cursor (which seems to solve the problem in most cases).
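One way to picture the secondary-index trade-off: an ordered index from timestamp to key lets the cleanup visit only the expired range instead of looping over the whole key space. The sketch below uses an in-memory TreeMap as a stand-in for a BDB secondary database; all names here are hypothetical, not Voldemort code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a timestamp-ordered "secondary index" so that
// cleanup only scans the expired range, not the entire dataset.
public class ExpiryIndexSketch {
    // timestamp -> keys written at that time (stands in for a BDB secondary db)
    private final TreeMap<Long, List<String>> byTimestamp = new TreeMap<>();

    // Every put() would also pay the cost of maintaining this index.
    public void recordWrite(String key, long timestampMs) {
        byTimestamp.computeIfAbsent(timestampMs, t -> new ArrayList<>()).add(key);
    }

    // Return (and drop from the index) every key strictly older than the
    // retention cutoff, touching only the expired prefix of the index.
    public List<String> expiredKeys(long cutoffMs) {
        List<String> expired = new ArrayList<>();
        Map<Long, List<String>> head = byTimestamp.headMap(cutoffMs, false);
        for (List<String> keys : head.values()) {
            expired.addAll(keys);
        }
        head.clear(); // the headMap view clears the backing index too
        return expired;
    }
}
```

The per-write maintenance cost on the left-hand side of this trade-off is exactly the added latency mentioned above.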
It is worth mentioning that we did evaluate BDB5 DiskOrderedScans as a viable alternative for implementing the DataCleanupJob, since they provide a separate buffer for the scan without messing with the BDB cache. However, I hit two fundamental bugs: the first was fixed and the second one is shelved. Hence, again, BDB5 does not seem to be ready. It's too bad, since perf tests show ~2x improvement in throughput and a good reduction in 99th percentile latency.
We are hoping the JE team will get around to fixing these issues so that we can add DiskOrderedScan support for scan jobs.
Cache Partitioning & Capacity Model
As you might have noted, the points mentioned above tend to improve stability and predictability in the storage layer, which is absolutely necessary to move towards VaaS.
We have a lab-verified capacity model that determines the amount of memory that needs to be allocated to a given store to meet its SLA. Release 0.96 already contains the ability to reserve cache space for the stores you care about most, using the <memory-footprint> tag in stores.xml.
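For illustration, a hedged sketch of what that reservation looks like in stores.xml. The store name and size here are made up, and the exact placement and units of the value should be checked against the 0.96 docs:

```xml
<store>
  <name>test-store</name> <!-- hypothetical store name -->
  <persistence>bdb</persistence>
  <!-- Reserve dedicated BDB cache space for this store;
       the value and its units are illustrative only -->
  <memory-footprint>2147483648</memory-footprint>
</store>
```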
This model will be tuned as we roll out the different improvements mentioned above. Once we are reasonably comfortable with it, we will make it available to the open source community. We will also be implementing similar resource allocation up the stack, for things like connections and threads.
Alternate Storage Engines
At a high level, I find BDB JE to be a feature-rich storage engine, but Voldemort has a fairly simple key-value use case.
- Hash-based storage engines might be faster alternatives, at the cost of giving up the ability to efficiently scan partitions. We are contemplating writing a specialized storage engine for SSDs.
- My team at LinkedIn is also doing a thorough investigation of different storage engines, to see if we can add another off-the-shelf option.
VaaS has a lot of hard problems to be solved, and it could take a significant amount of time to get them right. We are committed to open source, and we will make these available as and when we are ready.