Open Access Paper
15 January 2025

An efficient compaction scheme for LSM-based storage system via lightweight merges
Shuhuan Fan, Yubo Liu, Rui Xi, Zepei Mu, Mengshu Hou
Proceedings Volume 13513, The International Conference Optoelectronic Information and Optical Engineering (OIOE2024); 135134N (2025) https://doi.org/10.1117/12.3056894
Event: The International Conference Optoelectronic Information and Optical Engineering (OIOE2024), 2024, Wuhan, China
Abstract
With the development of information technology, Industry 4.0, and mobile applications, the volume of data has grown explosively, and NoSQL databases are widely used to support data storage and access services. To alleviate the write amplification problem of the LSM-Tree, the storage architecture these systems commonly build on, this paper proposes a novel merge scheme, LSMA-Tree, which combines the read-friendly layout of the leveling scheme with the write-friendly behavior of the tiering scheme. As in the leveling scheme, each level of the LSM-Tree keeps its files ordered over the whole key range. Furthermore, when merging upper-level files into lower-level files, LSMA-Tree adopts an "append" rather than a "reorganize" method, avoiding reads and rewrites of the lower-level files and thereby reducing the number of disk IOs. This paper also designs a new sstable file structure adapted to the proposed LSMA scheme, and implements and verifies it on LevelDB. Experimental results show that the proposed LSMA scheme effectively reduces write amplification and improves write performance by more than 50%, while the impact on read performance is less than 5%.

1. INTRODUCTION

With the rapid development of information technology and electronic hardware, modern society processes ever more data, unstructured and data-intensive scenarios appear more frequently, and NoSQL databases are increasingly common in industry. The LSM-Tree (Log-Structured Merge-Tree) is widely used in the storage engines of modern NoSQL database systems, such as LevelDB[1], RocksDB[2], Cassandra[3], HBase[4], etc. Among them, LevelDB is an open-source persistent key-value storage engine released by Google in 2012 and is still actively updated; its design is elegant, and RocksDB is derived from it. Research on storage engine optimization is therefore of great value.

The LSM-Tree is a hierarchical, ordered, disk-based data structure. As the most common storage architecture for key-value databases, it performs particularly well in write-heavy scenarios, so its optimization has become an important research direction. bLSM-Tree[5] introduced the Bloom filter[6] into the LSM-Tree for the first time; based on a hash structure, it can quickly determine whether a record might exist in a file, avoiding many unnecessary queries and thereby improving read performance. [7] proposed the idea of a block Bloom filter, which reduces the amount of filter data that must be read. Bloom filters do not support range queries, however, and the range filters proposed in [8] support them better. These improvements have greatly improved the read performance of the LSM-Tree. One of its most important features is that deletes and updates are not in-place operations but are written like new records, which gives data records a multi-version nature. Consequently, an LSM-Tree storage system accumulates a great deal of redundant and expired data, and cleaning up and merging this data incurs read and write overhead that seriously affects write performance in write-intensive scenarios. Works [9,10] reduce the cost of on-disk data reorganization by storing keys and values separately. When the value is much larger than the key, this method works well, but for small values the benefit is limited and read performance degrades. [11] improves write performance by reducing the frequency of on-disk data reorganization, but it adds a great deal of metadata compared to the original LSM-Tree, which makes the system much more complicated to implement.

Write performance, the main advantage of the LSM-Tree structure, is an evergreen topic in database storage engine research, so studying its optimization is of great significance. Based on this, this paper proposes a new LSMA scheme that optimizes LSM-Tree writes while minimizing the impact on read performance, implements it on the classic LSM-Tree storage engine LevelDB, and analyzes its performance experimentally to verify the effectiveness of the scheme. The main contributions of this paper are as follows:

A new merging scheme, LSMA, combining the leveling scheme and the tiering scheme is proposed. This scheme introduces the append idea of the tiering scheme while ensuring that each level holds only one ordered sequence, further improving the write performance of the LSM-Tree and reducing write amplification with relatively little negative impact on read performance;

The LSMA design is implemented on the storage engine LevelDB, and evaluations are conducted to verify the effect of the LSMA scheme and to analyze its merge performance, read performance, and write performance.

The second part of the paper reviews related work, the third part presents the overall design of LSMA, the fourth part details the implementation of the LSMA scheme on LevelDB, the fifth part presents the experimental design and results, and the sixth part concludes this work.

2. RELATED WORKS

There are two main directions in LSM-Tree write performance research. The first is to reduce write amplification, the ratio of the data size actually written to the data size that needs to be written; high write amplification limits write performance and shortens the lifespan of the hard disk. Much of this work focuses on optimizing the tiering scheme. [12] proposed a group load balancing method: if a group contains too many entries, its key range is narrowed after the group is merged, and the key ranges of other groups at the same level are expanded correspondingly. [13] separates frequently updated hot keys from cold keys in memory and flushes only the cold keys to disk; hot keys are modified directly in memory (and periodically copied to a new log). The second direction is to optimize the merge operation itself, which is crucial to LSM-Tree performance. Problems caused by merging include buffer cache misses after a merge and write stalls during large merges [14]. [15] proposed a new merge trigger mechanism that improves storage engine performance from a long-term perspective. [16] optimizes merge performance by appending from the upper level to the lower level, splitting or consolidating files as needed; it can be regarded as a fusion of the leveling and tiering schemes. [17] likewise argues that the pure tiering and leveling schemes are both suboptimal. Heuristically, for leveling, zero-result lookups, long-range queries, and space amplification are all dominated by the largest level, while the write cost is contributed equally by all levels; a lazy-leveling strategy is therefore introduced in which smaller levels use tiering and the largest level uses leveling. It achieves better write performance than plain leveling with similar lookup, long-range query, and space amplification costs; the disadvantage is that short-range queries are more expensive because there are more components.

3. METHODOLOGY

In this paper, a new merging scheme, LSMA, which combines the advantages of the leveling scheme and the tiering scheme, is proposed, as shown in Figure 1. The main idea is: when upper-level files are merged into the lower level, files are selected as in the leveling scheme, i.e., some files with overlapping key ranges in the upper and lower levels are chosen. When the merge is actually performed, however, the selected files are not reorganized, sorted, and written out to new lower-level files; instead, each upper-level file is "appended", according to its key range, to the lower-level file covering that range.

Figure 1. The principle of compaction in the proposed LSMA scheme.

The advantages of this scheme are twofold: first, as in the leveling scheme, each level still holds only one ordered sequence, which is query-friendly; second, the merge only involves IO for the files selected from the upper level, which gives it the write performance and write amplification advantages of the tiering scheme.
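To make the IO saving concrete, here is a small illustrative C++ sketch (ours, not code from the paper) that counts the bytes moved by the two merge styles, under the simplifying assumption that merge cost is just bytes read plus bytes written:

```cpp
// Illustrative cost model only: a reorganize merge (leveling) reads and
// rewrites both the upper- and lower-level input files, while an LSMA
// append merge reads and writes only the upper-level files.
#include <cstdint>

struct MergeInput {
  uint64_t upper_bytes;  // total size of the selected upper-level files
  uint64_t lower_bytes;  // total size of the overlapping lower-level files
};

// Leveling-style reorganize: read all inputs, write them all back out.
uint64_t ReorganizeIoBytes(const MergeInput& in) {
  return 2 * (in.upper_bytes + in.lower_bytes);
}

// LSMA-style append: read the upper files and append them to the lower
// files; the lower files are neither read nor rewritten.
uint64_t AppendIoBytes(const MergeInput& in) {
  return 2 * in.upper_bytes;
}
```

Whenever the overlapping lower-level files are larger than the upper-level ones, which is the common case in a leveled LSM-Tree, the append merge moves only a fraction of the bytes.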

3.1 Design of the new sstable

According to the LSMA design, the file structure must support the "append" operation, so the original LevelDB sstable layout no longer meets the requirements and a new sstable file format needs to be designed.

There are two issues to consider in this process:

• 1) A new sstable file may contain multiple old sstable files; how should these old sstable files be organized and indexed?

• 2) Unlike the original LevelDB, the LSMA scheme does not delete lower-level files but appends to and updates them. Query operations may therefore touch a file while it is being appended to, which raises concurrency safety issues. Simply protecting the file with a mutex would block reads during writes and writes during reads, hurting performance, so a structure supporting efficient concurrent reading and writing of files must be designed.

Based on the above considerations, the final design of the new sstable file structure is shown in Figure 2(a); it consists of a header and multiple old sstable structures (ssts). The following explains how the two problems above are solved under this new file structure.

Figure 2. The structure of the new sstable.

The organization and indexing of the old sstables is realized through the header, whose structure is shown in Figure 2(b). To simplify reading, the header has a fixed length of 100 bytes: 4 bytes record the number of ssts, and the rest holds at most 12 sst tail offsets, so each new sstable can contain at most 12 appended ssts. This prevents a file from accumulating too many ssts and causing long search times.
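As an illustration, a minimal C++ sketch of this fixed-length header could look as follows; the 8-byte width of each tail offset is our inference from the stated sizes (4 + 12 × 8 = 100 bytes), not something the paper spells out:

```cpp
// A sketch of the fixed 100-byte new sstable header: a 4-byte sst count
// followed by up to 12 tail offsets (assumed 8 bytes each). Encoding uses
// host byte order for brevity; a real format would fix the endianness.
#include <array>
#include <cstdint>
#include <cstring>

constexpr int kMaxSsts = 12;
constexpr int kHeaderSize = 100;

struct NewSstableHeader {
  uint32_t sst_count = 0;                        // number of ssts indexed
  std::array<uint64_t, kMaxSsts> tail_offset{};  // tail offset of each sst

  void EncodeTo(char out[kHeaderSize]) const {
    std::memcpy(out, &sst_count, sizeof(sst_count));
    std::memcpy(out + sizeof(sst_count), tail_offset.data(),
                sizeof(uint64_t) * kMaxSsts);
  }

  static NewSstableHeader DecodeFrom(const char in[kHeaderSize]) {
    NewSstableHeader h;
    std::memcpy(&h.sst_count, in, sizeof(h.sst_count));
    std::memcpy(h.tail_offset.data(), in + sizeof(h.sst_count),
                sizeof(uint64_t) * kMaxSsts);
    return h;
  }
};
```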

For concurrency safety, because the header has a fixed size and appends occur at the end of the file, an append does not affect the header and can proceed safely: until the append completes, the header still indexes only the old ssts, so concurrent query operations are unaware of it. Only after the append completes is exclusive access needed to read and update the header. This is just a 100-byte read-write operation, so it barely affects read performance while still guaranteeing safe access.
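The following hypothetical append routine (reusing the NewSstableHeader sketch above; the I/O helpers are assumed, pwrite/fsync-style primitives) illustrates this protocol: write the new sst past the end of the file, make it durable, and only then take a short exclusive lock to rewrite the 100-byte header:

```cpp
// Sketch of the append protocol: readers that still see the old header
// never observe the half-written sst; publishing is a 100-byte rewrite.
#include <mutex>
#include <string>

class AppendableSstable {
 public:
  bool AppendSst(const std::string& sst_bytes) {  // a fully built sst
    if (header_.sst_count >= kMaxSsts) return false;  // full: merge instead

    // 1) Write and sync the sst beyond the current tail; the header still
    //    indexes only the old ssts, so concurrent queries are unaffected.
    uint64_t new_tail = file_size_ + sst_bytes.size();
    if (!WriteAt(file_size_, sst_bytes) || !Sync()) return false;

    // 2) Publish: briefly take exclusive access and rewrite the header.
    std::lock_guard<std::mutex> guard(header_mu_);
    header_.tail_offset[header_.sst_count++] = new_tail;
    char buf[kHeaderSize];
    header_.EncodeTo(buf);
    if (!WriteAt(0, std::string(buf, kHeaderSize)) || !Sync()) return false;
    file_size_ = new_tail;
    return true;
  }

 private:
  bool WriteAt(uint64_t offset, const std::string& data);  // pwrite-style I/O
  bool Sync();                                             // fsync the file
  std::mutex header_mu_;
  NewSstableHeader header_;
  uint64_t file_size_ = kHeaderSize;  // header occupies the first 100 bytes
};
```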

In addition to the new sstable file design above, the sst file structure itself has been simplified. The modified structure is shown in Figure 2(c). Each data block is now bound to its own filter block, instead of a new filter block being created once the accumulated data reaches a size threshold as in the original LevelDB. Each data block/filter block pair is followed by a 1-byte combination type and a 4-byte CRC checksum.

Next come the index block and the tail. The index block is unchanged, and because filter blocks are bound to data blocks, a separate metadata index block is no longer needed. The tail structure, shown in Figure 2(d), has a fixed length of 64 bytes. Since there is no metadata index block, the metadata offset (meta offset) and metadata size (meta size) fields are removed, the filter name field from the former metadata index block is added, and fields for the number of index entries (index count) and the number of key-value entries (kv num) are added.
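A minimal sketch of this 64-byte tail follows; only the field list and the 64-byte total come from the text, while the individual field widths are our assumptions for illustration:

```cpp
// Assumed layout of the fixed 64-byte sst tail; widths other than the
// 64-byte total are illustrative guesses, not the paper's exact format.
#include <cstdint>

struct SstTail {
  uint64_t index_offset;  // where the index block starts
  uint64_t index_size;    // size of the index block
  uint32_t index_count;   // number of index entries (new field)
  uint32_t kv_num;        // number of key-value entries (new field)
  char filter_name[32];   // filter name, moved here from the old meta index
  char padding[8];        // pad the structure out to exactly 64 bytes
};
static_assert(sizeof(SstTail) == 64, "tail must stay fixed-length");
```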

3.2 Merge Iterator

Unlike in the original LevelDB, the input files of a merge may come from a single level only, and because the file structure has changed, the iterators also need to be redesigned. This results in the following four iterators according to the type of merge, as shown in Figure 3.

Figure 3. The iterator architecture for different files in LSMA.
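Although the paper does not spell out the four iterators here, the core requirement can be illustrated: the ssts inside one new sstable are each internally sorted but may overlap one another, so reading the file as a single sorted stream takes a k-way merge over per-sst iterators, in the spirit of LevelDB's internal merging iterator. A minimal sketch with assumed interfaces:

```cpp
// Merge the ssts of one new sstable into a single sorted key stream.
// SstIter mirrors the shape of LevelDB's Iterator but is our assumption.
#include <string>
#include <utility>
#include <vector>

struct SstIter {
  virtual ~SstIter() = default;
  virtual bool Valid() const = 0;
  virtual void SeekToFirst() = 0;
  virtual void Next() = 0;
  virtual std::string key() const = 0;
  virtual std::string value() const = 0;
};

class NewSstableIter {
 public:
  explicit NewSstableIter(std::vector<SstIter*> children)
      : children_(std::move(children)) {}
  void SeekToFirst() {
    for (auto* c : children_) c->SeekToFirst();
    FindSmallest();
  }
  bool Valid() const { return current_ != nullptr; }
  std::string key() const { return current_->key(); }
  std::string value() const { return current_->value(); }
  void Next() { current_->Next(); FindSmallest(); }

 private:
  // Linear scan for the smallest key; a real implementation would also
  // prefer the newest sst when the same key appears in several ssts.
  void FindSmallest() {
    current_ = nullptr;
    for (auto* c : children_)
      if (c->Valid() && (current_ == nullptr || c->key() < current_->key()))
        current_ = c;
  }
  std::vector<SstIter*> children_;
  SstIter* current_ = nullptr;
};
```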

3.3 The writeback after merging

Writeback refers to the handling of the key-value records retained after cleaning. The original LevelDB simply appends them to a newly created file, creating another new file whenever the size threshold is exceeded. The proposed merge scheme has additional cases, with two different ways of writing files:

• 1) Data that should be appended follows the division rule of "may exceed on the left, must not exceed on the right" and is appended directly to the corresponding lower-level file; see Figure 4 and the sketch below for details.

• 2) All other records are written to newly created files. This covers two cases: no lower-level file has a corresponding key range, or the target lower-level file already exceeds the sst count threshold.

Figure 4. The LSMA merge rule: appended data may exceed the lower file's range on the left but not on the right.
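The following hedged C++ sketch shows one per-record reading of this routing (names are illustrative, and we interpret "may exceed left, not right" as: a record may be appended to a lower-level file whose right boundary covers its key, extending the file's range downward but never upward):

```cpp
// Route a retained record either to an existing lower-level file (case 1)
// or to a newly created output file (case 2, signalled by -1).
#include <string>
#include <vector>

struct LowerFile {
  std::string smallest, largest;  // current key range of the file
  bool over_sst_threshold;        // already holds the maximum number of ssts
};

// 'files' are the lower-level files sorted by key range.
int RouteRecord(const std::string& key, const std::vector<LowerFile>& files) {
  for (size_t i = 0; i < files.size(); ++i) {
    // First file whose right boundary covers the key: appending may push its
    // left boundary down (allowed) but never its right boundary (forbidden),
    // so ordering between adjacent lower-level files is preserved.
    if (key <= files[i].largest) {
      if (files[i].over_sst_threshold) return -1;  // case 2: rewrite instead
      return static_cast<int>(i);                  // case 1: append here
    }
  }
  return -1;  // case 2: no lower-level file has a corresponding key range
}
```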

4. IMPLEMENTATION

4.1 The input file selection

Because the header of the new sstable file has a fixed length, the number of ssts it can contain is limited, so a new trigger condition is needed in addition to the original ones: the number of ssts in a new sstable reaching the threshold. In practice, two situations affect the merging of such a file. First, if the new sstable is selected as an upper-level input file, it is read and then deleted, and the threshold is irrelevant. Second, if the new sstable is selected as a lower-level file, appending the upper-level files to it may push it over the threshold; in that case the new sstable is treated as an ordinary input file and participates in reorganization, sorting, and cleaning together with the upper-level files.

Since the input file selection process in the original LevelDB is overly complicated, a new process was designed, as shown in Figure 5. There are two main improvements. First, expansion by sequence number and expansion by key range are integrated into one function: the file set is expanded by key range on the left and by sequence number on the right. Second, to remove the redundant step of expanding the upper-level file set under the constraints of not increasing the number of lower-level files and not exceeding the file size threshold, the new scheme computes the "sub-boundary" of the lower-level file set after the first expansion and then reversely expands the upper-level file set within this sub-boundary. Assuming the expanded lower-level file set is g, the sub-boundary is the position indicated by the two dotted lines, i.e., the right boundary of file f and the left boundary of file h. Reversely expanding the upper-level files within this sub-boundary is guaranteed not to grow the lower-level file set; a sketch follows Figure 5.

Figure 5. The process of selecting input files for compaction.
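A minimal sketch of the sub-boundary computation, assuming files hold plain string key boundaries and the lower level is sorted by key (all names are ours):

```cpp
// Given the expanded lower-level input set [lo, hi], the sub-boundary is
// the right boundary of the preceding file (f in the paper's example) and
// the left boundary of the following file (h). Upper-level files are then
// re-expanded only within this boundary, so the lower set cannot grow.
#include <string>
#include <utility>
#include <vector>

struct FileMeta { std::string smallest, largest; };

std::pair<std::string, std::string> SubBoundary(
    const std::vector<FileMeta>& lower, size_t lo, size_t hi) {
  std::string left = (lo > 0) ? lower[lo - 1].largest
                              : std::string();  // open on the left
  std::string right = (hi + 1 < lower.size()) ? lower[hi + 1].smallest
                                              : std::string();  // open right
  return {left, right};
}

// An upper-level file may join the input set only if it stays strictly
// inside the sub-boundary, guaranteeing no extra lower files are pulled in.
bool FitsSubBoundary(const FileMeta& f, const std::string& left,
                     const std::string& right) {
  bool ok_left = left.empty() || f.smallest > left;
  bool ok_right = right.empty() || f.largest < right;
  return ok_left && ok_right;
}
```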

4.2 Caching and Versioning

In traditional storage engines files are only added and deleted, but here they are also modified, so modified files require separate cache handling, otherwise query consistency would be affected. Because a new sstable contains multiple ssts whose key ranges may overlap, both the filters and the index blocks are kept as arrays; to adapt to the modified sst format, the FilterBlockReader was rewritten and an IndexBlock reader was added. The cache of an appended file is updated as follows: immediately after each append completes and the file is closed, the relevant content is forcibly re-read from the hard disk to refresh the cache.
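A hedged sketch of this refresh timing follows; the cache interface is hypothetical (LevelDB's real TableCache differs), but it captures the evict-then-reload step performed right after an append closes:

```cpp
// Hypothetical table cache: refresh an appended file's cached state so
// later queries see the new header, filter array, and index array.
#include <cstdint>

class TableCache {
 public:
  void Evict(uint64_t file_number);    // drop the stale cached entry
  void Preload(uint64_t file_number);  // re-open the file and re-read its
                                       // header, filters, and index blocks
};

// Called immediately after an append to file 'fno' completes and the file
// is closed, forcing the cache to be rebuilt from the on-disk state.
void RefreshAppendedFile(TableCache* cache, uint64_t fno) {
  cache->Evict(fno);
  cache->Preload(fno);
}
```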

After the cache is updated, the merge process is basically complete; what remains is the version iteration of the files themselves. Because of the special nature of new sstable appends, version iteration must also be designed for them separately. All files that differ from the previous version are recorded in a VersionEdit, and these files are merged with the files of the corresponding level of the previous version to form a new version, which is recorded in the version chain and processed in queue order, as shown in Figure 6.

Figure 6. The version iteration of files in LSMA.
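As a rough illustration (hypothetical fields; LevelDB's real VersionEdit differs), applying such an edit could look like this: appended files overwrite their old entries, newly created files are inserted, and the result becomes the next version in the chain:

```cpp
// Sketch of version iteration for appended files: every file that differs
// from the previous version is logged in the edit and merged with that
// level's files to form the new version.
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct FileChange {
  int level;
  uint64_t file_number;
  uint32_t sst_count;  // the new basic attribute carried by a new sstable
};

struct LsmaVersionEdit {
  std::vector<std::pair<int, uint64_t>> deleted_files;  // (level, file no.)
  std::vector<FileChange> changed_files;  // appended and newly created files
};

// A version: for each level, the live files keyed by file number.
using Version = std::map<int, std::map<uint64_t, FileChange>>;

Version Apply(const Version& base, const LsmaVersionEdit& edit) {
  Version next = base;
  for (const auto& [level, fno] : edit.deleted_files) next[level].erase(fno);
  // An appended file replaces its old entry (carrying the new sst_count);
  // a brand-new file is simply inserted.
  for (const FileChange& f : edit.changed_files)
    next[f.level][f.file_number] = f;
  return next;  // linked into the version chain and processed in queue order
}
```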

To persist the version iteration records, the sst count attribute introduced by the new sstable is recorded in the manifest file, so that the LSM-Tree can be reconstructed after the database restarts.

5. EVALUATION

5.1 Experiment setup

The experimental environment is shown in Table 1. The key of each key-value record is set to 16 bytes and the value to 100 bytes. On the basis of verifying the effectiveness of the scheme, we test LSMA's merge performance, write performance, and query performance; the specific experimental methods are described in the following subsections.

Table 1. The experimental setup.

CPU: 8 × Intel(R) Core i5-11357G
Memory: 16GB
Operating System: Ubuntu 18.04 LTS
Docker Version: v20.10.8
Storage Engine: LevelDB v1.23

5.2 Merge performance

A large amount of data is written continuously to the storage engine, and the number of merges, the merge IO volume, and the merge latency of the LSMA scheme and the original LevelDB are compared under the same conditions.

The comparison of the number of merges is shown in Figure 7. As the amount of written data increases, the LSMA scheme performs roughly twice as many merges as LevelDB, with a tendency to keep growing. The comparison of merge IO volume is shown in Figure 8. Even though the LSMA scheme performs more merges, its total IO volume is lower than LevelDB's: the amount of data read stays at about 55% of LevelDB's, and the amount written at about 60%. The difference between the read and write ratios arises because LSMA cleans less data than LevelDB.

Figure 7. The number of compactions.

Figure 8. The I/O volume of compaction.

The merge latency comparison is shown in Figure 9. The maximum latency of a single merge under the LSMA scheme is comparable to LevelDB's, but as the amount of written data increases, the minimum latency is an order of magnitude lower than LevelDB's. As shown in Figure 10, the total latency of the LSMA scheme is about 70% of LevelDB's. Combined with the previous results, this shows that the LSMA scheme performs a large number of lightweight merges, which is beneficial and reduces the probability of write stalls.

Figure 9. Compaction latency.

Figure 10. Total compaction latency.

5.3 Write performance

LevelDB's own db_bench tool, with slight changes, is used to test and compare the sequential write, random write, and overwrite performance of the LSMA scheme and the original LevelDB under the same conditions.

The sequential write results are shown in Figures 11 and 12; there is little difference between the two schemes. This is because the merges triggered by sequential writes never have overlapping key ranges, so single files are simply moved down a level and the append advantage of LSMA cannot show itself. For random writes, the overall improvement first rises and then falls to a stable level; after repeated tests with large data volumes, it stabilizes above 50% in the later stages. The overwrite results are shown in Figures 13 and 14; the LSMA scheme again outperforms LevelDB, although the improvement shrinks as the amount of data in the storage engine grows. After repeated tests with large data volumes, it stabilizes above 20% in the later stages. In summary, compared with the original LevelDB, the LSMA scheme improves random write performance by more than 50% (up to 105%) and overwrite performance by more than 20% (up to 91%).

Figure 11. Average write latency.

Figure 12. Write rate.

Figure 13. Average write latency.

Figure 14. Write rate.

5.4 Query performance

In addition, the data block cache is set to 16MB, and the Bloom filter uses an average of 4 bits per key. We first write a given number of entries in order, and then compare the sequential read, random read, and missing-read performance of the LSMA scheme and the original LevelDB.

The sequential read, random read, and missing-read results are shown in Figures 15 to 18. Thanks to the cache and the Bloom filter, there is almost no gap between the read performance of the LSMA scheme and LevelDB; when the amount of data is small, the improved sst even gives LSMA slightly better read performance across the board. Even as the amount of data grows, the read performance gap between the LSMA scheme and LevelDB stays within 5%.

Figure 15. Sequential read.

Figure 16. Random read.

Figure 17. Missing rate 25%.

Figure 18. Missing rate 50%.

6. SUMMARY

This paper takes LSM-Tree-based storage engines as its research object and focuses on the core of the LSM-Tree, the merge operation, aiming to reduce write amplification and improve write performance while minimizing the impact on read performance. The LSMA scheme was proposed and implemented on Google's classic LSM-Tree storage engine LevelDB. Comparative experiments show that, relative to LevelDB, the LSMA scheme performs a large number of lightweight merges, which reduces the probability of write stalls and the contention for read and write IO; as a result it improves random write performance by more than 50% and overwrite performance by more than 20%. Sequential writes perform similarly to LevelDB because they do not exercise LSMA's append mechanism. For reads, LSMA is even slightly better than LevelDB when the amount of data is small, and even as the amount of data grows the impact stays within 5%. Future work can extend the experiments to larger data sets and continue to explore LSM-Tree optimization for new hardware and distributed environments.

7. REFERENCES

[1] LevelDB [EB/OL], (2022). http://LevelDB.org/

[2] RocksDB [EB/OL], (2022). http://rocksdb.org/

[3] Becker M Y, Sewell P, "Cassandra: Flexible trust management, applied to electronic health records," in Proceedings of the 17th IEEE Computer Security Foundations Workshop, 139–154 (2004).

[4] HBase [EB/OL]. https://hbase.apache.org/

[5] Sears R, Ramakrishnan R, "bLSM: a general purpose log structured merge tree," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 217–228 (2012).

[6] Bloom B H, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7), 422–426 (1970). https://doi.org/10.1145/362686.362692

[7] Luo C, Carey M J, "Efficient data ingestion and query processing for LSM-based storage systems," PVLDB, 531–543 (2019).

[8] Zhang H, Lim H, Leis V, et al., "SuRF: Practical range query filtering with fast succinct tries," in Proceedings of the 2018 International Conference on Management of Data, 323–336 (2018).

[9] Lu L, Pillai T S, Gopalakrishnan H, et al., "WiscKey: Separating keys from values in SSD-conscious storage," ACM Transactions on Storage (TOS), 13(1), 1–28 (2017). https://doi.org/10.1145/3033273

[10] Lai C, Jiang S, Yang L, et al., "Atlas: Baidu's key-value storage system for cloud data," in 2015 31st Symposium on Mass Storage Systems and Technologies (MSST), 1–14 (2015).

[11] Pan F F, Yue Y L, Xiong J, "dCompaction: Speeding up compaction of the LSM-tree via delayed compaction," Journal of Computer Science and Technology, 32(1), 41–54 (2017). https://doi.org/10.1007/s11390-017-1704-4

[12] Yao T, Wan J, Huang P, et al., "Building efficient key-value stores via a lightweight compaction tree," ACM Transactions on Storage (TOS), 13(4), 1–28 (2017). https://doi.org/10.1145/3139922

[13] Balmau O, Didona D, Guerraoui R, et al., "TRIAD: Creating synergies between memory, disk and log in log structured key-value stores," in 2017 USENIX Annual Technical Conference, 363–375 (2017).

[14] Balmau O, Dinu F, Zwaenepoel W, et al., "SILK: Preventing latency spikes in log-structured merge key-value stores," in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 753–766 (2019).

[15] Sarkar S, Papon T I, Staratzis D, et al., "Lethe: A tunable delete-aware LSM engine," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 893–908 (2020).

[16] Gong C, He S, Gong Y, et al., "On integration of appends and merges in log-structured merge trees," in Proceedings of the 48th International Conference on Parallel Processing, 1–10 (2019).

[17] Dayan N, Idreos S, "Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging," in Proceedings of the 2018 International Conference on Management of Data, 505–520 (2018).
