1. INTRODUCTION

With the rapid development of information technology and electronic hardware, modern society processes ever larger volumes of data, various unstructured and dense data scenarios appear more frequently, and NoSQL databases are increasingly common in industrial settings. The LSM-Tree (Log-Structured Merge-Tree) is widely used in the storage engines of modern NoSQL database systems such as LevelDB [1], RocksDB [2], Cassandra [3] and HBase [4]. Among them, LevelDB is an open-source persistent key-value storage engine released by Google in 2012 that is still actively maintained; its design is elegant, and RocksDB is derived from it, so research on optimizing this storage engine is of great value.

The LSM-Tree is a hierarchical, ordered, disk-based data structure. As the most common storage architecture for key-value databases, it performs particularly well under frequent writes, so its optimization has become an important research direction. bLSM [5] introduced the Bloom filter [6] into the LSM-Tree for the first time; based on a hash structure, it can quickly determine whether a record may exist in a file, avoiding many unnecessary lookups and thereby improving read performance. [7] proposed the idea of a block Bloom filter, which reduces the amount of Bloom filter data that must be read. Bloom filters do not support range queries, and the range filter proposed in [8] supports them better. These improvements have greatly increased the read performance of the LSM-Tree.

One of the most important characteristics of the LSM-Tree is that deleting and updating records are not true in-place delete and modify operations; they are performed like writes of new records, which makes records multi-versioned. As a result, an LSM-Tree storage system accumulates a large amount of redundant and expired data, and cleaning up and merging this data brings read and write overhead that seriously degrades write performance in write-intensive scenarios. The works [9, 10] reduce the cost of on-disk data reorganization by storing keys and values separately. This approach works well when the value is much larger than the key, but the benefit is small for small values, and it also degrades read performance. [11] improves write performance by reducing the frequency of on-disk compaction, but it adds a large amount of metadata compared with the original LSM-Tree, which makes the system considerably more complicated to implement. Write performance is the main advantage of the LSM-Tree structure and a perennial topic in storage engine research, so studying its optimization is of great significance.

Based on this, this paper proposes a new scheme, LSMA, which optimizes the write path of the LSM-Tree while minimizing the impact on read performance. The scheme is implemented on the classic LSM-Tree storage engine LevelDB to verify its effectiveness, and its performance is analyzed experimentally. The main contributions of this paper are as follows: (1) A new merge scheme, LSMA, combining the leveling scheme and the tiering scheme, is proposed.
It introduces the append idea of the tiering scheme while ensuring that each level still contains only one ordered run, further improving the write performance of the LSM-Tree and reducing write amplification with relatively little negative impact on read performance. (2) The LSMA design is implemented on top of the storage engine LevelDB, and evaluations are conducted to verify the effect of the scheme and to analyze its merge, read and write performance.

The rest of the paper is organized as follows. Section 2 introduces related work, Section 3 presents the overall design of LSMA, Section 4 describes the implementation of LSMA on LevelDB in detail, Section 5 presents the experimental design and results, and Section 6 concludes the work.

2. RELATED WORK

There are two main directions in research on LSM-Tree write performance. The first is to reduce write amplification, the ratio of the amount of data actually written to the amount of data that needs to be written; high write amplification limits write performance and shortens the lifespan of the disk. Much of this work focuses on optimizing the tiering scheme. [12] proposed a group load-balancing method: if a group contains too many entries, the key range it covers is narrowed when the group is merged, and the key ranges of the other groups at the same level are expanded correspondingly. [13] separates frequently updated hot keys from cold keys in memory and flushes only the cold keys to disk; hot keys can be modified directly in memory (and are also periodically copied to a new log). The second direction is to optimize the merge operation itself, which is crucial to LSM-Tree performance. Problems caused by merging include buffer cache misses after a merge and write stalls during large merges [14]. [15] proposed a new merge trigger mechanism that improves storage engine performance from a long-term perspective. [16] optimizes merge performance by appending data from the upper level to the lower level; it can be regarded as a fusion of the leveling and tiering schemes, in which a merge from the upper level to the lower level is an append, and files are split or consolidated according to their state. [17] also argues that the pure tiering and pure leveling schemes are both suboptimal: heuristically, for leveling, zero-result lookups, long-range queries and space amplification are all dominated by the largest level, while the write cost is contributed equally by all levels. It therefore introduces a lazy-leveling strategy in which the smaller levels use tiering and the largest level uses leveling; this gives better write performance than plain leveling with similar lookup, long-range query and space amplification costs, at the price of more expensive short-range queries because there are more components.

3. METHODOLOGY

In this paper, a new merge scheme, LSMA, which combines the advantages of the leveling scheme and the tiering scheme, is proposed, as shown in Figure 1.
The main idea is as follows. When upper-level files are merged into the lower level, the input files are selected as in the leveling scheme, that is, files with overlapping key ranges in the upper and lower levels are chosen. During the merge itself, however, the selected upper- and lower-level files are not reorganized, sorted and written out to new lower-level files. Instead, the data of the upper-level files is partitioned by key range and "appended" to the lower-level file covering the corresponding key range, as illustrated in Figure 1. The advantages of this scheme are twofold: first, as in the leveling scheme, each level still contains only one ordered run, which keeps it query-friendly; second, the merge only involves the I/O of the files selected from the upper level, so it retains the write performance and write amplification advantages of the tiering scheme.
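To make the process concrete, the following is a minimal conceptual sketch of an LSMA-style append merge. It is not the paper's actual code or LevelDB's API: FileMeta, CollectRange and AppendSst are hypothetical placeholders, and kMaxSsts stands for the per-file append limit introduced in Section 3.1.

```cpp
// Conceptual sketch of an LSMA append merge (not the paper's code and not
// LevelDB's API). CollectRange and AppendSst are trivial placeholders that
// stand in for the real record-collection and file-append routines.
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::string, std::string>;  // key, value

constexpr int kMaxSsts = 12;  // assumed per-file append limit (Section 3.1)

struct FileMeta {
  std::string smallest, largest;  // key range covered by the file
  int sst_count = 1;              // ssts already stored in the file
};

// Placeholder: gather the upper-level records whose keys fall in [lo, hi].
std::vector<Record> CollectRange(const std::vector<FileMeta>& upper,
                                 const std::string& lo,
                                 const std::string& hi) {
  (void)upper; (void)lo; (void)hi;
  return {};
}

// Placeholder: write the records as one new sst at the tail of `target`,
// then update the fixed-size header (sst count and tail offsets).
void AppendSst(FileMeta& target, const std::vector<Record>& records) {
  (void)records;
  ++target.sst_count;
}

void LsmaMerge(const std::vector<FileMeta>& upper,
               std::vector<FileMeta>& lower) {
  for (FileMeta& target : lower) {
    // Leveling-style selection: only lower-level files whose key ranges
    // overlap the upper-level inputs are touched at all.
    std::vector<Record> records =
        CollectRange(upper, target.smallest, target.largest);
    if (records.empty()) continue;
    if (target.sst_count < kMaxSsts) {
      AppendSst(target, records);  // tiering-style append, no rewrite
    } else {
      // The file already holds the maximum number of ssts: fall back to a
      // normal rewrite merge for this file (see Section 4.1).
    }
  }
}
```

The point of the sketch is the control flow: overlapping files are selected exactly as in the leveling scheme, but the common case is a cheap append that rewrites none of the lower-level data.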
3.1 Design of the new sstable

According to the LSMA design, the file structure must support the "append" operation, so the original LevelDB sstable format no longer meets the requirements and a new sstable file format needs to be designed. There are two issues to consider in this process: how to organize and index the multiple old-style sstables contained in one file, and how to guarantee safe concurrent access while a file is being appended to.

Based on the above considerations, the final design of the new sstable file structure is shown in Figure 2 a): it consists of a header and multiple old-style sstable structures (ssts). The following explains how the two problems above are solved under this file structure.

The organization and indexing of the old sstables is handled by the header, which indexes the ssts. The header structure is shown in Figure 2 b). To make it easy to read, the header has a fixed length of 100 bytes, of which 4 bytes record the number of ssts and the remaining bytes record the tail offsets of at most 12 ssts; that is, each new sstable can contain at most 12 appended ssts. This prevents a file from accumulating too many ssts and incurring excessive search time.

As for concurrency safety, the header has a fixed size and appends are performed at the end of the file, so an append does not affect the header and can be carried out safely and directly (before the append completes, the header still indexes the old set of ssts, so query operations are unaware of the append). Only after the appended sst has been fully written is exclusive access needed to read and update the header, which is merely a 100-byte read and write; this has essentially no effect on read performance while still guaranteeing safe access.

In addition to the new sstable file design, the sst structure itself has been simplified. The modified structure is shown in Figure 2 c). After the change, each data block is bound to its own filter block, one filter block per data block, instead of creating a new filter block according to the amount of accumulated data records as in the original LevelDB. Each data block and filter block pair is followed by a 1-byte combination type and a 4-byte CRC checksum. Then come the index block and the tail. The index block is unchanged, and because filter blocks are bound to data blocks, a separate metadata index block is no longer needed. The tail structure is shown in Figure 2 d) and has a fixed length of 64 bytes. Because there is no metadata index block, the metadata offset (meta offset) and metadata size (meta size) fields are removed, the filter name field from the old metadata index block is added, and fields for the number of index entries (index count) and the number of key-value entries (kv num) are also added.
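As a concrete illustration, a minimal sketch of this header layout follows. The 4-byte count and the limit of 12 ssts come from the description above; the 8-byte width of each tail offset is an assumption of the sketch, chosen so that the count plus twelve offsets fills exactly 100 bytes.

```cpp
// Sketch of the fixed 100-byte new-sstable header: a 4-byte sst count
// followed by up to 12 sst tail offsets. The 8-byte offset width is an
// assumption of this sketch (4 + 12 * 8 = 100).
#include <cstdint>
#include <cstring>

constexpr int kMaxSsts = 12;
constexpr int kHeaderSize = 100;

struct NewSstableHeader {
  uint32_t sst_count = 0;                   // number of ssts in the file
  uint64_t sst_tail_offset[kMaxSsts] = {};  // end offset of each sst

  // Serialize into the fixed-size on-disk header region.
  void EncodeTo(char out[kHeaderSize]) const {
    std::memset(out, 0, kHeaderSize);
    std::memcpy(out, &sst_count, sizeof(sst_count));
    std::memcpy(out + sizeof(sst_count), sst_tail_offset,
                sizeof(uint64_t) * sst_count);
  }
};

static_assert(sizeof(uint32_t) + kMaxSsts * sizeof(uint64_t) == kHeaderSize,
              "a 4-byte count plus 12 8-byte offsets fills 100 bytes");
```

Because the header size never changes, an append only has to rewrite these 100 bytes after the new sst is fully on disk, which is what makes the concurrent-read argument above work.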
3.2 Merge iterators

Compared with the original LevelDB, the input files of a merge may come from only one level, and the file structure has changed, so the merge iterators also need to be redesigned. In the end there are four iterators, chosen according to the type of merge, as shown in Figure 3.

3.3 Write-back after merging

Write-back refers to the handling of the key-value records that are retained after cleaning. The original LevelDB simply appends them to a newly created file and, once the file size exceeds the configured threshold, creates another new file. In the proposed merge scheme, however, there are additional cases, which mainly involve two different ways of writing files:
4. IMPLEMENTATION

4.1 Input file selection

Because the header of the new sstable file has a fixed length, the number of ssts it can contain is limited, so in theory there should be one more merge trigger condition than in the original design: the number of ssts contained in a new sstable reaching the threshold. In the actual implementation, two situations affect the merging of such a file. The first is that the new sstable is used as an upper-level input file of a merge; in that case the file is read and then deleted, and the threshold is never exceeded. The second is that the new sstable is selected as a lower-level file of a merge; in that case appending the upper-level files to it may well push it past the threshold, in which case the new sstable is simply used as an input file and participates in reorganization, sorting and cleaning together with the upper-level files.

Since the process of selecting input files in the original LevelDB is overly complicated, a new process was designed, as shown in Figure 5. There are two main improvements. First, expansion by sequence number and expansion by key range are combined into one function: the left side of the file set is expanded by key range and the right side by sequence number. Second, to eliminate the redundant step of expanding the upper-level files under the constraints that the number of lower-level input files must not grow and the file size threshold must not be exceeded, the new selection scheme computes the "sub-boundary" of the lower-level file set after the first expansion and then reversely expands the upper-level file set within this sub-boundary. Assuming the expanded lower-level file set is g, the sub-boundary of the lower file set is the position indicated by the two dotted lines in the figure, namely the right boundary of file f and the left boundary of file h. Using this sub-boundary to reversely expand the upper-level files is guaranteed not to enlarge the lower-level file set, as illustrated by the sketch below.
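The sketch below illustrates the sub-boundary computation and the reverse expansion of the upper level under simplified assumptions: FileMeta and the per-level vectors are hypothetical stand-ins for LevelDB's Version metadata, and f and h are taken to be the immediate neighbours of the expanded lower-level set g.

```cpp
// Sketch of the "sub-boundary" reverse expansion. The lower level is a
// sorted, non-overlapping sequence of files; g is the already expanded
// lower-level input set, f its left neighbour, h its right neighbour.
#include <string>
#include <utility>
#include <vector>

struct FileMeta {
  std::string smallest, largest;  // key range of the file
};

// Returns the sub-boundary (right key of f, left key of h); an empty
// string means the boundary is open because g touches the level edge.
std::pair<std::string, std::string> SubBoundary(
    const std::vector<FileMeta>& lower, size_t g_first, size_t g_last) {
  std::string left = (g_first > 0) ? lower[g_first - 1].largest : "";
  std::string right =
      (g_last + 1 < lower.size()) ? lower[g_last + 1].smallest : "";
  return {left, right};
}

// Keep only upper-level candidates that lie strictly inside the
// sub-boundary, so the recomputed lower-level overlap cannot grow.
std::vector<FileMeta> ExpandUpperWithin(
    const std::vector<FileMeta>& upper_candidates,
    const std::pair<std::string, std::string>& bound) {
  std::vector<FileMeta> picked;
  for (const FileMeta& file : upper_candidates) {
    bool above_left = bound.first.empty() || file.smallest > bound.first;
    bool below_right = bound.second.empty() || file.largest < bound.second;
    if (above_left && below_right) picked.push_back(file);
  }
  return picked;
}
```

Restricting the upper-level expansion to this gap means the overlap with the lower level can never reach beyond g, so the lower-level input set cannot grow.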
4.2 Caching and versioning

In a traditional storage engine files are only added and deleted, but in LSMA they are also modified. Modified files therefore require separate cache handling, otherwise query consistency would be affected. Because a new sstable contains multiple ssts whose key ranges may overlap, both the filters and the index blocks are organized as arrays, and the FilterBlockReader and IndexBlock readers were rewritten to accommodate the modified sst format. The cache of an appended file is updated immediately after the file has been written and closed, by forcing the relevant content to be re-read from disk. Once the cache has been updated, the merge is essentially complete, and what remains is the version iteration of the files themselves. Because of the particularity of appending to a new sstable, version iteration also has to be designed specifically for it: all files that differ from the previous version are recorded in a VersionEdit, and these files are merged with the files of the corresponding level in the previous version to form a new version, which is recorded in the version chain, queued and processed, as shown in Figure 6. To persist these version updates, the sst count attribute introduced by the new sstable is also recorded in the manifest file, so that the LSM-Tree can be reconstructed after the database is restarted.

5. EVALUATION

5.1 Experiment setup

The experimental environment is shown in Table 1. The key of each key-value record is set to 16 bytes and the value to 100 bytes. On the basis of verifying the effectiveness of the scheme, we test LSMA's merge performance, write performance and query performance; the specific experimental methods are described below.

Table 1. The experimental setup.
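For reference, the write workload described above can be expressed with LevelDB's public API roughly as follows. This is a sketch of the setup only; the database path and record count are placeholders, not values from the paper.

```cpp
// Sketch of the write workload: records with 16-byte keys and 100-byte
// values written through LevelDB's public API.
#include <cassert>
#include <cstdio>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;

  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/lsma_bench", &db);
  assert(s.ok());

  const int kNumRecords = 1000000;    // placeholder record count
  const std::string value(100, 'v');  // 100-byte value
  for (int i = 0; i < kNumRecords; i++) {
    char key[17];
    std::snprintf(key, sizeof(key), "%016d", i);  // 16-byte key
    s = db->Put(leveldb::WriteOptions(), key, value);
    assert(s.ok());
  }
  delete db;
  return 0;
}
```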
5.2 Merge performance

We continuously write a large amount of data to the storage engine and then compare the number of merges, the merge I/O volume and the merge latency of the LSMA scheme and the original LevelDB scheme under the same conditions. The comparison of the number of merges is shown in Figure 7: as the amount of written data increases, the LSMA scheme performs roughly twice as many merges as LevelDB, and the gap tends to keep growing. The comparison of merge I/O is shown in Figure 8. Combined with the number of merges, it can be seen that even though the LSMA scheme performs more merges, its total I/O volume is lower than LevelDB's: the amount of data read stays at about 55% of LevelDB's and the amount written at about 60%. The difference between the read and write ratios arises because less data is cleaned up than in LevelDB. The merge latency comparison is shown in Figure 9. The maximum latency of a single merge in the LSMA scheme is comparable to LevelDB's, but as the amount of written data increases the minimum latency is an order of magnitude lower than LevelDB's. As shown in Figure 10, the total merge latency of the LSMA scheme is about 70% of LevelDB's. Together with the previous results, this shows that the LSMA scheme performs a large number of lightweight merges, which is clearly beneficial and reduces the probability of write stalls.

5.3 Write performance

Using a slightly modified version of LevelDB's own db_bench, we test and compare the sequential write, random write and overwrite performance of the LSMA scheme and the original LevelDB scheme under the same conditions. The sequential write results are shown in Figures 11 and 12; there is little difference between the two schemes. This is because the merges triggered by sequential writes never involve overlapping key ranges, so a single file is moved directly to the lower level and the append advantage of LSMA cannot come into play. For random writes, the improvement first rises and then falls back to a stable level as data is written; after repeated tests with large amounts of data, it stabilizes at more than 50% in the later stage. The overwrite results are shown in Figures 13 and 14. The LSMA scheme again outperforms LevelDB, but as more data is written and the amount of data in the storage engine grows, the improvement shrinks; after repeated tests with large amounts of data, it stabilizes at more than 20% in the later stage. In summary, compared with the original LevelDB, the LSMA scheme improves random write performance by more than 50% (up to 105%) and overwrite performance by more than 20% (up to 91%).

5.4 Query performance

For the query tests, the data block cache is set to 16 MB and the Bloom filter uses 4 bits per key on average. We first write a certain number of entries in order and then compare the sequential read, random read and missing-key read performance of the LSMA scheme and the original LevelDB scheme. The results are shown in Figures 15-18.
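The cache and filter configuration used for these query tests corresponds to the following LevelDB options (a sketch; the database path is a placeholder, and the modified engine may wire these options differently).

```cpp
// Sketch of the query-test configuration: a 16 MB data block cache and a
// Bloom filter with 4 bits per key, expressed with LevelDB's public API.
#include <cassert>
#include "leveldb/cache.h"
#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::Options options;
  options.block_cache = leveldb::NewLRUCache(16 * 1048576);  // 16 MB block cache
  options.filter_policy = leveldb::NewBloomFilterPolicy(4);  // 4 bits per key

  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/lsma_bench", &db);
  assert(s.ok());
  // ... run the sequential, random and missing-key read workloads here ...
  delete db;
  delete options.filter_policy;
  delete options.block_cache;
  return 0;
}
```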
Thanks to the block cache and the Bloom filter, there is almost no gap between the read performance of the LSMA scheme and that of LevelDB; when the amount of data is small, the improved sst even gives LSMA slightly better read performance across all read workloads. Even as the amount of data grows, the read performance gap between the LSMA scheme and LevelDB stays within 5%.

6. SUMMARY

This paper takes the storage engine based on the LSM-Tree architecture as its research object and focuses on the core of the LSM-Tree, the merge operation, with the goals of reducing write amplification, improving write performance, and minimizing the impact on read performance. The LSMA scheme was proposed and implemented on top of Google's classic LSM-Tree storage engine LevelDB. Comparative experiments show that, compared with LevelDB, the LSMA scheme performs a large number of lightweight merges, which reduces the probability of write stalls and the contention between read and write I/O; as a result, it improves random write performance by more than 50% and overwrite performance by more than 20%. Sequential writes perform similarly to LevelDB because they do not exercise the distinctive features of LSMA. In addition, read performance is even slightly better than LevelDB when the amount of data is small, and the impact stays within 5% as the amount of data grows. Future work can extend the experiments to larger data sets and continue to explore LSM-Tree optimization for new hardware and distributed environments.

REFERENCES
[1] LevelDB [EB/OL], http://LevelDB.org/ (2022).
[2] RocksDB [EB/OL], http://rocksdb.org/ (2022).
[3] Becker M. Y., Sewell P., "Cassandra: Flexible trust management, applied to electronic health records," Proceedings of the 17th IEEE Computer Security Foundations Workshop, 139-154 (2004).
[4] HBase [EB/OL], https://hbase.apache.org/ (2022).
[5] Sears R., Ramakrishnan R., "bLSM: A general purpose log structured merge tree," Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 217-228 (2012).
[6] Bloom B. H., "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7), 422-426 (1970). https://doi.org/10.1145/362686.362692
[7] Luo C., Carey M. J., "Efficient data ingestion and query processing for LSM-based storage systems," PVLDB, 531-543 (2019).
[8] Zhang H., Lim H., Leis V., et al., "SuRF: Practical range query filtering with fast succinct tries," Proceedings of the 2018 International Conference on Management of Data, 323-336 (2018).
[9] Lu L., Pillai T. S., Gopalakrishnan H., et al., "WiscKey: Separating keys from values in SSD-conscious storage," ACM Transactions on Storage (TOS), 13(1), 1-28 (2017). https://doi.org/10.1145/3033273
[10] Lai C., Jiang S., Yang L., et al., "Atlas: Baidu's key-value storage system for cloud data," 2015 31st Symposium on Mass Storage Systems and Technologies (MSST), 1-14 (2015).
[11] Pan F. F., Yue Y. L., Xiong J., "dCompaction: Speeding up compaction of the LSM-tree via delayed compaction," Journal of Computer Science and Technology, 32(1), 41-54 (2017). https://doi.org/10.1007/s11390-017-1704-4
[12] Yao T., Wan J., Huang P., et al., "Building efficient key-value stores via a lightweight compaction tree," ACM Transactions on Storage (TOS), 13(4), 1-28 (2017). https://doi.org/10.1145/3139922
[13] Balmau O., Didona D., Guerraoui R., et al., "TRIAD: Creating synergies between memory, disk and log in log structured key-value stores," 2017 USENIX Annual Technical Conference, 363-375 (2017).
[14] Balmau O., Dinu F., Zwaenepoel W., et al., "SILK: Preventing latency spikes in log-structured merge key-value stores," 2019 USENIX Annual Technical Conference (USENIX ATC 19), 753-766 (2019).
[15] Sarkar S., Papon T. I., Staratzis D., et al., "Lethe: A tunable delete-aware LSM engine," Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 893-908 (2020).
[16] Gong C., He S., Gong Y., et al., "On integration of appends and merges in log-structured merge trees," Proceedings of the 48th International Conference on Parallel Processing, 1-10 (2019).
[17] Dayan N., Idreos S., "Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging," Proceedings of the 2018 International Conference on Management of Data, 505-520 (2018).
[18] Special forest [EB/OL], https://github.com/google/LevelDB/issues/320 (2015).