Horizontably Scalable LSM-tree Based Index for Full-Text Search Boolean Queries

Abstract

Let there be a set of objects, where each object is characterized by Boolean features. Let logical query or Boolean query denote a query to find objects characterized by a Boolean function by features, for example, "documents with all words . . . and any words . . . and without words . . . ". The terms "object" and "document" are used interchangeably hereinafter. The features may have different semantics, i. e. some features may correspond to the words of the document, some to labels or categories, aand some to per-bit time quantum of a document’s date.
Although indices executing Boolean queries are well researched in the literature, the common technique of maintaining posting lists is not always acceptable. If the data volume can reach to the order of petabytes, a compact index structure becomes vital.
The aim of the research is to suggest a method to build an efficient bitmap index in secondary memory that allows us to update or append data being indexed with high write throughput.
We propose an efficient LSM-based index for bitmaps and feature design for practical applications. We also discuss aspects of building of joined indices in order to achieve good scalability. The paper describes the architecture and algorithms of a suggested index and includes results of our experiments that show sustainable performance of our solution.

Author Biography

Aleksei Mikhailovich Neganov, Moscow Institute of Physics and Technology (National Research University)

Postgraduate Student

References

1. Foster I., Kesselman C., Tuecke S. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. The International Journal of High Performance Computing Applications. 2001; 15(3):200-222. (In Eng.) doi: https://doi.org/10.1177/109434200101500302
2. Ponte J.M., Croft W.B. A language modeling approach to information retrieval. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'98). Association for Computing Machinery, New York, NY, USA; 1998. p. 275-281. (In Eng.) doi: https://doi.org/10.1145/290941.291008
3. Mattmann C.A. A vision for data science. Nature. 2013; 493(7433):473-475. (In Eng.) doi: https://doi.org/10.1038/493473a
4. Marx V. The big challenges of big data. Nature. 2013; 498(7453):255-260. (In Eng.) doi: https://doi.org/10.1038/498255a
5. Chen C.L.P., Zhang C.-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. 2014; 275:314-347. (In Eng.) doi: https://doi.org/10.1016/j.ins.2014.01.015
6. Anderson D., Dykes J., Riedel E. More Than an Interface ‒ SCSI vs. ATA. Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST'03). San Francisco, CA: USENIX Association; 2003. p. 245-257. Available at: https://www.usenix.org/conference/fast-03/more-interface%E2%80%94scsi-vs-ata (accessed 13.01.2022). (In Eng.)
7. Chen F., Koufaty D.A., Zhang X. Understanding Intrinsic Characteristics and System Implications of Flash Memory Based Solid State Drives. ACM SIGMETRICS Performance Evaluation Review. 2009; 37(1):181-192. (In Eng.) doi: https://doi.org/10.1145/2492101.1555371
8. Picoli I.L., et al. UFLIP-OC: Understanding Flash I/O Patterns on Open-Channel Solid-State Drives. Proceedings of the 8th Asia-Pacific Workshop on Systems (APSys'17). Association for Computing Machinery, New York, NY, USA; 2017. Article number: 20. p. 1-7. (In Eng.) doi: https://doi.org/10.1145/3124680.3124741
9. Hu X.-Y., et al. Write Amplification Analysis in Flash-based Solid State Drives. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference (SYSTOR'09). Association for Computing Machinery, New York, NY, USA; 2009. Article number: 10. p. 1-9. (In Eng.) doi: https://doi.org/10.1145/1534530.1534544
10. O'Neil P., et al. The Log-structured Merge-tree (LSM-tree). Acta Informatica. 1996; 33(4):351-385. (In Eng.) doi: https://doi.org/10.1007/s002360050048
11. Chang F., et al. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems. 2008; 26(2):4. (In Eng.) doi: https://doi.org/10.1145/1365815.1365816
12. DeCandia G., et al. Dynamo: amazon's highly available key-value store. ACM SIGOPS Operating Systems Review. 2007; 41(6):205-220. (In Eng.) doi: https://doi.org/10.1145/1323293.1294281
13. Hassan M.U., et al. A Comprehensive Study of HBase Storage Architecture ‒ A Systematic Literature Review. Symmetry. 2021; 13(1):109. (In Eng.) doi: https://doi.org/10.3390/sym13010109
14. Lakshman A., Malik P. Cassandra ‒ A Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review. 2010; 44(2):35-40. (In Eng.) doi: https://doi.org/10.1145/1773912.1773922
15. Luo C., Carey M.J. LSM-based storage techniques: a survey. The VLDB Journal. 2020; 29(1):393-418. (In Eng.) doi: https://doi.org/10.1007/s00778-019-00555-y
16. Dong S., et al. RocksDB: Evolution of Development Priorities in a Key-Value Store Serving Large-Scale Applications. ACM Transactions on Storage. 2021; 17(4):26. (In Eng.) doi: https://doi.org/10.1145/3483840
17. Alsubaiee S., et al. AsterixDB: A Scalable, Open Source BDMS. Proceedings of the VLDB Endowment. 2014; 7(14):1905-1916. Available at: https://www.vldb.org/pvldb/vol7/p1905-alsubaiee.pdf (accessed 13.01.2022). (In Eng.)
18. Lim H., Andersen D.G., Kaminsky M. Towards Accurate and Fast Evaluation of Multi-stage Log-structured Designs. Proceedings of the 14th Usenix Conference on File and Storage Technologies (FAST'16). Santa Clara, CA: USENIX Association; 2016. p. 149-166. Available at: https://www.usenix.org/system/files/conference/fast16/fast16-papers-lim.pdf (accessed 13.01.2022). (In Eng.)
19. Chambi S., et al. Better Bitmap Performance with Roaring Bitmaps. Software: Practice and Experience. 2016; 46(5):709-719. (In Eng.) doi: https://doi.org/10.1002/spe.2325
20. Lemire D., Ssi-Yan-Kai G., Kaser O. Consistently faster and smaller compressed bitmaps with Roaring”. Software: Practice and Experience. 2016; 46(11):1547-1569. (In Eng.) doi: https://doi.org/10.1002/spe.2402
21. Wang J., et al. An Experimental Study of Bitmap Compression vs. Inverted List Compression. Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD'17). Association for Computing Machinery, Chicago, Illinois, USA; 2017. p. 993-1008. (In Eng.) doi: https://doi.org/10.1145/3035918.3064007
22. Pinar A., Tao T., Ferhatosmanoglu H. Compressing Bitmap Indices by Data Reorganization. Proceedings of the 21st International Conference on Data Engineering (ICDE'05). USA: IEEE Computer Society; 2005. p. 310-321. (In Eng.) doi: https://doi.org/10.1109/ICDE.2005.35
23. Hsu Y.-F., et al. A Novel Automated Cloud Storage Tiering System through Hot-Cold Data Classification. 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE Press, San Francisco, CA, USA; 2018. p. 492-499. (In Eng.) doi: https://doi.org/10.1109/CLOUD.2018.00069
24. Dayarathna M., Wen Y., Fan R. Data Center Energy Consumption Modeling: A Survey. IEEE Communications Surveys & Tutorials. 2016; 18(1):732-794. (In Eng.) doi: https://doi.org/10.1109/COMST.2015.2481183
25. Liu S., Huang X., Fu H., Yang G. Understanding Data Characteristics and Access Patterns in a Cloud Storage System. 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. Delft, Netherlands; 2013. p. 327-334. (In Eng.) doi: https://doi.org/10.1109/CCGrid.2013.11
Published
2022-03-31
How to Cite
NEGANOV, Aleksei Mikhailovich. Horizontably Scalable LSM-tree Based Index for Full-Text Search Boolean Queries. Modern Information Technologies and IT-Education, [S.l.], v. 18, n. 1, p. 134-143, mar. 2022. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/848>. Date accessed: 03 july 2024. doi: https://doi.org/10.25559/SITITO.18.202201.134-143.
Section
Research and development in the field of new IT and their applications