Evaluation of the Temporal Efficiency of Big Data Storage Formats in the Dynamics of Data Growth
Abstract
When building a data lake on a platform such as Apache Hadoop, choosing the data storage format is an important design decision. The choice should rest on several criteria, one of which is the time it takes to run different queries against the stored data. However, any data processing system must assume that the volume of its data grows continuously, so the efficiency of each format has to be studied as that volume grows. This article proposes a methodology for assessing the efficiency of data storage formats in data lakes built on the Apache Hadoop platform as the amount of stored data grows. The proposed experiment consists of a series of queries of varying complexity run against data stored in the JSON, Apache Avro, ORC, and Apache Parquet formats. The Apache Spark framework was used to run the queries.
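The shape of such an experiment can be illustrated with a minimal timing harness. This is only a sketch under stated assumptions, not the authors' code: the names `time_query`, `benchmark`, and `placeholder_query` are hypothetical, the list of data volumes is invented for illustration, and the query itself is passed in as a callable so that the harness does not depend on a running cluster. In a real setup the callable would submit an Apache Spark job against the data set stored in the given format.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

# The four storage formats named in the abstract.
FORMATS = ("json", "avro", "orc", "parquet")


def time_query(run_query: Callable[[str, int], object],
               fmt: str, rows: int, repeats: int = 3) -> float:
    """Return the mean wall-clock time in seconds of run_query(fmt, rows)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(fmt, rows)
        samples.append(time.perf_counter() - start)
    return mean(samples)


def benchmark(run_query: Callable[[str, int], object],
              sizes: Iterable[int],
              formats: Iterable[str] = FORMATS,
              repeats: int = 3) -> Dict[str, List[Tuple[int, float]]]:
    """Time one query for every format at every data volume.

    Returns a mapping from format name to a list of (rows, mean_seconds)
    pairs, ordered by the data volumes in `sizes`.
    """
    formats = tuple(formats)
    results: Dict[str, List[Tuple[int, float]]] = {f: [] for f in formats}
    for rows in sizes:
        for fmt in formats:
            elapsed = time_query(run_query, fmt, rows, repeats)
            results[fmt].append((rows, elapsed))
    return results


def placeholder_query(fmt: str, rows: int) -> int:
    """Hypothetical stand-in for a real query.

    Assumption: in an actual experiment this would read the data set of the
    given size in the given format with Spark and run an aggregation on it.
    Here it only does some arithmetic so that the harness has work to time.
    """
    return sum(range(rows))


if __name__ == "__main__":
    # Illustrative data volumes only; the abstract does not state the sizes.
    for fmt, series in benchmark(placeholder_query, [10_000, 100_000]).items():
        for rows, seconds in series:
            print(f"{fmt:8s} rows={rows:>7d} mean={seconds:.6f}s")
```

Keeping the query behind a callable means the same loop can be reused for each query in the series, and the resulting per-format lists of (volume, time) pairs are the raw material for comparing how each format's query time changes as the data grows.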

This work is licensed under a Creative Commons Attribution 4.0 International License.