Evaluation of the Temporal Efficiency of Big Data Storage Formats in the Dynamics of Data Growth

Abstract

When building a data lake on a platform such as Apache Hadoop, the choice of data storage format is an important decision. The choice should rest on several criteria, one of which is the time required to run different queries against the data. However, the volume of data in any data processing system must be expected to keep growing, so the efficiency of each format needs to be studied as the amount of stored data increases. This article proposes a methodology for assessing the efficiency of data storage formats in data lakes built on the Apache Hadoop platform as data volume grows. The proposed experiment consists of a series of queries of varying complexity against data stored in the JSON, Apache Avro, ORC, and Apache Parquet formats. The Apache Spark framework was used to run the queries.
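
The kind of measurement described above can be illustrated with a minimal timing harness. This is only a sketch under stated assumptions, not the authors' experimental code: the function names (`time_query`, `growth_profile`, `fake_query`), the number of repeats, and the data sizes are all hypothetical, and the workload is a stand-in that does not touch Spark or any storage format.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Formats named in the abstract; the lowercase labels are an assumption.
FORMATS = ["json", "avro", "orc", "parquet"]


def time_query(run_query: Callable[[str, int], object],
               fmt: str, n_rows: int, repeats: int = 3) -> float:
    """Return the mean wall-clock time (seconds) of run_query(fmt, n_rows)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(fmt, n_rows)
        samples.append(time.perf_counter() - start)
    return mean(samples)


def growth_profile(run_query: Callable[[str, int], object],
                   sizes: Sequence[int],
                   formats: Sequence[str] = FORMATS,
                   repeats: int = 3) -> Dict[str, List[float]]:
    """Map each format to its mean query times, one entry per data size."""
    return {
        fmt: [time_query(run_query, fmt, n, repeats) for n in sizes]
        for fmt in formats
    }


def fake_query(fmt: str, n_rows: int) -> int:
    # Hypothetical stand-in workload: it ignores fmt and only scales with
    # n_rows. A real experiment would read the dataset stored in fmt and
    # run an actual query against it.
    return sum(range(n_rows))


if __name__ == "__main__":
    profile = growth_profile(fake_query, sizes=[10_000, 100_000, 1_000_000])
    for fmt, times in profile.items():
        print(fmt, [f"{t:.4f}s" for t in times])
```

With PySpark, `run_query` could wrap a call such as `spark.read.format(fmt).load(path)` followed by an action like `.count()`, where `path` points to a copy of the dataset of the given size. Reading Avro this way generally requires the external `spark-avro` module to be available to the Spark session.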

Author Biographies

Vladimir Alexandrovich Belov, MIREA – Russian Technological University

Postgraduate Student

Evgeny Vitalyevich Nikulchev, MIREA – Russian Technological University

Professor of the Intelligent Cyber-Security System Department, Dr.Sci. (Tech.), Professor

Published

2021-12-20

How to Cite

BELOV, Vladimir Alexandrovich; NIKULCHEV, Evgeny Vitalyevich. Evaluation of the Temporal Efficiency of Big Data Storage Formats in the Dynamics of Data Growth. Modern Information Technologies and IT-Education, [S.l.], v. 17, n. 4, p. 889-895, dec. 2021. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/809>. Date accessed: 22 aug. 2025. doi: https://doi.org/10.25559/SITITO.17.202104.889-895.