A Methodology for Vectorizing Computations Using Flat Loops

Alexey Anatolyevich Rybakov; Anton Dmitrievich Chopornyak; Alexander Sergeevich Shmelev

doi:10.25559/SITITO.022.202601.50-62

Alexey Anatolyevich Rybakov, National Research Center "Kurchatov Institute", http://orcid.org/0000-0002-9755-8830
Anton Dmitrievich Chopornyak, National Research Center "Kurchatov Institute", http://orcid.org/0000-0001-6617-5303
Alexander Sergeevich Shmelev, National Research Center "Kurchatov Institute", http://orcid.org/0000-0002-1941-7792

Abstract

Vectorization is an important low-level code optimization that can speed up supercomputer applications severalfold. All major modern microprocessor architectures support vector computation, and vector sizes tend to grow: today the maximum length is 512 bits, although instruction sets with variable vector length already make this length logically unbounded. At present the most promising vector instruction set is AVX-512, because it supports selective processing of vector elements by means of vector masks. This unique capability makes it possible to vectorize complex program contexts that contain control-transfer instructions, loop nests, and function calls. The article examines the problems of vectorizing program contexts of various kinds using the AVX-512 vector instruction set. The methods considered have been applied since the first microprocessor with AVX-512 support, the Intel Xeon Phi Knights Landing, which appeared in 2016. The research results have been used in a number of projects of the National Research Center "Kurchatov Institute" and the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS), and they summarize the experience of vectorizing program code gained since the AVX-512 instruction set appeared.
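To make the role of vector masks more concrete, the following is a minimal sketch in plain Python that emulates how a flat loop with a branch could be executed lane by lane under a mask. It is not the authors' code and does not use real AVX-512 instructions: the names `WIDTH`, `scalar_loop`, `masked_block`, and `vectorized_loop`, as well as the loop body itself, are assumptions chosen purely for illustration.

```python
# Illustrative emulation only: plain Python lists stand in for vector registers.
WIDTH = 16  # a 512-bit vector holds 16 single-precision values


def scalar_loop(x):
    """Reference flat loop: iterations are independent, the body has a branch."""
    return [v * 2.0 if v > 0.0 else -v for v in x]


def masked_block(x):
    """Process up to WIDTH elements the way a masked vector iteration would."""
    # Vector compare: one mask bit per lane.
    mask = [v > 0.0 for v in x]
    out = [0.0] * len(x)
    # The 'then' branch is executed only in lanes where the mask is set.
    for i, v in enumerate(x):
        if mask[i]:
            out[i] = v * 2.0
    # The 'else' branch is executed under the inverted mask.
    for i, v in enumerate(x):
        if not mask[i]:
            out[i] = -v
    return out


def vectorized_loop(x):
    """Walk the array in blocks of WIDTH lanes; the last block may be shorter."""
    out = []
    for start in range(0, len(x), WIDTH):
        out.extend(masked_block(x[start:start + WIDTH]))
    return out
```

In this sketch both branches are applied to the whole block, each under its own mask, so no per-element jump is needed. The shorter final block plays the role that a tail mask would presumably play on real hardware.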

About the Authors

Alexey Anatolyevich Rybakov, National Research Center "Kurchatov Institute"

Leading Researcher, Candidate of Physical and Mathematical Sciences

Anton Dmitrievich Chopornyak, National Research Center "Kurchatov Institute"

Senior Researcher

Alexander Sergeevich Shmelev, National Research Center "Kurchatov Institute"

Researcher
