VECTORIZATION OF OPERATIONS ON SMALL-DIMENSIONAL MATRICES FOR INTEL XEON PHI KNIGHTS LANDING PROCESSOR

Abstract

This article is devoted to the vectorization of computations for the Intel Xeon Phi Knights Landing (KNL) processor. Operations on small-dimensional matrices are taken as the target of optimization. Such operations are widespread in computational codes across many fields of research, for example in computational fluid dynamics. KNL is the latest Intel Xeon Phi processor; it contains up to 72 compute cores and supports massively parallel applications. KNL processors offer a wide range of features for efficient supercomputer calculations; in particular, they support several memory and cluster modes. In many cases the compiler is unable to generate high-performance parallel vectorized code, which leads to performance losses. One way to recover this performance is to vectorize the hot blocks of the code manually, which speeds up the entire application. An important step in optimizing a program for KNL processors is the use of special 512-bit vector instructions, which can significantly increase execution speed. A 512-bit vector instruction processes a vector of 16 single-precision floating-point values. Special fused multiply-add instructions combine componentwise multiplication and addition of such vectors in a single operation. Intrinsic functions are used to simplify manual vectorization of program code; in effect, these functions are thin wrappers over the processor instructions. Vectorizing matrix operations with intrinsic functions reduced the execution time of these operations by 23% to 70% compared with the version compiled by the Intel compiler at the maximum optimization level. These results reveal additional hidden performance reserves that applications can gain through manual optimization of the source code.
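
As a minimal sketch of the idea described in the abstract, the fragment below applies one fused multiply-add to 16 single-precision values, which is the size of a 4x4 matrix stored contiguously. It is an illustration under stated assumptions, not the authors' code: the function name `fma16` and the constant `LANES` are hypothetical, and the intrinsic path is compiled only when the compiler defines `__AVX512F__` (for example, when AVX-512 code generation is enabled). Otherwise a plain scalar loop is used.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX512F__)
#include <immintrin.h>  /* AVX-512 intrinsics */
#endif

/* 512 bits / 32 bits per float = 16 lanes (hypothetical name). */
enum { LANES = 16 };

/*
 * Componentwise out[i] = a[i] * b[i] + c[i] for 16 floats.
 * Hypothetical helper for illustration; the arrays may be unaligned.
 */
void fma16(const float *a, const float *b, const float *c, float *out)
{
    assert(a && b && c && out);
#if defined(__AVX512F__)
    /* Each intrinsic maps to one vector instruction. */
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);
    _mm512_storeu_ps(out, _mm512_fmadd_ps(va, vb, vc));
#else
    /* Scalar fallback: a separate multiply and add, so the rounding
       may differ slightly from the fused instruction. */
    for (size_t i = 0; i < LANES; ++i)
        out[i] = a[i] * b[i] + c[i];
#endif
}
```

With the intrinsic path, the whole update is three loads, one fused multiply-add, and one store, instead of 16 scalar multiply-add pairs. The unaligned load and store variants are chosen here only to keep the sketch free of alignment requirements.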

Author Biographies

Леонид Александрович Бендерский, Scientific Research Institute for System Analysis of the Russian Academy of Sciences, SRISA

Senior Researcher, Joint Supercomputer Center of the Russian Academy of Sciences

Сергей Алексеевич Лещев, Scientific Research Institute for System Analysis of the Russian Academy of Sciences, SRISA

Research associate, Joint Supercomputer Center of the Russian Academy of Sciences

Алексей Анатольевич Рыбаков, Scientific Research Institute for System Analysis of the Russian Academy of Sciences, SRISA

Candidate of Physical and Mathematical Sciences, Lead Researcher, Joint Supercomputer Center of the Russian Academy of Sciences

Published
2018-03-30
How to Cite
БЕНДЕРСКИЙ, Леонид Александрович; ЛЕЩЕВ, Сергей Алексеевич; РЫБАКОВ, Алексей Анатольевич. VECTORIZATION OF OPERATIONS ON SMALL-DIMENSIONAL MATRICES FOR INTEL XEON PHI KNIGHTS LANDING PROCESSOR. Modern Information Technologies and IT-Education, [S.l.], v. 14, n. 1, p. 73-90, mar. 2018. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/343>. Date accessed: 27 sep. 2025. doi: https://doi.org/10.25559/SITITO.14.201801.073-090.
Section
Parallel and distributed programming, grid technologies, programming on GPUs