Collecting Profile Information Using Application Execution Traces for Static Optimizing Binary Translation

Abstract

Static binary optimization is one way to speed up an executable file without access to its source code. This approach is used in BOLT (Binary Optimization and Layout Tool), which requires a profile containing information about taken branches and mispredicted branches; on the x86 architecture this profile can be obtained from the LBR (Last Branch Record) hardware queue. BOLT uses this information to relocate code so as to reduce the number of instruction cache misses and instruction translation lookaside buffer (iTLB) misses. On x86 architectures, this optimization made it possible to speed up server applications by 8.0%. On the ARM architecture, use of this optimization is limited because it is frequently impossible to obtain profile information from hardware. This article describes methods and tools developed to obtain profile information from the application execution trace. Trace collection is implemented using dynamic binary instrumentation. The article describes an algorithm for restoring profile information using a branch predictor model. The changes made to BOLT to extend its ARM architecture support are also described. As a result of this work, the performance growth targets were achieved on synthetic tests and benchmarks.
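The idea of restoring LBR-like profile information from an execution trace can be illustrated with a minimal sketch. The abstract does not specify which branch predictor model the article uses, so this sketch assumes a simple 2-bit saturating counter per branch address. The trace record layout `(source, target, taken)` and the function name `restore_profile` are hypothetical, chosen only for illustration. The output counts, for each taken edge, how many times it was taken and how many of those were mispredicted by the model.

```python
from collections import defaultdict


def restore_profile(trace):
    """Rebuild per-edge taken/mispredicted counts from a branch trace.

    trace: iterable of (source_addr, target_addr, taken) records, one per
    executed conditional branch (hypothetical layout, not the article's).
    The predictor is an assumed 2-bit saturating counter per branch.
    """
    # Counter states 0..3; values >= 2 predict "taken".
    # Every branch starts at 1 (weakly not taken).
    counters = defaultdict(lambda: 1)
    profile = defaultdict(lambda: {"taken": 0, "mispred": 0})

    for source, target, taken in trace:
        predicted_taken = counters[source] >= 2
        if taken:
            # Like LBR, only taken branches produce profile records.
            record = profile[(source, target)]
            record["taken"] += 1
            if not predicted_taken:
                record["mispred"] += 1
            counters[source] = min(3, counters[source] + 1)
        else:
            # Not-taken branches only train the predictor model.
            counters[source] = max(0, counters[source] - 1)

    return dict(profile)


if __name__ == "__main__":
    # A loop back-edge at 0x40 jumping to 0x10: taken three times, then exit.
    loop_trace = [(0x40, 0x10, True)] * 3 + [(0x40, 0x44, False)]
    for (src, dst), counts in restore_profile(loop_trace).items():
        print(hex(src), hex(dst), counts["mispred"], counts["taken"])
```

A real tool would replace the counter with a model closer to the target hardware and would write the result in the profile format that BOLT consumes; both are outside the scope of this sketch.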

Author Biography

Sergey Alekseevich Lisitsyn, Huawei Russian Research Institute; Moscow Institute of Physics and Technology (National Research University)

Lead Engineer; Postgraduate Student

Published
2021-06-30
How to Cite
LISITSYN, Sergey Alekseevich. Collecting Profile Information Using Application Execution Traces for Static Optimizing Binary Translation. Modern Information Technologies and IT-Education, [S.l.], v. 17, n. 2, p. 369-378, June 2021. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/760>. Date accessed: 12 July 2025. DOI: https://doi.org/10.25559/SITITO.17.202102.369-378.
Section
Research and development in the field of new IT and their applications