Stochastic Variational Inequality Method for Fine-Tuning Transformer Models

Abstract

Modern transformers such as BERT and RoBERTa are widely used in natural language processing tasks. During fine-tuning, especially when layers are partially unfrozen, classic optimizers based on stochastic gradient descent (SGD) run into problems: non-smooth activations and vanishing gradients slow training down and destabilize it. This work examines a method based on the stochastic variational inequality (SVI), which does not depend on activation derivatives, as a more robust and faster way to fine-tune transformers. The method is evaluated on the binary sentiment classification task SST-2.
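For orientation, the stochastic variational inequality named above is commonly stated as follows. This is the standard textbook form together with one basic stochastic projection iteration, not necessarily the formulation or update rule used by the authors. The abstract does not specify the operator, so the feasible set X, the operator F, the sampled map Φ, the step size γ_k, and the projection P_X are all assumed notation.

```latex
% Standard SVI problem (assumed notation): find x^* in X such that
\[
  \langle F(x^*),\, x - x^* \rangle \ge 0 \quad \text{for all } x \in X,
  \qquad F(x) = \mathbb{E}_{\xi}\bigl[\Phi(x, \xi)\bigr].
\]
% One basic stochastic projection step with sample \xi_k and step size \gamma_k > 0:
\[
  x_{k+1} = P_X\bigl(x_k - \gamma_k\, \Phi(x_k, \xi_k)\bigr).
\]
```

When F is the gradient of a smooth loss f, the inequality is the first-order optimality condition for minimizing f over X, and for X equal to the whole space it reduces to F(x^*) = 0. The variational inequality form also admits operators that are not gradients of any loss.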
The experiments use a pre-trained BERT model and the SST-2 dataset. Classic SGD and the SVI-based method are compared in two fine-tuning modes: updating only the classifier, and updating the classifier together with several top layers of the model. Performance is measured by accuracy on the training and validation sets over multiple runs to account for instability.
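The two fine-tuning modes can be illustrated with a minimal sketch that decides, by parameter name, which weights receive updates. The abstract does not name the implementation, so several things here are assumptions: the parameter names follow the Hugging Face `BertForSequenceClassification` convention, the model has 12 encoder layers, the helper `is_trainable` is hypothetical, and `top_k=2` is only a stand-in for the unspecified "several top layers". Keeping the embeddings and the pooler frozen is also a choice made for this sketch.

```python
import re

# Assumed naming convention: "bert.encoder.layer.<index>.<...>" for encoder
# layers and "classifier.<...>" for the classification head.
_LAYER_RE = re.compile(r"^bert\.encoder\.layer\.(\d+)\.")


def is_trainable(param_name: str, num_layers: int = 12, top_k: int = 0) -> bool:
    """Return True if the parameter should be updated (hypothetical helper).

    top_k = 0  -> classifier-only mode
    top_k > 0  -> classifier plus the top_k highest encoder layers
    """
    if param_name.startswith("classifier."):
        return True
    match = _LAYER_RE.match(param_name)
    if match is None:
        # Embeddings and pooler stay frozen in this sketch (an assumption).
        return False
    return int(match.group(1)) >= num_layers - top_k


# Illustrative parameter names only; a real model has many more.
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.10.output.dense.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
    "classifier.bias",
]

classifier_only = [n for n in names if is_trainable(n, top_k=0)]
top_two_layers = [n for n in names if is_trainable(n, top_k=2)]
```

With a PyTorch model, the same predicate could be applied by setting `param.requires_grad = is_trainable(name, top_k=k)` for each pair yielded by `model.named_parameters()`, so that either optimizer only touches the selected weights.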
SVI shows a notable advantage over SGD in both modes: when updating only the classifier, SVI yields substantially higher validation accuracy, and when unfreezing the top layers the advantage increases further. Additionally, SVI provides faster and more stable convergence and lower run-to-run variability.
SVI demonstrates significant benefits for fine-tuning transformers: improved accuracy, stable convergence, and low variance of results. The advantage is attributed to the method's independence from activation derivatives and, consequently, its avoidance of vanishing-gradient issues. SVI therefore appears to be a promising approach to fast and reliable fine-tuning of large language models. The main limitations of the approach are the additional implementation effort it requires and its limited support in existing deep-learning frameworks.

Author Biographies

Ivan Vladimirovich Sharun, Omsk State Technical University

Senior Lecturer of the Chair of Applied Mathematics and Fundamental Informatics, Information Technology and Computer Systems Faculty

Vladimir Petrovich Todarenko, Omsk State Technical University

Master's Degree Student of the Chair of Applied Mathematics and Fundamental Informatics, Information Technology and Computer Systems Faculty

Anna Vladimirovna Zykina, Omsk State Technical University

Head of the Chair of Applied Mathematics and Fundamental Informatics, Information Technology and Computer Systems Faculty, Dr. Sci. (Phys.-Math.), Professor

Published
2025-12-29
How to Cite
SHARUN, Ivan Vladimirovich; TODARENKO, Vladimir Petrovich; ZYKINA, Anna Vladimirovna. Stochastic Variational Inequality Method for Fine-Tuning Transformer Models. Modern Information Technologies and IT-Education, [S.l.], v. 21, n. 4, dec. 2025. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/1274>. Date accessed: 31 may 2026.
Section
Scientific software in education and science