Application of Probability-Entropy Approach to the Selection of Thematically Similar Documents in the Information System Military Administration

Abstract

Finding thematically similar documents is a text classification task and one of the most important areas of natural language processing. Solving it sorts data automatically into a predefined set of classes. The search for thematically similar documents and text classification are widely used in commercial applications such as spam filtering, decision-making, and information extraction from raw data. In special-purpose information systems, automatic text classification is used to process messages from open sources, which removes the need for slower and more expensive manual classification.
Currently, methods based on neural networks show the best results in automatic text classification. However, those results were obtained on test sets containing tens or hundreds of thousands of labeled documents and with a fixed set of classes. The article proposes a method for selecting thematically similar documents that relies on a reference set of only several dozen documents for each class. The reference set is represented as a ranked list of keywords and phrases (a list of key terms). The position of a term in this list (its rank) is determined by calculating several probabilistic-entropy indicators and summing them. Proximity to each class is then determined from the number of key terms of that class and their final weight in the document being classified.
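
The ranking and proximity steps described above can be illustrated with a minimal Python sketch. The abstract does not name the individual probabilistic-entropy indicators, so the two used here (the in-class probability of a term and one minus its normalized entropy across classes) are assumptions chosen for illustration, as are the function names, the `top_k` parameter, and the toy reference sets. It should be read as one possible reading of the approach, not as the authors' implementation.

```python
import math
from collections import Counter


def rank_key_terms(reference_sets, top_k=5):
    """Return {class: [(term, weight), ...]} sorted by descending weight.

    reference_sets maps a class name to a list of tokenized documents.
    Both indicators below are illustrative assumptions.
    """
    # Pool term counts over each class's reference documents.
    counts = {c: Counter(t for doc in docs for t in doc)
              for c, docs in reference_sets.items()}
    n_classes = len(counts)
    ranked = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        weights = {}
        for term, n in cnt.items():
            # Assumed indicator 1: probability of the term within the class.
            p_in_class = n / total
            # Assumed indicator 2: 1 - normalized entropy of the term's
            # distribution over classes (1.0 means it occurs in one class only).
            per_class = [counts[k][term] for k in counts]
            s = sum(per_class)
            entropy = -sum((x / s) * math.log(x / s) for x in per_class if x)
            specificity = 1.0 - entropy / math.log(n_classes) if n_classes > 1 else 0.0
            # The indicators are summed to give the term's final weight.
            weights[term] = p_in_class + specificity
        ranked[c] = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranked


def class_proximity(document, ranked):
    """Score a tokenized document against each class's key-term list.

    The score is the summed weight of the class's key terms present in the
    document, divided by the total weight of that class's key-term list.
    """
    tokens = set(document)
    scores = {}
    for c, terms in ranked.items():
        matched = sum(w for t, w in terms if t in tokens)
        total = sum(w for _, w in terms)
        scores[c] = matched / total if total else 0.0
    return scores


# Toy reference sets (hypothetical data, far smaller than several dozen documents).
reference_sets = {
    "logistics": [["fuel", "supply", "convoy"], ["supply", "depot", "fuel"]],
    "communications": [["radio", "signal", "relay"],
                       ["signal", "antenna", "radio", "supply"]],
}
ranked = rank_key_terms(reference_sets)
scores = class_proximity(["convoy", "fuel", "delay"], ranked)
best_class = max(scores, key=scores.get)
```

In this sketch a term shared between classes, such as "supply", receives a lower weight than a term confined to one class, which is the kind of behavior an entropy-based indicator is meant to capture. A real system would add lemmatization and phrase extraction before counting.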

Author Biographies

Vladimir Alexandrovich Popov, Military Academy of the Strategic Missile Forces named after Peter the Great

Department Adjunct

Dmitry Vladimirovich Krakhmalev, Financial University under the Government of the Russian Federation

Associate Professor of the Department of Business Informatics, Cand. Sci. (Eng.), Associate Professor

Mikhail Sergeevich Chipchagov, Financial University under the Government of the Russian Federation

Associate Professor of the Department of Data Analysis and Machine Learning, Cand. Sci. (Eng.)

Published
2022-12-20
How to Cite
POPOV, Vladimir Alexandrovich; KRAKHMALEV, Dmitry Vladimirovich; CHIPCHAGOV, Mikhail Sergeevich. Application of Probability-Entropy Approach to the Selection of Thematically Similar Documents in the Information System Military Administration. Modern Information Technologies and IT-Education, [S.l.], v. 18, n. 4, p. 821-828, dec. 2022. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/902>. Date accessed: 29 oct. 2025. doi: https://doi.org/10.25559/SITITO.18.202204.821-828.
