Personal Data Recognition Using a Deep Learning Model

Nikita Grigorievich Babak

doi:10.25559/SITITO.020.202401.13-26

Nikita Grigorievich Babak, National Research University "MPEI"; Sberbank PJSC, http://orcid.org/0000-0001-7129-1018

Abstract

Protecting personal data is a pressing problem today, because people leave traces of their activity on social networks and other digital platforms. Attackers can use these data to steal personal information and commit fraud, so methods for protecting personal data need to be developed. Recognizing personal data in order to protect them is difficult, however: there are many different personal data attributes, such as surnames and phone numbers, and the data can appear in different formats, such as tables or unstructured text. Various recognition methods are used to solve this problem, the most common being rule-based algorithms. They determine which data are personal on the basis of predefined rules such as regular expressions and dictionaries. Such algorithms can be insufficiently flexible, however, and cannot always handle complex cases. Another approach uses deep learning models, which are trained on large volumes of data and can adapt better to varied data. In this work, deep learning models with different neural network architectures are implemented and compared with rule-based algorithms. The feasibility of using a large language model for personal data recognition is also investigated. The result is a personal data recognition method that combines a language model with rule-based algorithms and can recognize personal data in both structured and unstructured information. The work demonstrates the need to protect personal data and the possibility of using artificial intelligence models to solve this task.
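To make the rule-based approach mentioned in the abstract concrete, the following is a minimal Python sketch of recognition with regular expressions and a dictionary. It is not the author's implementation: the patterns, the `SURNAMES` dictionary, the labels, and the function name `find_personal_data` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules, for illustration only; a real system would need far
# broader patterns and dictionaries than these.
PHONE_RE = re.compile(r"(?:\+7|8)[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters in any alphabet
SURNAMES = {"ivanov", "petrov", "sidorov"}  # assumed toy dictionary


def find_personal_data(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    found = []
    # Rule type 1: regular expressions for strictly formatted attributes.
    for label, pattern in (("PHONE", PHONE_RE), ("EMAIL", EMAIL_RE)):
        for m in pattern.finditer(text):
            found.append((label, m.group(), m.start(), m.end()))
    # Rule type 2: dictionary lookup, skipping words that lie inside a span
    # already claimed by a regular expression (e.g. a surname in an e-mail).
    spans = [(start, end) for _, _, start, end in found]
    for m in WORD_RE.finditer(text):
        inside = any(s <= m.start() and m.end() <= e for s, e in spans)
        if not inside and m.group().lower() in SURNAMES:
            found.append(("SURNAME", m.group(), m.start(), m.end()))
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    sample = "Contact Ivanov at +7 (495) 123-45-67 or ivanov@example.com"
    for hit in find_personal_data(sample):
        print(hit)
```

The sketch also hints at the inflexibility the abstract describes: a surname missing from the dictionary, or a phone number written in an unexpected format, goes undetected unless a new rule is added by hand.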

About the Author

Nikita Grigorievich Babak, National Research University "MPEI"; Sberbank PJSC

Postgraduate student, Department of Computing Machines, Systems and Networks, Institute of Information and Computing Technologies; Chief Data Protection Expert, Cybersecurity Department
