Personal Data Recognition Using a Deep Learning Model
Abstract
Protecting personally identifiable information is a pressing problem: individuals leave traces of their activities on social media and other digital platforms, and attackers can exploit these traces for identity theft and fraud. Effective methods for personal data protection are therefore needed. Recognizing the personal data to be protected is itself a significant challenge, because personal data attributes such as names and phone numbers are diverse and can appear in various formats, from tables to unstructured text. A range of techniques is employed for personal data recognition, and rule-based algorithms are the most widely used. These algorithms identify personal data using predefined rules, such as regular expressions and dictionaries. However, they may lack the flexibility required to handle complex cases. An alternative is to use deep learning models, which are trained on large datasets and can adapt to diverse forms of data. In this paper, deep learning models with different neural network architectures were implemented and compared against rule-based algorithms. The feasibility of using a large language model for personal data recognition was also explored. The research resulted in a personal data recognition method that combines an artificial intelligence language model with rule-based algorithms and can identify personal data in both structured and unstructured information. The paper underscores the importance of personal data protection and highlights the potential of artificial intelligence models for addressing this problem.
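To make the rule-based approach mentioned in the abstract concrete, the following is a minimal sketch of recognition with regular expressions and a dictionary. It is not the method developed in the paper: the patterns, the name list, the labels, and the function name `find_personal_data` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules for illustration only; real systems need far
# broader patterns and dictionaries than these.
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NAME_DICTIONARY = {"alice", "bob", "ivan"}  # assumed sample names


def find_personal_data(text):
    """Return (label, start, end, value) tuples sorted by position."""
    found = []

    # Rule type 1: regular expressions for attributes with a fixed shape.
    for label, pattern in (("PHONE", PHONE_RE), ("EMAIL", EMAIL_RE)):
        for match in pattern.finditer(text):
            found.append((label, match.start(), match.end(), match.group()))

    # Rule type 2: dictionary lookup over alphabetic tokens, skipping
    # tokens that lie inside a span already matched by a regex.
    covered = [(start, end) for _, start, end, _ in found]
    for match in re.finditer(r"[^\W\d_]+", text):
        inside = any(s <= match.start() and match.end() <= e for s, e in covered)
        if not inside and match.group().lower() in NAME_DICTIONARY:
            found.append(("NAME", match.start(), match.end(), match.group()))

    return sorted(found, key=lambda hit: hit[1])


if __name__ == "__main__":
    sample = "Call Ivan at +7 912 345-67-89 or mail ivan@example.com"
    for hit in find_personal_data(sample):
        print(hit)
```

The sketch also hints at the limitation the abstract points out: a name missing from the dictionary, or a phone number written in an unexpected format, is silently missed, which is the kind of case a trained model may handle better.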

This work is licensed under a Creative Commons Attribution 4.0 International License.
