Features for Forming Text Corpus of Kazakhstan Electronic News
Abstract
The culture of online news consumption continues to take shape and gain popularity, and its audience is growing. At the same time, the number of readers who fall under the negative influence of false news is also growing. Researchers are therefore faced with the task of analyzing mass media. Areas of news content analysis include topic modelling, fake news recognition, and sentiment analysis. Research in these areas, however, requires a labelled corpus.
This paper presents the methodological foundations of corpus formation. It describes the data collection process and the selection of sources for the corpus. It also describes the theoretical foundations of representativeness and balance and explains how the corpus complies with these requirements. In the course of this work, the authors compiled a corpus of 1.9 million news texts from 22 news sources. They annotated the corpus and analyzed its thematic structure using the LDA model.
The resulting corpus will make it possible to test machine learning algorithms aimed at recognizing individual informative features and identifying patterns in the body of news publications. The corpus will also be useful to machine learning and NLP researchers who want to test algorithms for their own purposes.
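As an illustration of the kind of thematic analysis mentioned above, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, one standard inference method for the model. The abstract does not say which implementation, number of topics, or hyperparameters the authors used, so the function names (`lda_gibbs`, `top_words`), the values of `alpha`, `beta`, and `num_topics`, and the four toy documents are all assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                  # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]    # occurrences of word v in topic k
    topic_total = [0] * num_topics                                # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates of topic-word (phi) and document-topic (theta) distributions.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, phi, theta


def top_words(vocab, phi, n):
    """Return the n most probable words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in phi
    ]


# Toy, made-up documents standing in for tokenized news texts.
docs = [
    "bank rate inflation currency bank".split(),
    "currency exchange rate bank inflation".split(),
    "football match goal team football".split(),
    "team goal match championship football".split(),
]
vocab, phi, theta = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(vocab, phi, 3)):
    print(f"topic {k}: {', '.join(words)}")
```

For a corpus of the size reported here, a pure-Python sampler like this would be far too slow; it is meant only to show what the model estimates, namely a word distribution per topic and a topic distribution per document.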

This work is licensed under a Creative Commons Attribution 4.0 International License.
