Features for Forming Text Corpus of Kazakhstan Electronic News

Abstract

The culture of online news consumption continues to take shape and gain popularity, and its readership is growing. At the same time, the number of people who fall under the negative influence of false news is also growing. This confronts researchers with the task of analyzing mass media. Areas of news content analysis include topic modelling, fake news recognition, and sentiment analysis. However, research in these areas requires a labelled corpus.
This paper presents the methodological foundations of the corpus formation. It describes the process of data collection and the selection of sources used to form the corpus. It also describes the theoretical foundations of representativeness and balance and explains how the corpus complies with these requirements. In the course of this work, the authors compiled a corpus of 1.9 million news texts from 22 news sources. They annotated the corpus and analyzed its thematic structure using the LDA model.
The resulting corpus will allow testing of machine learning algorithms aimed at recognizing individual informative features and identifying patterns present in the body of news publications. It will also be useful to machine learning and NLP researchers who wish to test algorithms for their own purposes.

Author Biographies

Ulzhan Abaevna Ospanova, “Information-Analytical Center”, JSC

Project Manager of the Department of Applied Research and Development, Master of Management

Mukhit Abilkasymovich Baimakhanbetov, “Information-Analytical Center”, JSC

Chief Analyst of the Department of Applied Research and Development

Inessa Georgievna Akoyeva, “Information-Analytical Center”, JSC

Chief Analyst of the Department of Applied Research and Development

Timur Kerimbekovich Buldybayev, “Information-Analytical Center”, JSC

Director of the Department of Applied Research and Development

Miraim Kazhmukhambetovna Atanayeva, “Information-Analytical Center”, JSC

Acting President of the “Information-Analytical Center”, JSC; Master of Public Administration

Published
2020-05-25
How to Cite
OSPANOVA, Ulzhan Abaevna et al. Features for Forming Text Corpus of Kazakhstan Electronic News. Modern Information Technologies and IT-Education, [S.l.], v. 16, n. 1, p. 90-98, may 2020. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/612>. Date accessed: 30 oct. 2025. doi: https://doi.org/10.25559/SITITO.16.202001.90-98.
Section
Research and development in the field of new IT and their applications