Developing a Web Scraping Application with Bypass Blocking
Abstract
Web scraping is the process of extracting data from web pages by automating requests to websites. Its importance has grown with the development of the Internet: more than half of Internet traffic (excluding streaming, i.e. audio and video) is generated by automated means, so-called bots.
This article studies the web scraping process and the problem of websites blocking web scrapers. We consider the basic principles and concepts of web scraping and a classification of web scrapers. We review existing web scraping solutions, highlighting their main advantages and disadvantages with respect to bypassing blocking. We examine why websites block web scrapers and the signs by which they detect them. We investigate techniques for bypassing the blocking of web scrapers and their impact on the web scraping process.
We propose a program, written in Python, that uses techniques for bypassing the blocking of web scrapers. The program has a graphical interface, built with the Tkinter framework, for creating a web scraping policy. The blocking-bypass techniques are implemented with Selenium WebDriver, an open-source framework for automating user actions in a browser. A comparative analysis of web scrapers showed that the modules developed in this work make it possible to bypass the blocking of web scraping.
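The abstract does not say which techniques the authors' modules implement or what a web scraping policy contains. As a rough illustration only, the sketch below assumes two commonly discussed measures, randomized pauses between page loads and a per-session User-Agent chosen from a list, and drives the browser through Selenium WebDriver. The names ScrapePolicy, next_delay, pick_user_agent, build_driver and scrape, the field names, and the placeholder User-Agent strings are all hypothetical and are not taken from the article.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class ScrapePolicy:
    """Hypothetical scraping policy; the fields are illustrative assumptions."""
    min_delay_s: float = 2.0   # shortest pause between page loads
    max_delay_s: float = 6.0   # longest pause between page loads
    user_agents: list = field(default_factory=lambda: [
        "ExampleBrowser/1.0 (placeholder profile A)",
        "ExampleBrowser/1.0 (placeholder profile B)",
    ])


def next_delay(policy, rng):
    """Return a random pause in seconds so that requests are not evenly spaced."""
    return rng.uniform(policy.min_delay_s, policy.max_delay_s)


def pick_user_agent(policy, rng):
    """Choose one User-Agent string from the policy for this session."""
    return rng.choice(policy.user_agents)


def build_driver(policy, rng):
    """Start Chrome through Selenium WebDriver with the chosen User-Agent.

    Requires the selenium package and a local Chrome installation, so the
    import is done lazily and the rest of the sketch runs without them.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(f"--user-agent={pick_user_agent(policy, rng)}")
    return webdriver.Chrome(options=options)


def scrape(urls, policy, rng, sleep=time.sleep):
    """Load each URL in turn, pausing a random interval after each page."""
    driver = build_driver(policy, rng)
    pages = {}
    try:
        for url in urls:
            driver.get(url)
            pages[url] = driver.page_source
            sleep(next_delay(policy, rng))
    finally:
        driver.quit()  # always close the browser, even on errors
    return pages
```

With Selenium and Chrome installed, a call such as scrape(urls, ScrapePolicy(), random.Random()) would return a dictionary mapping each URL to its page source. In a program like the one described, the policy values would presumably come from the Tkinter interface rather than from defaults.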

This work is licensed under a Creative Commons Attribution 4.0 International License.