Constrained Approximate Search Algorithms in Knowledge Discovery
Abstract
Knowledge discovery in big data is one of the most important applications of computing machinery today. Search is essential part of all such procedures. Search algorithms must be extremely efficient, but at the same time knowledge discovery procedures must not produce too many false positives or false negatives. False positives require post-processing, which reduces the overall efficiency of the knowledge discovery procedures, while false negatives reduce the sensitivity of such procedures. To reduce the false positive and false negative rate, in this paper, constrained approximate search algorithms are proposed to be applied. An overview of search theory, exact and approximate, is given first, exposing fundamentals of dynamic programming-based and bit-parallel-based approximate search algorithms without constraints. Then, introduction of constraints specific for various knowledge discovery procedures is explained, together with the subtleties of various applications, such as SPAM filtering and digital and network forensics (file carving, intrusion detection in hosts and networks). Advantages and disadvantages of applications of such constrained search algorithms in knowledge discovery procedures are also discussed. A potential application in bioinformatics is outlined.
References
[2] Baeza-Yates R., Gonnet G.H. A new approach to text searching. Communications of the ACM. 1992; 35(10):74-82. (In Eng.) DOI: https://doi.org/10.1145/135239.135243
[3] Chitrakar A., Petrović S. Approximate search with constraints on indels with application in SPAM filtering. In: Proceedings of Norwegian Information Security Conference (NISK-2015),Ålesund, Norway; 2015. p. 22-33. Available at: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2380987 (accessed 21.1.2020). (In Eng.).
[4] Knuth D.E., Morris Jr. J.H., Pratt V.R. Fast Pattern Matching in Strings. SIAM Journal on Computing. 1977; 6(2): 323-350. (In Eng.) DOI: https://doi.org/10.1137/0206024
[5] Boyer R.S., Moore J.S. A fast string searching algorithm. Communications of the ACM. 1977; 20(10):762-772. (In Eng.) DOI: https://doi.org/10.1145/359842.359859
[6] Horspool R. Practical Fast Searching in Strings. Software: Practice and Experience. 1980; 10(6):501-506. (In Eng.) DOI: https://doi.org/10.1002/spe.4380100608
[7] Sunday D.M. A very fast substring search algorithm. Communications of the ACM. 1990; 33(8):132-142. (In Eng.) DOI: https://doi.org/10.1145/79173.79184
[8] Crochemore M., Rytter W. Text algorithms. Oxford University Press, Inc., USA; 1994. (In Eng.).
[9] Navarro G., Raffinot M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics. 2000; 5:4-es. (In Eng.) DOI: https://doi.org/10.1145/351827.384246
[10] Zhang Y., Liu P., Liu Y., Li A., Du C., Fan D. Attacking Pattern Matching Algorithms Based on the Gap between Average-Case and Worst-Case Complexity. Journal on Advances in Computer Network. 2013; 1(3):228-233. (In Eng.) DOI: https://doi.org/10.7763/JACN.2013.V1.45
[11] Sipser M. Introduction to the Theory of Computation, 2nd ed. Thomson, USA; 2006. (In Eng.).
[12] Levenshtein V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966; 10(8):707-710. (In Eng.).
[13] Wagner R.A., Fischer M.J. The String-to-String Correction Problem. Journal of the ACM. 1974; 21(1):168-173. (In Eng.) DOI: https://doi.org/10.1145/321796.321811
[14] Oommen B.J. Constrained String Editing. Information Sciences. 1986; 40(3):267-284. (In Eng.) DOI: https://doi.org/10.1016/0020-0255(86)90061-7
[15] Hyyrö H., Navarro G. Faster Bit-Parallel Approximate String Matching. In: A. Apostolico, M. Takeda (ed.) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol.2373. Springer, Berlin, Heidelberg; 2002. p. 203-224. (In Eng.) DOI: https://doi.org/10.1007/3-540-45452-7_18
[16] Wu S., Manber U. Fast text searching: allowing errors. Communications of the ACM. 1992; 35(10):83-91. (In Eng.) DOI: https://doi.org/10.1145/135239.135244
[17] Golić J.Dj., Mihaljević M.J. A generalized correlation attack on a class of stream ciphers based on the Levenshtein distance. Journal of Cryptology. 1991; 3(3):201-212. (In Eng.) DOI: https://doi.org/10.1007/BF00196912
[18] Schryen G. Anti-SPAM Measures: Analysis and Design. Springer, Berlin, Heidelberg; 2007. (In Eng.) DOI: https://doi.org/10.1007/978-3-540-71750-8
[19] Chitrakar A., Petrović S. Constrained Row-Based Bit-Parallel Search in Intrusion Detection. In: Proceedings of Norwegian Information Security Conference (NISK-2016), Bergen, Norway; 2016. p. 68-79. Available at: https://ojs.bibsys.no/index.php/NISK/article/view/375 (accessed 21.1.2020). (In Eng.).
[20] Sagar S., Sidorova J. Sequence Retriever for Known, Discovered and User-Specified Molecular Fragments. In: M.S. Mohamad, M.P. Rocha, F. Fdez-Riverola, F.J. Domínguez Mayo, J.F. De Paz (ed.) 10th International Conference on Practical Applications of Computational Biology & Bioinformatics. PACBB 2016. Advances in Intelligent Systems and Computing, vol. 477. Springer, Cham; 2016. p. 51-58. (In Eng.) DOI: https://doi.org/10.1007/978-3-319-40126-3_6
[21] Sidorova J., Fernandez A., Cester J., Rallo R., Giralt F. Predicting Biodegradable Quality of Chemicals with the TGI+.3 Classifier. In: Proceeding (717) Artificial Intelligence and Applications / 718: Modelling, Identification, and Control - 2011.Innsbruck, Austria; 2011. p. 108-115. (In Eng.) DOI: https://doi.org/10.2316/P.2011.717-044
[22] Sidorova J., Anisimova M. NLP-inspired structural pattern recognition in chemical application. Pattern Recognition Letters. 2014; 45:11-16. (In Eng.) DOI: https://doi.org/10.1016/j.patrec.2014.2.12
[23] Sidorova, J., Garcia, J.-2015. Bridging from syntactic to statistical methods: Classification with automatically segmented features from sequences. Pattern Recognition. 2015; 48:3749-3756. (In Eng.) DOI: https://doi.org/10.1016/j.patcog.2015.5.1
[24] Dietterich T.G., Bakiri G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research. 1995; 2:263-286. (In Eng.) DOI: https://doi.org/10.1613/jair.105
[25] Anand R., Mehrotra K., Mohan C.K., Ranka S. Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks. 1995; 6(1):117-124. (In Eng.) DOI: https://doi.org/10.1109/72.363444

This work is licensed under a Creative Commons Attribution 4.0 International License.
Publication policy of the journal is based on traditional ethical principles of the Russian scientific periodicals and is built in terms of ethical norms of editors and publishers work stated in Code of Conduct and Best Practice Guidelines for Journal Editors and Code of Conduct for Journal Publishers, developed by the Committee on Publication Ethics (COPE). In the course of publishing editorial board of the journal is led by international rules for copyright protection, statutory regulations of the Russian Federation as well as international standards of publishing.
Authors publishing articles in this journal agree to the following: They retain copyright and grant the journal right of first publication of the work, which is automatically licensed under the Creative Commons Attribution License (CC BY license). Users can use, reuse and build upon the material published in this journal provided that such uses are fully attributed.