Constrained Approximate Search Algorithms in Knowledge Discovery

Abstract

Knowledge discovery in big data is one of the most important applications of computing machinery today. Search is essential part of all such procedures. Search algorithms must be extremely efficient, but at the same time knowledge discovery procedures must not produce too many false positives or false negatives. False positives require post-processing, which reduces the overall efficiency of the knowledge discovery procedures, while false negatives reduce the sensitivity of such procedures. To reduce the false positive and false negative rate, in this paper, constrained approximate search algorithms are proposed to be applied. An overview of search theory, exact and approximate, is given first, exposing fundamentals of dynamic programming-based and bit-parallel-based approximate search algorithms without constraints. Then, introduction of constraints specific for various knowledge discovery procedures is explained, together with the subtleties of various applications, such as SPAM filtering and digital and network forensics (file carving, intrusion detection in hosts and networks). Advantages and disadvantages of applications of such constrained search algorithms in knowledge discovery procedures are also discussed. A potential application in bioinformatics is outlined.

Author Biographies

Slobodan Petrović, Norwegian University of Science and Technology

Professor of the Department of Information Security and Communication Technology, Faculty of Information Technology and Electrical Engineering, Ph.D. (Engineering)

Julia Sidorova, Blekinge Institute of Technology

Associate Professor of the Department of Computer Science, Ph.D. (Engineering)

References

[1] Navarro G., Raffinot M. Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge: Cambridge University Press; 2002. (In Eng.) DOI: https://doi.org/10.1017/CBO9781316135228
[2] Baeza-Yates R., Gonnet G.H. A new approach to text searching. Communications of the ACM. 1992; 35(10):74-82. (In Eng.) DOI: https://doi.org/10.1145/135239.135243
[3] Chitrakar A., Petrović S. Approximate search with constraints on indels with application in SPAM filtering. In: Proceedings of Norwegian Information Security Conference (NISK-2015),Ålesund, Norway; 2015. p. 22-33. Available at: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2380987 (accessed 21.1.2020). (In Eng.).
[4] Knuth D.E., Morris Jr. J.H., Pratt V.R. Fast Pattern Matching in Strings. SIAM Journal on Computing. 1977; 6(2): 323-350. (In Eng.) DOI: https://doi.org/10.1137/0206024
[5] Boyer R.S., Moore J.S. A fast string searching algorithm. Communications of the ACM. 1977; 20(10):762-772. (In Eng.) DOI: https://doi.org/10.1145/359842.359859
[6] Horspool R. Practical Fast Searching in Strings. Software: Practice and Experience. 1980; 10(6):501-506. (In Eng.) DOI: https://doi.org/10.1002/spe.4380100608
[7] Sunday D.M. A very fast substring search algorithm. Communications of the ACM. 1990; 33(8):132-142. (In Eng.) DOI: https://doi.org/10.1145/79173.79184
[8] Crochemore M., Rytter W. Text algorithms. Oxford University Press, Inc., USA; 1994. (In Eng.).
[9] Navarro G., Raffinot M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics. 2000; 5:4-es. (In Eng.) DOI: https://doi.org/10.1145/351827.384246
[10] Zhang Y., Liu P., Liu Y., Li A., Du C., Fan D. Attacking Pattern Matching Algorithms Based on the Gap between Average-Case and Worst-Case Complexity. Journal on Advances in Computer Network. 2013; 1(3):228-233. (In Eng.) DOI: https://doi.org/10.7763/JACN.2013.V1.45
[11] Sipser M. Introduction to the Theory of Computation, 2nd ed. Thomson, USA; 2006. (In Eng.).
[12] Levenshtein V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966; 10(8):707-710. (In Eng.).
[13] Wagner R.A., Fischer M.J. The String-to-String Correction Problem. Journal of the ACM. 1974; 21(1):168-173. (In Eng.) DOI: https://doi.org/10.1145/321796.321811
[14] Oommen B.J. Constrained String Editing. Information Sciences. 1986; 40(3):267-284. (In Eng.) DOI: https://doi.org/10.1016/0020-0255(86)90061-7
[15] Hyyrö H., Navarro G. Faster Bit-Parallel Approximate String Matching. In: A. Apostolico, M. Takeda (ed.) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol.2373. Springer, Berlin, Heidelberg; 2002. p. 203-224. (In Eng.) DOI: https://doi.org/10.1007/3-540-45452-7_18
[16] Wu S., Manber U. Fast text searching: allowing errors. Communications of the ACM. 1992; 35(10):83-91. (In Eng.) DOI: https://doi.org/10.1145/135239.135244
[17] Golić J.Dj., Mihaljević M.J. A generalized correlation attack on a class of stream ciphers based on the Levenshtein distance. Journal of Cryptology. 1991; 3(3):201-212. (In Eng.) DOI: https://doi.org/10.1007/BF00196912
[18] Schryen G. Anti-SPAM Measures: Analysis and Design. Springer, Berlin, Heidelberg; 2007. (In Eng.) DOI: https://doi.org/10.1007/978-3-540-71750-8
[19] Chitrakar A., Petrović S. Constrained Row-Based Bit-Parallel Search in Intrusion Detection. In: Proceedings of Norwegian Information Security Conference (NISK-2016), Bergen, Norway; 2016. p. 68-79. Available at: https://ojs.bibsys.no/index.php/NISK/article/view/375 (accessed 21.1.2020). (In Eng.).
[20] Sagar S., Sidorova J. Sequence Retriever for Known, Discovered and User-Specified Molecular Fragments. In: M.S. Mohamad, M.P. Rocha, F. Fdez-Riverola, F.J. Domínguez Mayo, J.F. De Paz (ed.) 10th International Conference on Practical Applications of Computational Biology & Bioinformatics. PACBB 2016. Advances in Intelligent Systems and Computing, vol. 477. Springer, Cham; 2016. p. 51-58. (In Eng.) DOI: https://doi.org/10.1007/978-3-319-40126-3_6
[21] Sidorova J., Fernandez A., Cester J., Rallo R., Giralt F. Predicting Biodegradable Quality of Chemicals with the TGI+.3 Classifier. In: Proceeding (717) Artificial Intelligence and Applications / 718: Modelling, Identification, and Control - 2011.Innsbruck, Austria; 2011. p. 108-115. (In Eng.) DOI: https://doi.org/10.2316/P.2011.717-044
[22] Sidorova J., Anisimova M. NLP-inspired structural pattern recognition in chemical application. Pattern Recognition Letters. 2014; 45:11-16. (In Eng.) DOI: https://doi.org/10.1016/j.patrec.2014.2.12
[23] Sidorova, J., Garcia, J.-2015. Bridging from syntactic to statistical methods: Classification with automatically segmented features from sequences. Pattern Recognition. 2015; 48:3749-3756. (In Eng.) DOI: https://doi.org/10.1016/j.patcog.2015.5.1
[24] Dietterich T.G., Bakiri G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research. 1995; 2:263-286. (In Eng.) DOI: https://doi.org/10.1613/jair.105
[25] Anand R., Mehrotra K., Mohan C.K., Ranka S. Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks. 1995; 6(1):117-124. (In Eng.) DOI: https://doi.org/10.1109/72.363444
Published
2020-05-25
How to Cite
PETROVIĆ, Slobodan; SIDOROVA, Julia. Constrained Approximate Search Algorithms in Knowledge Discovery. Modern Information Technologies and IT-Education, [S.l.], v. 16, n. 1, p. 41-49, may 2020. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/624>. Date accessed: 16 sep. 2025. doi: https://doi.org/10.25559/SITITO.16.202001.41-49.
Section
Theoretical Questions of Computer Science, Computer Mathematics