<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">bsuir</journal-id><journal-title-group><journal-title xml:lang="ru">Доклады БГУИР</journal-title><trans-title-group xml:lang="en"><trans-title>Doklady BGUIR</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1729-7648</issn><issn pub-type="epub">2708-0382</issn><publisher><publisher-name>БГУИР</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.35596/1729-7648-2020-18-1-29-34</article-id><article-id custom-type="elpub" pub-id-type="custom">bsuir-2589</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ЭЛЕКТРОНИКА, РАДИОФИЗИКА, РАДИОТЕХНИКА, ИНФОРМАТИКА</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>ELECTRONICS, RADIOPHYSICS, RADIOENGINEERING, INFORMATICS</subject></subj-group></article-categories><title-group><article-title>A SEARCHING ALGORITHM FOR TEXT WITH MISTAKES</article-title><trans-title-group xml:lang="en"><trans-title>A SEARCHING ALGORITHM FOR TEXT WITH MISTAKES</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Nasr</surname><given-names>S.</given-names></name><name name-style="western" xml:lang="en"><surname>Nasr</surname><given-names>S.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Sara Nasr, PG student of Information Technologies in Automatized Systems Department</p></bio><bio xml:lang="en"><p>Sara Nasr, PG student of Information Technologies in Automatized Systems Department</p></bio><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>German</surname><given-names>O. V.</given-names></name><name name-style="western" xml:lang="en"><surname>German</surname><given-names>O. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>German Oleg Vitoldovich, PhD, Associate Professor of the Information Technologies in Automatized Systems Department</p><p>220600, Minsk, P. Brovki str. 6, +375-29-612-42-32</p></bio><bio xml:lang="en"><p>German Oleg Vitoldovich, PhD, Associate Professor of the Information Technologies in Automatized Systems Department</p><p>220600, Republic of Belarus, Minsk, P. Brovki str. 6, +375-29-612-42-32</p></bio><email xlink:type="simple">ovgerman@tut.by</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>The Belarussian State University of Informatics and RadioElectronics</institution></aff><aff xml:lang="en"><institution>The Belarussian State University of Informatics and RadioElectronics</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2020</year></pub-date><pub-date pub-type="epub"><day>06</day><month>03</month><year>2020</year></pub-date><volume>18</volume><issue>1</issue><fpage>29</fpage><lpage>34</lpage><permissions><copyright-statement>Copyright &amp;#x00A9; Nasr S., German O.V., 2020</copyright-statement><copyright-year>2020</copyright-year><copyright-holder xml:lang="ru">Nasr S., German O.V.</copyright-holder><copyright-holder xml:lang="en">Nasr S., German O.V.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://doklady.bsuir.by/jour/article/view/2589">https://doklady.bsuir.by/jour/article/view/2589</self-uri><abstract><p>The paper contains a new text searching method representing modification of the Boyer-Moore algorithm and enabling a user to find the places in the text where the given substring occurs maybe with possible errors, that is the string in text and a query may not coincide but nevertheless are identical. The idea consists in division of the searching process in two phases: at the first phase a fuzzy variant of the Boyer–Moore algorithm is performed; at the second phase the Dice metrics is used. The advantage of suggested technique in comparison with the known methods using the fixed value of the mistakes number is that it 1) does not perform precomputation of the auxiliary table of the sizes comparable to the original text sizes and 2) it more flexibly catches the semantics of the erroneous text substrings even for a big number of mistakes. This circumstance extends possibilities of the Boyer–Moore method by addmitting a bigger amount of possible mistakes in text and preserving text semantics. The suggested method provides also more accurate regulation of the upper boundary for the text mistakes which differs it from the known methods with fixed value of the maximum number of mistakes not depending on the text sizes. Moreover, this upper boundary is defined as Levenshtein distance not suitable for evaluating a relevance of the founded text and a query, while the Dice metrics provides such a relevance. In fact, if maximum Levenshtein distanse is 3 then how one can judge if this value is big or small to provide relevance of the search results. Consequently, the suggested method is more flexible, enables one to find relevant answers even in case of a big number of mistakes in text. The efficiency of the suggested method in the worst case is O(nc) with constant c defining the biggest allowable number of mistakes.</p></abstract><trans-abstract xml:lang="en"><p>The paper contains a new text searching method representing modification of the Boyer-Moore algorithm and enabling a user to find the places in the text where the given substring occurs maybe with possible errors, that is the string in text and a query may not coincide but nevertheless are identical. The idea consists in division of the searching process in two phases: at the first phase a fuzzy variant of the Boyer–Moore algorithm is performed; at the second phase the Dice metrics is used. The advantage of suggested technique in comparison with the known methods using the fixed value of the mistakes number is that it 1) does not perform precomputation of the auxiliary table of the sizes comparable to the original text sizes and 2) it more flexibly catches the semantics of the erroneous text substrings even for a big number of mistakes. This circumstance extends possibilities of the Boyer–Moore method by addmitting a bigger amount of possible mistakes in text and preserving text semantics. The suggested method provides also more accurate regulation of the upper boundary for the text mistakes which differs it from the known methods with fixed value of the maximum number of mistakes not depending on the text sizes. Moreover, this upper boundary is defined as Levenshtein distance not suitable for evaluating a relevance of the founded text and a query, while the Dice metrics provides such a relevance. In fact, if maximum Levenshtein distanse is 3 then how one can judge if this value is big or small to provide relevance of the search results. Consequently, the suggested method is more flexible, enables one to find relevant answers even in case of a big number of mistakes in text. The efficiency of the suggested method in the worst case is O(nc) with constant c defining the biggest allowable number of mistakes.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>searching algorithm</kwd><kwd>text processing</kwd><kwd>finding text with mistakes</kwd><kwd>fuzzy Boyer–Moore method</kwd><kwd>metrics of similarity</kwd></kwd-group><kwd-group xml:lang="en"><kwd>searching algorithm</kwd><kwd>text processing</kwd><kwd>finding text with mistakes</kwd><kwd>fuzzy Boyer–Moore method</kwd><kwd>metrics of similarity</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Manning Ch. D., Raghavan P., Schutze H. An introduction to information retrieval. Cambridge University Press; 2009.</mixed-citation><mixed-citation xml:lang="en">Manning Ch. D., Raghavan P., Schutze H. An introduction to information retrieval. Cambridge University Press; 2009.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Wu S., Manber U. Fast text searching allowing errors. Communication of the ACM. 2001;35:83-91.</mixed-citation><mixed-citation xml:lang="en">Wu S., Manber U. Fast text searching allowing errors. Communication of the ACM. 2001;35:83-91.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Gusfield D. Algorithms on strings, trees and sequences. Computer science and computational biology. Cambridge university press; 1997.</mixed-citation><mixed-citation xml:lang="en">Gusfield D. Algorithms on strings, trees and sequences. Computer science and computational biology. Cambridge university press; 1997.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Galil Z., Giancarlo R. Data structures and algorithms for approximate string matching. Journal of Complexity. 1988;4:33-72.</mixed-citation><mixed-citation xml:lang="en">Galil Z., Giancarlo R. Data structures and algorithms for approximate string matching. Journal of Complexity. 1988;4:33-72.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Gonnet G.H., Baeza-Yates R.A. Handbook of algorithms and data structures. Addison-Wesley. 1991.</mixed-citation><mixed-citation xml:lang="en">Gonnet G.H., Baeza-Yates R.A. Handbook of algorithms and data structures. Addison-Wesley. 1991.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Mandel I. [Cluster analysis]. M.: Finansy i statistika; 1988. (In Russ.)</mixed-citation><mixed-citation xml:lang="en">Mandel I. [Cluster analysis]. M.: Finansy i statistika; 1988. (In Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Gourine N.I., German O.V. [Intelligent query analyzer to knowledge base of the multimedia electronic tutorial]. Trudy BGTU=Proceedings of BSTU. 2010;18(6):167-170. (In Russ.)</mixed-citation><mixed-citation xml:lang="en">Gourine N.I., German O.V. [Intelligent query analyzer to knowledge base of the multimedia electronic tutorial]. Trudy BGTU=Proceedings of BSTU. 2010;18(6):167-170. (In Russ.)</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
