A SEARCHING ALGORITHM FOR TEXT WITH MISTAKES

. The paper contains a new text searching method representing modification of the Boyer-Moore algorithm and enabling a user to find the places in the text where the given substring occurs maybe with possible errors, that is the string in text and a query may not coincide but nevertheless are identical. The idea consists in division of the searching process in two phases: at the first phase a fuzzy variant of the Boyer–Moore algorithm is performed; at the second phase the Dice metrics is used. The advantage of suggested technique in comparison with the known methods using the fixed value of the mistakes number is that it 1) does not perform precomputation of the auxiliary table of the sizes comparable to the original text sizes and 2) it more flexibly catches the semantics of the erroneous text substrings even for a big number of mistakes. This circumstance extends possibilities of the Boyer–Moore method by addmitting a bigger amount of possible mistakes in text and preserving text semantics. The suggested method provides also more accurate regulation of the upper boundary for the text mistakes which differs it from the known methods with fixed value of the maximum number of mistakes not depending on the text sizes. Moreover, this upper boundary is defined as Levenshtein distance not suitable for evaluating a relevance of the founded text and a query, while the Dice metrics provides such a relevance. In fact, if maximum Levenshtein distanse is 3 then how one can judge if this value is big or small to provide relevance of the search results. Consequently, the suggested method is more flexible, enables one to find relevant answers even in case of a big number of mistakes in text. The efficiency of the suggested method in the worst case is O ( nc ) with constant c defining the biggest allowable number of mistakes.


Introduction
Text searching methods are widely required in modern text-based applications. Let us note CV (Curriculum Vitae) and paper abstracts processing, extracting an invention formula description from patents, e-mails filtering and so on. The problem is stated generally as answering queries to some text(s). To find an answer different text searching techniques can be applied [1,2]. Clearly, the efficiency of a text searching technique essentially depends on the text sizes, its structure and presence of syntax mistakes. From this viewpoint, the methods based on text tagging are not suitable if text is poorly organized and unstructured. The keyword-based seach methods are not efficient in large text databases, for instance, on web-servers proposing job offers with billions of files. The regular expression based methods may be not efficient if text contains mistakes as it is difficult (if not possible) to envisage the concrete mistakes in text of different topics. So, we selected sequential text searching approach which seems to be most reasonable in the case of short, poorly organized texts with different topics and especially in case with a lot of text files to be processed. It is the Boyer-Moore method [3] we selected as a prototype in our approach. It can be efficiently used to find a substring in the text with computational complexity estimated as O(n+m) where n is size of the text and m stands for a query length. However, this method presumes exact coincidence of the original substring and those which are being looked for. The methods to find substring occurences in text with mistakes can be found in [4,5]. They are different from the Boyer-Moore scheme though. These methods suggest that some constant k  0 restricts the utmost mistakes amount. However, it is difficult to define the value of k a priori especially if text is unknown and overloaded with mistakes. We suggest to use two constants k2 > k1  0 and organize a fuzzy searching procedure as follows. The modified Boyer-Moore procedure continues till the number of the detected mistakes not exceeds k2, otherwise the right-side shift of the query is performed with computed shift value. If the number of mistakes does not exceed k1 then the searched text word is recognized by means of the modified Boyer-Moore method. If the number of mistakes lays in the diapazone [k1 +1, k2] then the text word is processed with the help of fuzzy comparison based on Dice metrics [6]. If Dice metrics is greater than 0.5 then the answer is found. Otherwise the answer is not found and right-side shift is performed accordingly to the modified Boyer-Moore strategy. We consider a query string as a correct one and modify currently observed substring in the text as a copy of the query. We use Levenstein distance between the strings to estimate their similarity. It is used as a marker signalling that strings (or their parts) are equal or different. The main problem is to define the value of the shift (the number of symbols to omit) to be made if the compared substrings are different. The rest of the paper explains the details of how this and the other details are realized.

The Boyer-Moore algorithm illustration
To be more concrete let us consider the following example. Text string: awbsycsdesacscdssqwass. Substring to find: sacs. Let's consider an approach of Boyer-Moore [3]. First, in the substring to find one defines the number of occurrences of each symbol and their corresponding positions. Thus, with respect to our substring one has: <c -1(2); a -1(3), s -2(1,4)>. In the round brackets the symbol offsets from the end of the word are placed, for instance, c -1(2) means that symbol c is encountered only once in the word sacs and takes the second position from the right. At the second step the searching procedure starts by sequential comparison of the last symbol in sacs (i.e.  s) with the currently observed symbol in the Text string until the coincidence takes place. Thus, we have:

a w b s y desacscdssqwass s a c s
Now let's compare the corresponding symbols from right to left. Because c < > b the procedure stops. It is possible to shift word sacs to the right by 3 positions to superpose the first symbol s in sacs with the currently observed symbol s. Indeed, no other symbols s besides the leftmost and the rightmost ones in sacs are available. So the next comparison is realized as below awb s y c s desacscdssqwass s a c s etc.

Mistakes in the strings and their correction
The next errors in words are possible: -S1: a letter is wrong in the given position; -S2: a letter is omitted;

31
-S3: extra letter is inserted; -S4: two adjacent letters are interchanged We need some criteria to reveal the above pointed states. Our consideration is restricted with these situations only.
Situation S1. To reveal this situation one needs to read the corresponding next symbols in both compared strings. If those symbols coincide then the resulting answer is YES (true) otherwise -NO (false). The NO answer is generated also if there are no next symbols (in both or some of strings).
While comparing the strings we compute the Levenstein distance between them. We use this distance to come to conclusion that two substrings are different or not. For this, we base ourselves on the experimental results [7] presented in Table 1 and giving value of k1 We, however, change Table 1 to define k2 as in Table 2. The text errors here correspond to situations S1, S2, S3, S4 described above. The situation S4 is considered as a single error.

Modified Boyer-Moore algorithm
The idea of method based on [4,5] can be explained as follows. We try to simulate the Boyer-Moore algorithm comparing symbol by symbol from right to left in both strings with query under the text. The query string is supposed to be written correctly. If symbols do not coincide we admit that some mistake is encountered in the CV-text. We define a type of mistake (situations S1, S2, S3, S4). According to the Table 2 we test if the number of mistakes is still admissable, that is lies in diapason Dp = [0, k2]. If the answer is positive, we make suitable shifts in both strings. In this case we replace an erroneous place in the CV-text by a supposedly correct symbol from the query. If the number of mistakes is beyond the diapason Dp we restore the corrected mistakes in the CV-text and make a right-side shift of the query. We explain how to define the shift value below by means of example.
CV-text: x x x ... x      ...  query:  We use the symbol  to designate exactly coinciding symbol with the corresponding one in the query; the symbol  is used to designate a mistaken symbol which was changed;  (x) is used to designate the rest right-side (left-side) symbols in the currently observed part of CV. We admit maximum 3 non-coincidences in both substrings. There are 3 mistakes in the example above. So we should compute the Dice distance (metrics) between two strings. Suppose, that the Dice metrics is less than 0.5. This means that we need to make a right-side shift of the query. Now let us suppose that there are no the same (repeating) symbols in the query. Our task is to define the shift value. Let us make one-symbol right-side shift of the query. The situation will change as shown below We designated repeating symbol as . One can apply the above given technique for the nonrepeating symbols. As before one should define a new position of * with two (or less) pairs    in both strings after *. The desired configuration is shown below Again it required of us to make 5 shifts. Notice that configuration preceding the previous one was that one below with 3 mistakes what exceeds the allowed number of mistakes CV-text: x x x ... x   ...  query: * Here are 3 mistakes and one more right shift is required. We estimate the worst case complexity of the method for the unrepeating symbols case admitting that the mean number of mistakes allowed is c.
The worst case means c + 1ending miscorrespondings with only one shift to the right each time. This gives n•(c+1) comparisons at the worst. Suppose the query length is k > 3. Then at the worst we make k comparisons and 3 shifts to the right. We conclude these considerations with the worst case shown below has P = 2•3 / (5+5) = 0.6. We adopt the rule accordingly to which two words are considered alike if the Dice measure is 0.5 or greater. From this, in the example above the words X and Y are considered to be the same.
Summarization of the searching procedure can be represented in the next form. 1. At each iteration the procedure compares symbol by symbol from right to left in both strings with query under the text (the query string is supposed to be written correctly).
2. If all symbols in the query have been compared with CV-text fragment (cvf) then. 2.1 If the number of mistakes encountered is in diapason Dp = [0, k1] then cvf is accepted as successfully recognized and searching procedure resumes from the next right symbol following cvf (if any, otherwise the procedure finishes).
2.2 If the number of mistakes encountered is in diapason Dp = [k1+1, k2] then cvf is compared with query by means of Dice metrics. If comparison is successfull the cvf is accepted as successfully recognized and searching procedure resumes from the next right symbol following cvf (if any, otherwise the procedure finishes).
2.3 Otherwise (1.1 and 1.2 failed) the algorithm restores the corrected mistakes in the CV-text and makes a right-side shift of the query with resuming searching procedure from step 5.
3. If symbols do not coincide then some mistake is encountered in the CV-text. If the number of mistakes is still admissable, that is, lies in diapason Dp = [0, k2] the algorithm produces suitable shifts in both strings and replaces an erroneous place in the CV-text by a supposedly correct symbol from the query. Return to step 2.
4. If the number of mistakes is beyond the diapason Dp the algorithm restores the corrected mistakes in the CV-text and makes a right-side shift of the query.
5. If the rightmost symbol of the query is beyond CV-text then stop searching procedure, otherwise return to step 2.

Conclusion
The suggested method combines techniques of the fuzzy Boyer-Method with verifying string similarity on the basis of Dice metrics. This provides some important advantages in comparison with the Boyer-Moore method as it gives a possibility to increase the number of mistakes in the text and still to preserve the semantics of the original text. The other advantage of the suggested approach in comparison with the known fuzzy versions of the Boyer-Moore method consists in more sensitive regulation of the allowable upper number of errors in the word to be recognized. In existing approaches the number of mistakes is fixed and does not depend on the text length. Moreover, this number is defined in the procedure as Levenstein metrics (i.e. as absolute number of mistakes). This does not help to make a conclusion about processed text relevance to a query (on the contrary, the Dice metrics gives a clear answer). Indeed, if the Levenstein distance is equal to 3, then how one can judge if this number is big or, on the contrary, low. Therefore, the suggested method enables one to decrease k1 and even increase k2 to make the recognition process more efficient. The investigation of this and related questions remain the future landmarks of the suggested approach.