Torah puzzle | Step 4-5 | Comparing two words |
[ 6 78064 ] [*** -2 -2 -2 ] [ 23046 -2037 12861 23046 0 54131 ] [ 49191 988 49191 54131 0 73576 ] [ 73471 21 73471 73576 0 78064 ] [*** -2 -2 -1 ] [ 923 950 923 5673 0 78064 ] [ 11942 -2238 752 11942 752 753 ] [ 14060 3846 14060 33290 14060 14061 ] [ 32123 -2030 21973 32123 923 78064 ] [ 32788 -3273 16423 32788 16423 16424 ] [ 37394 3289 37394 53839 21973 57788 ] [ 46623 -5556 18843 46623 18843 18844 ] [ 57788 -3282 41378 57788 21973 78064 ] [ 62337 -9742 13627 62337 13627 13628 ] [ 70073 -13255 3798 70073 3798 3799 ] [ 70397 -12761 6592 70397 6592 6593 ] [ 77016 -3843 57801 77016 41378 78064 ] [*** -2 -2 0 ] [ 29601 2480 29601 42001 0 60129 ] [ 38926 -6373 7061 38926 0 42001 ] . . . . . . . . . .and for the word "HYNTM" (Matanya):
[ 5 78064 ] [*** -2 -2 -2 ] [ 7675 -24 7579 7675 0 11299 ] [ 11247 13 11247 11299 0 12217 ] [ 12197 5 12197 12217 0 78064 ] [ 17451 13 17451 17503 12197 78064 ] [ 43802 33 43802 43934 17451 46903 ] [ 46839 16 46839 46903 17451 78064 ] [ 59073 -22 58985 59073 46839 78064 ] [ 70939 24 70939 71035 58985 78064 ] [*** -2 -2 -1 ] [ 1666 -35 1526 1666 0 5835 ] [ 5759 19 5759 5835 0 9925 ] . . . . . . . . . .The 4-5 step reads two (-2,-2,-2) families, one from the first file and the other from the second file, computes the distance and writes a line to the output file. Then it reads two (-2,-2,-1) families, computes the distance and writes another line to the output file; and so on. Each input file contains 125 families; the output file contains 125 lines.
[ -2 -2 -2 2.377977e+005 ] [ -2 -2 -1 3.450332e+005 ] . . . . . . [ 2 2 2 2.263009e+005 ]Though, sometimes an empty family may encounter. Then its counterpart in the other file is ignored, and the corresponding line is missing in the output file.
So, we have two families: the (-2,-2,-2) family from the first file, describing 3 perturbed ELS's matching the word "WHYQDC", and the (-2,-2,-2) family from the second file, describing 8 perturbed ELS's matching the word "HYNTM". How to compare them? We compare each line of one family with each line of the other family, getting 3 * 8 = 24 numbers. (Typically, there are about 100 numbers.) These 24 numbers are just added. Why add them? The numbers are so constructed that they are large when the corresponding ELS's are close. Each pair in close proximity contributes a lot to the sum. Thus, a large sum indicates that there is a close proximity somewhere (maybe, several times). You may say: however, a large sum can be made up of many small contributions. Maybe. As far as I understand, usually, large and small summands differ so much that the accumulation is of little importance. Maybe, instead of adding the numbers we could choose the maximal number. I do not know. Anyway, the sum is called the measure of proximity between two words.
So, the comparison of families is reduced to the comparison of their elements (perturbed ELS's). How to compare, say, the first element "[ 23046 -2037 12861 23046 0 54131 ]" of the first family with, say, the fourth element "[ 17451 13 17451 17503 12197 78064 ]" of the second family? Two factors are taken into account. The first factor is intended to emphasize noteworthy ELS's, that is, ELS's having large domains. The first ELS has its domain [0, 54131], the second [12197, 78064]. Their intersection [12197, 54131] is called the domain of simultaneous minimality for the two ELS's, and its length (decreased by 1), 54131 - 12197 = 41934, is the first factor. (According to the paper [WRR94] it should be normalized by dividing by the length of the Book, 78064, but the program does not divide it, moreover, it multiplies it by 100. I do not know why it does so, but anyway, it is harmless, since all distances are changed proportionally.) It may happen that the domain of simultaneous minimality degenerates into a single point or is empty; then the first factor is set to 1 (and afterwards, still, is multiplied by 100). It is somewhat odd, but harmless. The first factor, described above, is multiplied by the second factor, described below.
The second factor is the most time-consuming part of the WRR algorithm. I'd like to understand it better. The original idea is, to write the text of the Book on a giant cylinder, on which the text spirals down on one long line, and to choose the optimal row length for the two considered ELS's, so that both words fit in a small (that is, as small as possible) region on the giant cylinder. The optimization must be made quickly. You see, it is performed about 100 times for each of the 125 families, all that for a single pair of words! We need something more effective than trying all possible row lengths. The WRR algorithm tries the 20 most promising values of the row length h. These are |d1|, |d1|/2, ... |d1|/10 and |d2|, |d2|/2, ... |d2|/10, rounded to integers; here d1, d2 are skips of the two considered ELS's, and |d1|, |d2| are their absolute values (signs ignored). Each value of h gives a number. As for me, it would be natural, to choose the maximal among the 20 numbers. However, the WRR algorithm adds them.
So, what is the number characterizing the close proximity of two ELS's on the cylinder with a given row length? The following example is borrowed from the paper [WRR94] (see Fig. 2 there); the Hebrew text is written from the right to the left; irrelevant positions are filled by dots:
............ .M.T.N.Y.H.. ..........C. ............ ..........D. ............ ..........Q. ............ ..........Y. ............ ..........H. ............ ..........W. ............Here h = 3876; one ELS (start=24185, skip=-2) matches "HYNTM", the other ELS (start = 28052, skip = 7752) matches "WHYQDC". Note that h = |d2|/2. Note also that the second ELS was seen as a column in "Step 1-2", while the first ELS is a member of the (0,0,0) family shown on "Data 3". We consider three numbers:
| back to Data 4 | Step 4-5 | Data 5 | Program code |
| back to Steps of computation | back to Impossible Facts | back to my home page |