Torah puzzle

Step 4-5

Comparing two words

This is the central step of the computation. (By the way, it is also the most time-consuming step.) The steps 1-2, 2-3, and 3-4 are performed twice: for one word and, independently, for the other word, giving two files of the "Data 4" type. For example, "Data 4" for the word "WHYQDC" (Zedekia):
[   6  78064 ]
[*** -2 -2 -2 ]
[   23046 -2037  12861 23046      0 54131 ]
[   49191   988  49191 54131      0 73576 ]
[   73471    21  73471 73576      0 78064 ]
[*** -2 -2 -1 ]
[     923   950    923  5673      0 78064 ]
[   11942 -2238    752 11942    752   753 ]
[   14060  3846  14060 33290  14060 14061 ]
[   32123 -2030  21973 32123    923 78064 ]
[   32788 -3273  16423 32788  16423 16424 ]
[   37394  3289  37394 53839  21973 57788 ]
[   46623 -5556  18843 46623  18843 18844 ]
[   57788 -3282  41378 57788  21973 78064 ]
[   62337 -9742  13627 62337  13627 13628 ]
[   70073 -13255   3798 70073   3798  3799 ]
[   70397 -12761   6592 70397   6592  6593 ]
[   77016 -3843  57801 77016  41378 78064 ]
[*** -2 -2  0 ]
[   29601  2480  29601 42001      0 60129 ]
[   38926 -6373   7061 38926      0 42001 ]
   .   .   .   .   .   .   .   .   .   .
and for the word "HYNTM" (Matanya):
[   5  78064 ]
[*** -2 -2 -2 ]
[    7675   -24   7579  7675      0 11299 ]
[   11247    13  11247 11299      0 12217 ]
[   12197     5  12197 12217      0 78064 ]
[   17451    13  17451 17503  12197 78064 ]
[   43802    33  43802 43934  17451 46903 ]
[   46839    16  46839 46903  17451 78064 ]
[   59073   -22  58985 59073  46839 78064 ]
[   70939    24  70939 71035  58985 78064 ]
[*** -2 -2 -1 ]
[    1666   -35   1526  1666      0  5835 ]
[    5759    19   5759  5835      0  9925 ]
   .   .   .   .   .   .   .   .   .   .
The 4-5 step reads two (-2,-2,-2) families, one from the first file and the other from the second file, computes the distance and writes a line to the output file. Then it reads two (-2,-2,-1) families, computes the distance and writes another line to the output file; and so on. Each input file contains 125 families; the output file contains 125 lines.
[  -2 -2 -2  2.377977e+005 ]
[  -2 -2 -1  3.450332e+005 ]
  .   .   .   .   .   .
[   2  2  2  2.263009e+005 ]
Though, sometimes an empty family may encounter. Then its counterpart in the other file is ignored, and the corresponding line is missing in the output file.

So, we have two families: the (-2,-2,-2) family from the first file, describing 3 perturbed ELS's matching the word "WHYQDC", and the (-2,-2,-2) family from the second file, describing 8 perturbed ELS's matching the word "HYNTM". How to compare them? We compare each line of one family with each line of the other family, getting 3 * 8 = 24 numbers. (Typically, there are about 100 numbers.) These 24 numbers are just added. Why add them? The numbers are so constructed that they are large when the corresponding ELS's are close. Each pair in close proximity contributes a lot to the sum. Thus, a large sum indicates that there is a close proximity somewhere (maybe, several times). You may say: however, a large sum can be made up of many small contributions. Maybe. As far as I understand, usually, large and small summands differ so much that the accumulation is of little importance. Maybe, instead of adding the numbers we could choose the maximal number. I do not know. Anyway, the sum is called the measure of proximity between two words.

So, the comparison of families is reduced to the comparison of their elements (perturbed ELS's). How to compare, say, the first element "[ 23046 -2037 12861 23046 0 54131 ]" of the first family with, say, the fourth element "[ 17451 13 17451 17503 12197 78064 ]" of the second family? Two factors are taken into account. The first factor is intended to emphasize noteworthy ELS's, that is, ELS's having large domains. The first ELS has its domain [0, 54131], the second [12197, 78064]. Their intersection [12197, 54131] is called the domain of simultaneous minimality for the two ELS's, and its length (decreased by 1), 54131 - 12197 = 41934, is the first factor. (According to the paper [WRR94] it should be normalized by dividing by the length of the Book, 78064, but the program does not divide it, moreover, it multiplies it by 100. I do not know why it does so, but anyway, it is harmless, since all distances are changed proportionally.) It may happen that the domain of simultaneous minimality degenerates into a single point or is empty; then the first factor is set to 1 (and afterwards, still, is multiplied by 100). It is somewhat odd, but harmless. The first factor, described above, is multiplied by the second factor, described below.

The second factor is the most time-consuming part of the WRR algorithm. I'd like to understand it better. The original idea is, to write the text of the Book on a giant cylinder, on which the text spirals down on one long line, and to choose the optimal row length for the two considered ELS's, so that both words fit in a small (that is, as small as possible) region on the giant cylinder. The optimization must be made quickly. You see, it is performed about 100 times for each of the 125 families, all that for a single pair of words! We need something more effective than trying all possible row lengths. The WRR algorithm tries the 20 most promising values of the row length h. These are |d1|, |d1|/2, ... |d1|/10 and |d2|, |d2|/2, ... |d2|/10, rounded to integers; here d1, d2 are skips of the two considered ELS's, and |d1|, |d2| are their absolute values (signs ignored). Each value of h gives a number. As for me, it would be natural, to choose the maximal among the 20 numbers. However, the WRR algorithm adds them.

So, what is the number characterizing the close proximity of two ELS's on the cylinder with a given row length? The following example is borrowed from the paper [WRR94] (see Fig. 2 there); the Hebrew text is written from the right to the left; irrelevant positions are filled by dots:

 ............
 .M.T.N.Y.H..
 ..........C.
 ............
 ..........D.
 ............
 ..........Q.
 ............
 ..........Y.
 ............
 ..........H.
 ............
 ..........W.
 ............
Here h = 3876; one ELS (start=24185, skip=-2) matches "HYNTM", the other ELS (start = 28052, skip = 7752) matches "WHYQDC". Note that h = |d2|/2. Note also that the second ELS was seen as a column in "Step 1-2", while the first ELS is a member of the (0,0,0) family shown on "Data 3". We consider three numbers: this time, the three numbers are: 2, 2, and 21/2. The sum of their squares: 4 + 4 + 2 = 10. The inverse of the sum (1/10) is the number characterizing the close proximity, - the second factor.

Step 4-5 Program code
back to Steps of computation back to Impossible Facts back to my home page