The Hacking of LHa for UNIX (1st draft) (Lha-users) - LHa for UNIX

新井です。

更新、現時点で支離滅裂です。解析の方針を立てても、脇道にそれ
てドツボにはまってます。あまりにも頭の中が整理されてないので、
CVSの方はまだ更新してません。

# 自分の愚かさぶりを公表するだけですが。以下現時点の差分です。

Index: Hackinf_of_LHa
===================================================================
RCS file: /cvsroot/lha/lha/Hackinf_of_LHa,v
retrieving revision 1.3
diff -u -u -r1.3 Hackinf_of_LHa

--- Hackinf_of_LHa	5 Jan 2003 21:56:41 -0000	1.3
+++ Hackinf_of_LHa	7 Jan 2003 21:47:26 -0000
@@ -1296,10 +1296,9 @@
 
 static void match_insert()
 {
-    unsigned int off, h, max;
+    unsigned int off, h;
     unsigned int scan_end;
 
-    max = maxmatch; /* MAXMATCH; */
     if (matchlen < THRESHOLD - 1) matchlen = THRESHOLD - 1;
     matchpos = pos;
 
@@ -1315,10 +1314,9 @@
 
     if (matchlen <= off + 2) {
       off = 0;
-      h = hval;
 
       scan_end = (pos > dicsiz) ? pos + off - dicsiz : off;
-      search_dict(pos, hash[h], scan_end, off+2);
+      search_dict(pos, hash[hval], scan_end, off+2);
     }
 
     insert();
@@ -1367,7 +1365,7 @@
 どうも、文字 text[pos-1] を出力しているだけのように思える。文字の出力
 は、slide 辞書法では「辞書から見つからなかった場合」を意味するから、や
 はり最初の予想はあってそうなのだが・・・仕方ないので、output()の処理を
-除いて見よう。これは、lh5, 6, 7 の場合、huf.c:output_st1(c, p) である。
+覗いて見よう。これは、lh5, 6, 7 の場合、huf.c:output_st1(c, p) である。
 現時点で処理の内容を見てもわけがわからないが、結論から言うと第一引数 c 
 は、文字で、第二引数 p は、位置である。冒頭の decode 処理で、文字 c は
 長さも兼ねていると説明済みなので、(そして、text[pos-1] には現時点で文
@@ -1561,8 +1559,9 @@
                                  / maxmatch{256}              maxmatch{256}
                                 pos
 
-                                    <------------------------------>
-                                               remainder
+                                   <------------------------------->
+                                              remainder
+
        |------- 以前のデータ  ---------|--- 新しいデータ  ---------|
 
 ----------------------------------------------------------------------------
@@ -1596,6 +1595,334 @@
 
 これで、一応は slide.c を網羅する事ができた。まだ不明な点は多いが、デ
 バッガで実際の処理を追いかければまたわかることがあるだろう。
+
+・・・しばし、休息・・・
+
+さて、デバッガでと以前は考えていたのだが、あきらめるのはまだ早い(元気
+が出たらしい)、そもそも最初に「デバッガを使わずにどこまで解読できるか」っ
+と冒頭に書いてたのにたった2回の通読でもうあきらめようとしていた。が、
+ここまで書いてきた本書を何度か読み返したが、まだまだ検討の余地はある。
+
+そこで、次からは matchlen に着目して見直しをはかることにした。matchlen 
+は出現頻度の高い変数でそのたびに予想を覆し混乱していたからだ。
+
+encode() の概観を再掲載しよう。この中で、matchlen がどう変わって行くの
+かを見て行く。
+
+unsigned int
+encode()
+{
+    int lastmatchlen;
+    unsigned int lastmatchoffset;
+
+    /* (A) */
+    init_slide();  
+
+    /* (B) */
+    remainder = fread_crc(&text[dicsiz], txtsiz-dicsiz, infile);
+    encoded_origsize = remainder;
+    matchlen = THRESHOLD - 1;
+
+    pos = dicsiz;
+
+    if (matchlen > remainder) matchlen = remainder;
+
+    /* (C) */
+    hval = ((((text[dicsiz] << 5) ^ text[dicsiz + 1]) << 5) 
+            ^ text[dicsiz + 2]) & (unsigned)(HSHSIZ - 1);
+
+    /* (D) */
+    insert();
+
+    while (remainder > 0 && ! unpackable) {
+        /* (E) */
+        lastmatchlen = matchlen;  lastmatchoffset = pos - matchpos - 1;
+        --matchlen;
+
+        /* (F) */    /* (G) */
+        get_next();  match_insert();
+        if (matchlen > remainder) matchlen = remainder;
+
+        /* (H) */
+        if (matchlen > lastmatchlen || lastmatchlen < THRESHOLD) {
+            /* (H.1) */
+            encode_set.output(text[pos - 1], 0);
+            count++;
+        } else {
+            /* (H.2) */
+            encode_set.output(lastmatchlen + (UCHAR_MAX + 1 - THRESHOLD),
+               (lastmatchoffset) & (dicsiz-1) );
+            --lastmatchlen;
+
+            while (--lastmatchlen > 0) {
+                get_next();  insert();
+                count++;
+            }
+            get_next();
+            matchlen = THRESHOLD - 1;
+            match_insert();
+            if (matchlen > remainder) matchlen = remainder;
+        }
+    }
+}
+
+ざっと、matchlen 絡みの動きを中心に見直してみたのだが別の事に気づいた。
+lastmatchlen, lastmatchoffset である。
+
+これまでは off の位置や matchlen の状態に着目していたのだが、実際の書
+き出しを行っていたのは lastmatch〜 の方だった。つまり、off-1 の位置や
+lastmatch〜 の状態がむしろ「現在」の状態で、off や matchlen は一歩ある
+いは 2 歩先の状態だった。今度は off-1 の状態も意識の中に入れよう。
+
+初期化(ループの前)は < 初期状態 > として書いた図を見て、さっさとループ
+の中に入ろう。(E) だ。
+
+        /* (E) */
+        lastmatchlen = matchlen;  lastmatchoffset = pos - matchpos - 1;
+        --matchlen;
+
+        /* (F) */    /* (G) */
+        get_next();  match_insert();
+        if (matchlen > remainder) matchlen = remainder;
+
+(F) まで処理が進んだ所をイメージして、図に起こそう。hval が pos の位置
+から 3 文字を示している事にも注意する。
+
+----------------------------------------------------------------------------
+< ループ開始 >
+                                
+                                dicsiz=2^dicbit               2*2^dicbit
+                                   v                           v   txtsiz
+       +-------------+-------------++------------+-------------+---+
+  text |             |             ||            |             |   |
+       +-------------+-------------++------------+-------------+---+
+       ^                          /  \                         <--->
+       |                      pos-1  pos                       maxmatch{256}
+ matchpos
+       |----- lastmatchoffset -----|
+                                   <-->
+                                   lastmatchlen
+
+                                   <------ remainder -------------->
+
+                                   |--- この位置に最初の  ---------|
+                                        データが読まれている
+
+< 各変数の値 >
+
+  matchlen = 1
+  matchpos = 0
+  pos = dicsiz+1
+  hval = pos の位置から3文字
+
+  lastmatchlen = 2
+  lastmatchoffset = pos - matchpos - 1 (pos-1)
+----------------------------------------------------------------------------
+
+そして、match_insert() が呼ばれ、以下の条件で一致判定を行っていた。
+
+        /* (H) */
+        if (matchlen > lastmatchlen || lastmatchlen < THRESHOLD) {
+
+じっくり考える。match_insert() では、もし見つかったら matchpos,
+matchlen を更新し、見つからなかったら更新しなかった。(match_insert()の
+検索対象は hval つまり、pos の位置から3文字だ)
+
+もし、見つからなかった場合(今は初期状態なので当然見つからない)、
+
+  matchlen{1} > lastmatchlen{2}
+
+の条件は偽だが、それよりもこの場合は、
+
+  lastmatchlen < THRESHOLD
+
+こちらの条件の方が重要なようだ。lastmatchlen の初期値は 2 (THRESHOLDは 
+3)なのだから match_insert() がどういう結果だろうと、(H) の条件は真だ。
+これは正しい処理(H が真であれば、見つからないの意味だった)。(H) の条件
+の右半分 lastmatchlen < THRESHOLD は、とりあえず初期状態のための条件だ
+と言える。
+
+では、仮に 以前の位置で見つかった場合というのを想定してみよう。仮に
+matchpos を dicsiz の位置に置く、lastmatchlen は 5 とでもしておこう。
+(matchlen は、(E) から 4)
+
+----------------------------------------------------------------------------
+< ループ開始 >
+                                
+                                dicsiz=2^dicbit               2*2^dicbit
+                                   v                           v   txtsiz
+       +-------------+-------------+-------------+-------------+---+
+  text |             |             |             |             |   |
+       +-------------+-------------+-------------+-------------+---+
+                                  /             \
+                              matchpos           pos
+         
+                                    |---   ---|
+                                      lastmatchoffset
+                                   <----->
+                                   lastmatchlen{5}
+
+                                                <--- remainder --->
+
+                                   |--- この位置に最初の  ---------|
+                                        データが読まれている
+
+< 各変数の値 >
+
+  matchlen = 4
+  matchpos = dicsiz
+  pos  = 適当
+  hval = pos の位置から3文字
+
+  lastmatchlen = 5
+  lastmatchoffset = pos - matchpos - 1 (pos-1)
+----------------------------------------------------------------------------
+
+ここで、match_insert() した後、どうなるか？
+
+        /* (H) */
+        if (matchlen > lastmatchlen || lastmatchlen < THRESHOLD) {
+
+仮に見つからなかったとしよう。
+
+   matchlen{4} > lastmatchlen{5}
+
+は偽だ。
+
+   lastmatchlen{5} < THRESHOLD{3}
+
+でもあるから (H) の条件は偽だ？？？・・・そうだった。lastmatchlen に注
+目するのだった。(H.1) で、一つ前(pos-1)の文字を書くのだから、この条件
+判断は「真なら一つ前は一致していなかった」を表す。そして、以前は一致し
+ていたから、match_insert() の結果にかかわらず(H) は偽にならなければお
+かしい。
+
+では、仮に match_insert() で見つかった場合を考えよう。matchlen = 10 に
+なったとでもしよう。(H) は真になる。おかしいことが起こった。(H) の条件
+は「以前は一致していたが今回さらに長く一致した」ということを示している。
+
+(H) の条件はなかなか難解だ。素直に条件式を読んでみると
+
+      if (「一つ前の一致長よりも長く一致した」 ||
+          「一つ前は一致していなかった」) {
+
+そうなのだ。素直にこのとおりの条件なのだ。当初 (H) の意味を「見つかっ
+たか、見つからなかったか？」という条件のはずだと先入観を持っていたがた
+めに混乱していた。もう一つ、気づくのが遅すぎたが、match_insert() は、pos 
+の文字列が見つからない場合、matchlen を更新しないが (E) で、
+
+        lastmatchlen = matchlen;
+        --matchlen;
+
+という処理、つまり、matchlen < lastmatchlen の状態にあらかじめしている
+ことから (H) の条件は、
+
+      if (「(match_insert()で文字列が見つかり)
+            一つ前の一致長よりも長く一致した」 ||
+          「一つ前は一致していなかった」) {
+
+っと「(match_insert()で文字列が見つかり)」という条件が隠れている。以前、
+match_insert() で「見つかったかどうかの状態を関数の外に出力として出さ
+ないのはなぜか？」っと思ったのだが、ここでは match_insert() の呼び出し
+前に「matchlen をあらかじめ見つからなかった場合 に更新していた」の
+だ。・・・むむむ。なかなか意外な処理だった。先入観を持ちすぎていたとい
+うこともあるだろう。
+
+まとめると (H) の条件(の左半分)はこうだ「たとえ、pos-1 の位置の文字列
+が辞書にあったとしても、pos の位置の文字列がより長く辞書と一致するなら 
+pos-1 の文字を平文で出力する」
+
+もう一つここで、先入観があることに気づく。match_insert() は pos の位置
+の文字列を辞書から検索する処理だとここで考えていたが、そうではなかった。
+何やら先読みらしい事をしていたではないか。
+
+もう一回、match_insert() の内容を確認し、(H) の正確な条件を考えよう。
+
+改良版 match_insert() を再掲する。
+
+static void match_insert()
+{
+    unsigned int off, h;
+    unsigned int scan_end;
+
+    if (matchlen < THRESHOLD - 1) matchlen = THRESHOLD - 1;
+    matchpos = pos;
+
+    off = 0;
+    for (h = hval; too_flag[h] && off < maxmatch - THRESHOLD; ) {
+        h = ((h << 5) ^ text[pos + (++off) + 2]) & (unsigned)(HSHSIZ - 1);
+    }
+    if (off == maxmatch - THRESHOLD)
+        off = 0;
+
+    scan_end = (pos > dicsiz) ? pos + off - dicsiz : off;
+    search_dict(pos, hash[h], scan_end, maxmatch);
+
+    if (matchlen <= off + 2) {
+      off = 0;
+
+      scan_end = (pos > dicsiz) ? pos + off - dicsiz : off;
+      search_dict(pos, hash[hval], scan_end, off+2);
+    }
+
+    insert();
+}
+
+最初の
+
+    off = 0;
+    for (h = hval; too_flag[h] && off < maxmatch - THRESHOLD; ) {
+        h = ((h << 5) ^ text[pos + (++off) + 2]) & (unsigned)(HSHSIZ - 1);
+    }
+    if (off == maxmatch - THRESHOLD)
+        off = 0;
+
+この部分、まず、off は、0 から maxmatch - THRESHOLD - 1 (252) の範囲の
+値を持つ。for の中のハッシュ関数から h は、pos+off の位置の 3 文字を表
+す。(ただし、例外がある。h が pos + maxmatch - THRESHOLD の文字列を表
+す場合、off は 0 だ。うむむこの意味はとりあえず無視しよう)
+
+そして、too_flag の内容によって、off の値が決まる。
+
+too_flags[] はどのような場合に更新されるか slide.c 全体を確認してみる
+と、まず初期状態ではすべての要素が 0 だ。そして、update() でも、すべて
+の要素が 0 に初期化される。これらはともに初期状態であり、境界の状態で
+もある。これ意外では search_dict() と名付けた関数で更新されるだけだ。
+この意味を確認しよう。search_dict() に該当する処理を再掲する。
+
+        while (scan_pos > scan_end) {
+            chain++;
+
+            if (text[scan_pos + matchlen - off] == text[pos + matchlen]) {
+               ... 省略 ...
+            }
+            scan_pos = prev[scan_pos & (dicsiz - 1)];
+        }
+
+        if (chain >= LIMIT)
+            too_flag[h] = 1;
+
+LIMIT は、256 だ。256回ハッシュのリンクをたどっても辞書から文字列が見
+つからない場合に、too_flag[h] は 1 になる。つまり、h の文字が一度見つ
+からなかったとわかると、そのことを too_flag[h] に記録しているどうやら
+これは高速化(というよりは最悪の状態を避けるためのもの)の処理のようだ。
+次回、同じ轍を踏むのを避けているものと思われる。確認しようもう一度
+for ループの処理を見る
+
+    off = 0;
+    for (h = hval; too_flag[h] && off < maxmatch - THRESHOLD; ) {
+        h = ((h << 5) ^ text[pos + (++off) + 2]) & (unsigned)(HSHSIZ - 1);
+    }
+    if (off == maxmatch - THRESHOLD)
+        off = 0;
+
+hval が見つからない事がわかっているなら off を進める。
+
+# ん？insert() の処理で、hash[] を更新したとき too_flag[] を更新してい
+# ないのはなぜだろう？前回match_insert() を呼び出して(too_flagを更新し
+# てから)次の match_insert() までには当然 hash[] は更新される。
+# too_flag[hval]==1だとしても、一致文字列が見つからないとは限らないの
+# では？うーん、またくじけそうだ。


で、どうも match_insert() の改造版じゃない方(改造版はoffに関
してまだ不完全)を見ると pos+off の位置の文字列の辞書上の位置 
scan_pos を求め scan_pos - off したものと pos 位置の文字列を
比較しているので、結局比較対象は pos 位置の文字列。
match_insert() はどう転んでも pos の位置の文字列の辞書検索な
のは間違いなさそう。(H) の条件はやっぱりあってるか？

今回の更新分は破棄しようかなあ。

--
新井康司 (Koji Arai)



LHa for UNIX
Fork
lha
olha

[Lha-users] The Hacking of LHa for UNIX (1st draft)

LHa for UNIX Forklhaolha

[Lha-users] The Hacking of LHa for UNIX (1st draft)

LHa for UNIX
Fork
lha
olha