The compressor shrinks its input to about 60% of the original size on
average, which is on par with "gzip".  It seems that you cannot do
much better when compressing compiled binaries.  This means that the
break even point for using compressed images is reached once the
uncompressed size approaches 1.5kB.  We can stuff more than 12kB into
an 8kB EPROM and more than 25kB into a 16kB EPROM.  As there is only
32kB of RAM for both the uncompressed image and its BSS area, 32kB
EPROMs will hardly ever be required.
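
As a rough check of these figures (the size of the decompressor stub
is not stated here; assume it occupies about 0.6kB of EPROM on its
own):

  break even:  x = 0.60*x + 0.6kB          =>  x = 1.5kB uncompressed
  8kB EPROM:   0.60*12kB + 0.6kB = 7.8kB   =>  fits
  16kB EPROM:  0.60*25kB + 0.6kB = 15.6kB  =>  fits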

The compression algorithm uses a 4kB ring buffer for buffering the
uncompressed data.  Before compression starts, the ring buffer is
filled with spaces (ASCII character 0x20).  The algorithm tries to
find repeated input sequences of a maximum length of 60 bytes.  All
256 different input bytes plus the 58 (60 minus a threshold of 2)
possible repeat lengths form a set of 314 symbols.
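
The buffer handling and the match search can be pictured roughly as
follows.  This is only an illustrative sketch: the names are mine,
and the example code uses a more efficient search than this
brute-force scan, but the constants match the description above.

  #include <string.h>

  #define N         4096      /* ring buffer size (4kB)                  */
  #define F         60        /* longest repeated sequence that is used  */
  #define THRESHOLD 2         /* matches must be longer than this        */

  static unsigned char ring[N];

  static void init_ring(void)
  {
      memset(ring, 0x20, N);  /* fill with spaces before compression     */
  }

  /* Brute-force search for the longest match between the next input
   * bytes and the current ring buffer contents.  Returns the match
   * length (useful only if greater than THRESHOLD) and stores the
   * match position in *pos. */
  static int find_match(const unsigned char *in, int avail, int *pos)
  {
      int i, len, best = 0;

      for (i = 0; i < N; i++) {
          for (len = 0; len < F && len < avail; len++)
              if (ring[(i + len) & (N - 1)] != in[len])
                  break;
          if (len > best) {
              best = len;
              *pos = i;
          }
      }
      return best;
  }
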
These symbols are adaptively Huffman encoded.  The algorithm starts
out with a Huffman tree that assigns equal code lengths to each of
the 314 symbols (slightly favoring the repeat symbols over the
symbols for regular input characters), but the tree is updated
whenever the frequency of any of the symbols changes.  Frequency
counts are kept in 16bit words until the total number of encoded
symbols reaches 2^15; then all frequency counts are halved (rounding
to the bigger number).
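
The periodic halving of the counts might look roughly like the
following sketch; the names are mine, and the rebuild of the Huffman
tree from the halved counts is not shown.

  #define N_CHAR   314        /* 256 literals + 58 repeat lengths (3..60) */
  #define MAX_FREQ 0x8000     /* 2^15: total count that triggers halving  */

  static unsigned short freq[N_CHAR];   /* 16bit frequency counters       */

  /* Called when the cumulative count reaches MAX_FREQ: halve every
   * count, rounding to the bigger number, then rebuild the tree. */
  static void rescale_counts(void)
  {
      int i;

      for (i = 0; i < N_CHAR; i++)
          freq[i] = (unsigned short)((freq[i] + 1) / 2);
  }
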
For unrepeated characters (symbols 0..255) the Huffman code is
written to the output stream.  For repeated sequences the Huffman
code denoting the length of the sequence is written out, and then
the index of the match in the ring buffer is computed.  From this
index, the algorithm computes the offset relative to the current
index into the ring buffer.  For typical input data one would expect
short to medium range offsets to be more frequent than extremely
short or long range offsets, so the 12bit (for a 4kB buffer) offset
value is statically Huffman encoded using a precomputed Huffman tree
that favors those offset values that are deemed to be more frequent.
The Huffman encoded offset is written to the output data stream,
directly following the code that gives the length of the repeated
sequence.
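
Putting the two output cases together, the encoding step could be
sketched as below.  put_symbol() and put_position() are hypothetical
stand-ins for the adaptive and the precomputed Huffman encoders, and
N and THRESHOLD are the constants from the first sketch.

  extern void put_symbol(int symbol);     /* adaptive Huffman, 314 symbols */
  extern void put_position(int offset);   /* static Huffman, 12bit offset  */

  /* Emit either one literal byte or one (length, offset) pair.
   * 'r' is the current write index into the ring buffer. */
  static void emit(int r, int match_pos, int match_len,
                   unsigned char literal)
  {
      if (match_len <= THRESHOLD) {
          put_symbol(literal);            /* symbols 0..255                */
      } else {
          /* repeat lengths 3..60 map onto symbols 256..313 */
          put_symbol(256 + match_len - (THRESHOLD + 1));
          /* offset relative to the current index; the exact bias
           * convention is an implementation detail */
          put_position((r - match_pos) & (N - 1));
      }
  }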

This algorithm, as implemented in the C example code, looks very good
and its operating parameters are already well optimized.  This also
explains why it achieves compression ratios comparable with "gzip".
Depending on the input data, it sometimes excels considerably beyond
what "gzip -9" does, but this does not appear to be typical.  There
are some flaws in the algorithm, such as the limited buffer sizes,
the adaptive Huffman tree, which takes very long to adjust if the
input characters experience a sudden change in distribution, and the
static Huffman tree for encoding offsets into the buffer.  The slow
adaptation of the Huffman tree is partially counteracted by
artificially limiting the frequency counts to 16bit precision, but
this does not come into play until 32kB of data has been compressed,
so it does not have any impact on our use for "etherboot", because
the BOOT Prom does not support uncompressed data of more than 32kB
(c.f. doc/spec.doc).

Nonetheless, these problems do not seem to affect compression of
compiled programs very much.  Mixing object code with English text
would not work too well, though, and the algorithm should be reset in
between.  We might actually gain a little improvement if text and
data segments were compressed individually, but I have not
experimented with this option yet.