LV6 compression and Codes

Muzozavr · Post by **Muzozavr** » Mon Sep 24, 2012 12:37 am

In case of "several modes of complexity", we will need a header byte then (in the encoding) to determine what mode was used during the compression. Just stating the obvious in case it gets forgotten.

VirtLands · Post by **VirtLands** » Mon Sep 24, 2012 3:41 am

I can imagine there will be header-bytes for all kinds of things.
Header bytes can be like the following:

[...4 header bytes to identify WHAT this file is...][..byte A..][ ... etc... ]

OR

["HUFF"][..byte A..][...HUFFed bits for Objects and 2-byte-Tiles..][..HUFFed 3rd bytes..][..HUFFed 4th bytes..][..HUFFed Signs and other STRINGS..][...etc...]

{ The above way of storing data is only a starter idea and not final. }

Header Byte A consists of appropriate summations of: ___

1 : Use regular Huffman algorithm
2 : Enhance the Huffman algorithm with special functions
4 : Use Semi-Adaptive Huffman algorithm
8 : Use Fully-Adaptive Huffman algorithm
16 : 3rd bytes of Tiles exist, therefore store them (linearly).
32 : 4th bytes of Tiles exist, therefore store them (linearly).
...

In adaptive HUFFMAN the HUFF discards symbols that no longer exist
(become zero-weight) allowing remaining symbols to have even shorter prefix codes.

Definition of Fully-Adaptive Huffman: The HUFFMAN re-balances its tree
at every single symbol encounter to ensure that it always uses the shortest prefix codes (such as "01", "100", "110"...).
Very possible to code this, but the downside is that it very CPU intensive
both to compress and decompress data, (and will take longer to run).

Definition of Semi-Adaptive Huffman: The HUFFMAN re-balances its
tree only when grabbing new symbols (from symbol table)
and when symbols disappear (because their symbol weights become zero.)

Definition of Non-Adaptive Huffman: The HUFFMAN never re-balances its tree. In other words, when a symbol that had a
short prefix code no longer exists, the tree pretends that it is still
there;

My favorite version would be Semi-Adaptive; I believe it's the wisest
choice, but I could do a speed and size contest on all 3 to see how they
compare.

Adaptive Huffman Coding is an advanced topic, you can read more about it on these links:

Binary Essence: http://www.binaryessence.com/dct/en000097.htm
http://www.cs.duke.edu/csed/curious/com ... ehuff.html
Wikipedia: http://en.wikipedia.org/wiki/Adaptive_Huffman_coding

I also plan to add functions to the HUFFman which is something that I've not seen mentioned anywhere on the net.

©

Functions can be as follows:
(=) is a duplicate of previous
(n) Grab a new symbol from symbol table
(>) is one more than previous
(<) is one less than previous
( There might be other worthwhile functions, but I can only think of 4. )

The functions will only be used when their prefix codes are 'shorter' than
the regular method.

Data storing proposal 1: (mentioned above)
The 3rd bytes and 4th bytes of Tiles can be read separately and stored
separately. Later on during decompression they can be rejoined to their lower 2-bytes;
That's very easy to code.
As you reminded, some leading header byte "A" would indicate whether 3rd bytes and 4th bytes exist at all.

Data storing proposal 2:
The 2-byte tiles get read in as SHORTs
and the 3rd byte + 4th byte pair also get read in as a following SHORT.
Supposedly pair-encoding might be more efficient than taking everything
just 1-byte at a time. The SHORTs become HUFFMANed.

Data storing proposal 3:
Popular 2-Byte Tiles get replaced by a 1-Byte representative.
The look-up table becomes part of the compression software
(and no need to store look-up table in compressed code.)
http://en.wikipedia.org/wiki/Lookup_table

Also, I don't know if I already mentioned this, but the symbol table
can also be compressed and stored as HUFFed bits,
-- only the weight table must remain outside of compression
(on the outside of HUFFed bits).

A numeric series of Decoy (or "imposter") symbols can be assumed at the start and "implied" in the weight table.
When the real symbol table is decompressed,
the "decoy" symbol table get replaced by the real one.

S&W = regular symbol and weight table [s:w][s:w][s:w][s:w]....

..and what it looks like after I rip in in half:

[s][s][s][s][s][s].... a series of unique symbols = The S Table
[w][w][w][w][w][w].... a series of unique weights = The W Table

The classical HUFF way: [..huffed data bits..][ S&W Table ]

My improved version: [..huffed data bits..][..huffed S table...][..non-huffed W table..]

Well, this stuff is very technical. I guess that someday I'll actually start it.