LV6 compression and Codes

VirtLands · Post by **VirtLands** » Sun Aug 19, 2012 11:31 pm

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Welcome to the LV6 compression Topic,....

In order to save some space, I've shrunk some of these old posts.

To see the most recent post, simply scroll down.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

The mission is to create an efficient algorithm (or abstract code)
to compress (smash down) an LV6 level as small as possible,

and then convert the resultant bytes into a polite
ASCII sequence for displaying here in ordinary posts.

The result can also contain complex UNICODE sequences as well,
if we wish to get that advanced.

Why?, Because this sequence can be 'copied', and deciphered by an uncompressor given to each of us,
and could possibly do away with "attached levels".

Clues:
The (compressed &) displayed code should avoid containing any control characters
because of the havoc that control characters create when displayed.

Control Characters are such as the following:

characters 0 to 32 (decimal)
and
characters 127 to 160 (decimal)

There may be more control characters than that, I'm not sure.

For some free expert software on displaying all unicodes
download SC Unipad text editor: http://www.unipad.org/main/

The best compression algorithm for compressing anything is
the Huffman algorithm:
http://en.wikipedia.org/wiki/Huffman_coding
https://cs.nyu.edu/~jchen/datastructure ... torial.htm

There are a few variations on the Huffman to make it smash things
down even more; This is an advanced topic for later discussion.

I had a notebook with some ideas of mine that would improve the
Huffman algorithm, ... and I lost the notebook.

It would not be logical to use commercial software such as WinZip, etc,
for the compression phase, because we can compress it even better.

With our knowledge of the LV6 form, we can smash it down even smaller
than regular software can. Trust me, we can.

So far, part of the most challenging part so far, is to create code to
write an arbitrary number of bits to a file, where the number of bits varies and is not always a multiple of 8.

If you need a month or more to think about this, that's fine,
there's no rush to come up with ideas.

Potentially the LV6 can be read directly from the PCs clipboard,
AND it may be slightly altered so that Author or UserName
data is forced into it, (even if the author=unknown).
When the "LV6" is read from a file or clipboard, it will know automatically
which subdirectory within "CustomLevels" to place itself in, and it will do
this automatically without the user going through the whole
"DOWNLOAD-UNZIP-COPY PROCESS."

The "uncompressor" will have to be set up and shown once where the
location is of "CustomLevels", and from there on results will be automatic.

The process can be repeated with CUSTOMTEXTURES, etc.
There is the option of allowing sub directories in
CUSTOMTEXTURES, CUSTOMMODELS, etc.
It can already be done artificially, but not by the default editor.

StinkerSquad01 · Post by **StinkerSquad01** » Mon Aug 20, 2012 1:00 am

So you're saying we want to find a way to compress a .LV6 files into a string of ASCII Characters? That would pretty cool, and eliminate uploading attachments, indeed.

tyteen4a03 · Post by **tyteen4a03** » Mon Aug 20, 2012 4:05 am

Not sure what good this brings. LV6 file itself is already a compressed 2D map with each byte representing a tile. Turning it into ASCII is just a matter of encoding.

However, a .rtwl file would be useful to create a self-sufficient level file that contains all the files needed to run the level.

(and no, what VirtLands is trying to say is LV6 Compression, not a file like .wa2 file with extra models and stuff built-in)

EDIT: Oh, misunderstood your post. You're trying to represent a LV6 file with ASCII. The easiest way to do this would be to use Base64 encoding. Since the extractor is not built into the game, the performance of this extractor is irrelevant.

Muzozavr · Post by **Muzozavr** » Mon Aug 20, 2012 4:24 pm

This seems interesting.

For the C programmers, here is a potentially awesome starting point:

http://bcl.comli.eu/home-en.html

It's a library that has RLE, Huffman, Rice, LZ77 and Shannon-Fano methods implemented.

EDIT: Also, even though it requires a method not implemented in the library:

The .LV6 format encodes tiles and objects separately. That's why an object (example: rainbow coin) and a tile (example: ice) can co-exist on the same spot in the level.

While the objects are encoded with one byte for each, the tiles are encoded with two bytes each. AKA byte pairs.

And in a real level, they aren't random, since certain tiles (example: walls, floors, etc) are likely to occur over and over in any given level. So it makes a lot of sense to compress the TILE part of the .LV6 file with byte-pair encoding.

Here's a BPE library I've found:
http://scz-compress.sourceforge.net/

Apparently it also solves the "no free bytes" problem that plagues normal BPE encoders.

VirtLands · Post by **VirtLands** » Mon Aug 20, 2012 8:11 pm

It's so nice to see a plan come together with the replies..

I was inspired by the way the N Game shares files by letting people
copy "the code" straight from the post.

I was thinking that we may be able to do the same stuff, and yet
take it a step further -- making it more efficient by compressing it.

Notice here in Bl3wolf's level that it takes 1672 bytes to share the 'Broken spaceship' level.
http://www.nmaps.net/216087

Any file stored on a disk takes more space than the actual file contents.
"Broke SpaceShip.TXT" is actually 1672 bytes; IF we were to to store it as a file on a drive then it would take 4096 bytes.
This further proves my point that sharing levels within a post as Text is brilliant.
___________________________________________________________________________________________________________________

In Exhibit A, I loaded {Frankie.LV5} into the hex editor.

Frankie.LV5 = 3969 bytes long.

My estimate is that with a compression routine
we can make Frankie about 32 times smaller, resulting in ~ 100 bytes perhaps.

(A program like Winrar or WinZip will only compress it into 1240 bytes!)

And then there's the other complication of making sure that the resultant
compressed code (bytes) contains no whitespaces or control characters,
so we'll have to tweak that TXT so that it looks proper.

Notice all the zeroes in the screenshot, one of the first steps to compress it would be to chop away a lot of the zeroes,
do more more technical tricks, and later apply something like the Huffman routine.

A lot of these Tiles and Objects are stored in INTs when they can maybe just be stored in a byte.

Exhibit A:

tyteen4a03 wrote:Not sure what good this brings. LV6 file itself is already a compressed 2D map with each byte representing a tile..

An LV6 uses 4 bytes to represent each tile, and 4 bytes to represent each object, [4 bytes=1 int].

tyteen4a03 wrote:However, a .rtwl file would be useful to create a self-sufficient level file that contains all the files needed to run the level.
The easiest way to do this would be to use Base64 encoding

Sounds like a good idea; An idea of mine was to keep the same LV6 format and append data to the end of the file.
As for the Base64 stuff, well that has possibilities too.
Would that avoid control-characters?

StinkerSquad01 wrote:So you're saying we want to find a way to compress a .LV6 files into a string of ASCII Characters? That would pretty cool, and eliminate uploading attachments, indeed.

Yup, that's the ultimate plan.

One of the challenges is making the Huffman code itself,
and then there's the little problem of sending variable length
bit sequences to a file.

In this example, the Huffman algorithm outputs variable length
bit sequences, which are not always groups of 8. [8 bits=1 byte]

a stream of bits:

1..01010..100..001..10100000..011..

So, for example, first a '1-bit' is spit out, then '5-bits', then '3-bits', and so on, and if the total number of bits is not a multiple
of 8, then some 'padding' will be added.

The Huffman algorithm works by counting identities, and weights, ..

http://www.ccs.neu.edu/home/jnl22/oldsi ... /jeff.html
http://mathworld.wolfram.com/HuffmanCoding.html
http://en.wikipedia.org/wiki/Huffman_coding

"Huffman coding works on a list of weights {w_i} by building an extended binary tree with minimum weighted external path length and proceeds by finding the two smallest ws, w_1 and w_2, viewed as external nodes, and replacing them with an internal node of weight w_1+w_2. The procedure is them repeated stepwise until the root node is reached. An individual external node can then be encoded by a binary string of 0s (for left branches) and 1s (for right branches). "

Muzozavr wrote:This seems interesting.
For the C programmers, here is a potentially awesome starting point:

Thanks for the reference C code. I currently plan to do everything in Blitz3D.

Muzozavr · Post by **Muzozavr** » Mon Aug 20, 2012 8:17 pm

Can you share all the inner details of the LV6 format that you know of? I thought that tiles are encoded with two bytes (for example: A5:08, I don't know what that is... if it IS anything) and objects with one (like 07 or 2B)

Also, there's this huge "Stinky and Loof Level file" header -- if we do actually make a program in the end, can't we just strip that header away, so that the decompression program puts it back? For a simple header, it IS pretty beefy.

Also Huffman coding (or any other entropy coding system) should be applied LAST in the list. If you take a piece of data and do Huffman coding on it, the result will be harder to compress with other algorithms.

VirtLands · Post by **VirtLands** » Mon Aug 20, 2012 10:56 pm

Muzozavr wrote:Also Huffman coding (or any other entropy coding system) should be applied LAST in the list. If you take a piece of data and do Huffman coding on it, the result will be harder to compress with other algorithms.

Nothing in an LV6 file is stored as a BYTE or SHORT, everything is
stored as INTs, where an INT=4 bytes.

The way strings are stored [in LV6s] is as follows:

First comes the length of the string, [stored as an INT],
followed by the characters, [bytes]

As an example: the string "cat" is stored as follows: (the length of "cat"=3)

When a string is NULL (does not exist) it is simply stored as [00 00 00 00].

Also, the level's INTs are stored in Little-Endian fashion where the
least-significant byte is stored first, and higher bytes follow
.
http://people.cs.umass.edu/~verts/cs32/endian.html

As an example, in Frankie.LV5,

the width of the level =14(decimal) = 0E(hex)
the height of the level =25(decimal) = 19(hex)
Notice that in the hex editor, that the 0E is followed by 3 zero bytes, which means that 0E is stored in Little-Endian mode.
The 3 zeroes that follow means that 0E is an integer.
Next comes 19(hex) = level height, also stored as an INT.

And, following the 19(hex) is [C8 00 00 00] which is the code for
WALL 1, and so on.

Muzozavr wrote:Also, there's this huge "Stinky and Loof Level file" header -- if we do actually make a program in the end, can't we just strip that header away, so that the decompression program puts it back? For a simple header, it IS pretty beefy.

I think it would be perfectly fine to compress the header too, if
you can find some clever way to do it.

In Summary, the data in an LV6 is stored in this order:
__________________________________________________

ID STRING: "Stinky & Loof Level File v6", (or v4, or v5)
FileName STRING: "FRANKIE" (without the .LV6)

INT: (mysterious 4 bytes that I have never deciphered,
it's probably a randomly generated serial # used to make the file "unique".)

Title STRING: "Visiting Frankie"

INT: always set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomHouse

INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomModel

INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomTexture

INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomBackground

INT: read in Timer seconds (24000 seconds = 10 hours

)
INT: read in Style, (00 - 09) = { Cave,Sand,Wood,Purple,Castle,Jade,Spooky,Garden,Aztec,Custom }

INT: read in Background, (00 - 0A) ={ Sky,Forest,Walls,Stars,Flat,Water,Lava,Warp,City,Rainbow,Custom }

INT: level width
INT: level height

...followed by a 2D (INT) array of Tiles
...followed by a 2D (INT) array of Objects

Followed by an array of 20 SIGN strings:
SIGN$[1], SIGN$[2], SIGN$[3], SIGN$[4]....

Remember that when a SIGN does not exist it is simply stored as [00 00 00 00]

The position of the data relates to the SIGN number, so
sign 1 will be stored first, followed by sign 2, followed by sign 3...

..and next is the MUSIC INT, which is the last data in the file.

INT: Music
____________________________________________________________

Here's a Blitz3D program I made a Blitz3d to read LV6 files.
( It also reads LV4 and LV5, the format is the same. )

It is the SOURCE CODE.
You'll need to modify the variable FileIN$
so that it studies the LV6 file of your choice.
You'll need to download Blitz3D demo to run it:
http://www.blitzbasic.com/file/get.php? ... mo1101.exe
or
http://www.blitzbasic.com/Products/blitz3d.php

Muzozavr · Post by **Muzozavr** » Tue Aug 21, 2012 12:52 am

I'm totally not a programmer, so I can't implement any of the stuff that I'm going to say here. I don't even know how hard it is to implement anything below.

The "always there" parts of LV6 could be taken away from the file and put into the program itself, but I'm assuming that (other than the header, maybe) we cut as little as possible on this phase 1 of the plan.

Also byte-pair encoding is usually done in multiple passes, while many other algorithms are single-pass.

Comments in BOLD.

VirtLands wrote:*snip*

C8 00 00 00 which is the code for
WALL 1, and so on.

So since you always gave us hex mutations in the form of byte pairs, I take it that the third and fourth bytes are always zero? What happens if they aren't zero?

Same for objects, actually. You gave us hex codes as single bytes. So for objects, the last three bytes are all zero? What happens if they aren't zero? (I'll return to this with the keyphrase "assuming zeroes")

*snip*

In Summary, the data in an LV6 is stored in this order:
__________________________________________________

ID STRING: "Stinky & Loof Level File v6", (or v4, or v5)

Since it's identical for all levels (except for the version number) I suggest we remove it from the file completely and replace it with some short-short header (like "KZX") so that the program knows where to put the header back.

FileName STRING: "FRANKIE" (without the .LV6)

Keep as-is. We can't make any assumptions and the string is too short for general algorithms.

INT: (mysterious 4 bytes that I have never deciphered,
it's probably a randomly generated serial # used to make the file "unique".)

Keep as-is. They are probably needed for score-keeping. We don't want any problems.

Title STRING: "Visiting Frankie"

Most likely, keep as-is. No assumptions to be made and probably too short for general-purpose algorithms. If the title is long, maybe something can work, but I doubt it.

INT: always set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomHouse

INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomModel

INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomTexture

INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomBackground

What happens if it's set to something else? If nothing (other than a MAV), then they are essentially four bits. So we can put all four INTs in a single BYTE. The names of custom houses/models/textures/backgrounds are to be kept as-is.

INT: read in Timer seconds (24000 seconds = 10 hours )

Keep as-is. Due to hex-editing, we can't assume it's convertible to a byte value.

INT: read in Style, (00 - 09) = { Cave,Sand,Wood,Purple,Castle,Jade,Spooky,Garden,Aztec,Custom }

What happens if you put anything higher than 09? If nothing, then make it BYTE.

INT: read in Background, (00 - 0A) ={ Sky,Forest,Walls,Stars,Flat,Water,Lava,Warp,City,Rainbow,Custom }

For some reason, 0B is also stars. 0C and anything higher is pure black. Make it BYTE.

INT: level width
INT: level height

I know, hex-editing, "we can't assume" and all... but do those two *really* need to be INT? Can anyone ever need anything above 255x255?

...followed by a 2D (INT) array of Tiles

Assuming zeroes, one could halve the size instantly by removing these extra zeroes. Then the INTs get converted to byte pairs. Then we can do a modified RLE that works with a byte pair instead of a byte. Or the array can be subjected to byte-pair encoding. I don't know what would be more efficient. We could do both, but then I don't know what order is best.

...followed by a 2D (INT) array of Objects

Assuming zeroes, one could make this 1/4 of the original size simply by removing these extra zeroes. There's still going to be a lot of zeroes left to signify a "null" object. Since object rows are also common, a standard RLE would probably be a good idea. Afterwards, though, I'm at a loss for anything sneaky. Maybe it will be time for the huffman then?

Followed by an array of 20 SIGN strings:
SIGN$[1], SIGN$[2], SIGN$[3], SIGN$[4]....

Remember that when a SIGN does not exist it is simply stored as [00 00 00 00]

So it's a bunch of zeroes then? Wow, that's really stupid. Easiest solution is to RLE all the zeroes. Maybe I can think of something more efficient later.

The position of the data relates to the SIGN number, so
sign 1 will be stored first, followed by sign 2, followed by sign 3...

If one can figure out a compact way to write which sign is which, you could group all existing signs together even if it's something like 3, 4, 7, 9 and 19. Then all the zeroes are together and RLE becomes AMAZING for this portion.

..and next is the MUSIC INT, which is the last data in the file.

INT: Music

Value "0" does not always default to track 6. If you play such a level before playing any others, it will be silent. Used by Mark in "The Song of Silence". So the two values have to be distinguished. Again, needs to be made BYTE.

(Also, since there are only seven distinct values, one could put this in the same byte as the four INTs for custom houses/models/textures/backgrounds. Then we'll need 7 bits out of 8 in that byte.)

*snip*

And afterwards, we're huffmaning the whole thing and then doing some conversion to an ASCII string. Right?

That's my quick rundown, anyway... I hope that someone else joins the discussion and improves this.

Additions:

1) Since our compression format is going to be quite screwy, we need some sort of a separator symbol (or string) that separates different parts of this structure. That separator has to be something that cannot appear anywhere else in the level. I propose an INT separator that looks like this:

[any-byte-that-corresponds-to-a-control-character][FF][AA][DD]

Control characters cannot occur in an ASCII strings, so we can be safe from the separator appearing in the level title or somewhere else like that. As for the rest of the format, as far as I understand, FFAADD is a byte triplet that can't appear anywhere in a working (non-MAVing) level. And it's easy to remember.

(And since we're doing the ASCII conversion on a later stage, the control character won't be a problem)

2) One could make an optional flag for lossy compression with "Lose decorative object variations". Then all walls/mushrooms/lampposts/etc would become WALL 1 and all floors would become FLOOR 1. (Exception: the invisible "wall X" is often used as part of custom models, so it stays as-is. "Deep wall" stays as-is, too, for obvious reasons) This would make byte-pair encoding much more efficient for those parts.

3) Another option is to trim away text from signs that aren't actually in the level. If you put a sign on a level, enter some text in it and then remove the sign, the text stays in the file, which, IMO, is kinda weird. Some people might want to embed information in this way, though. So this needs to be optional.

4) In "The Reader Is Warned" Mark has edited the structure of one of his signs to make the game crash as soon as the player tries to read it. We have to keep this possibility in mind. Actually, in general, we should try to explore what happens if certain format specifications are violated with hex-editing. If the result loads at all, we have to keep it in mind.

5) For RLE, I just found out about its PackBits version, which might be better for some of our situations: http://en.wikipedia.org/wiki/PackBits

Muzozavr · Post by **Muzozavr** » Tue Aug 21, 2012 10:49 am

Double post, but the first post is unreadably long enough as it is.

For efficient "binary to ASCII" conversions I've found these:

1) base91: http://base91.sourceforge.net/ (they say the overhead is 14-23%)

2) b-news: http://b-news.sourceforge.net/ for algorithm description, http://bnews-plus.sourceforge.net/ for a version that's somehow better. (they say the overhead is only 3.5, but read on the first site for its drawbacks -- I don't know if it's suitable or not)

tyteen4a03 · Post by **tyteen4a03** » Tue Aug 21, 2012 3:41 pm

Seeing how this is going, the correct term for this project would be serialization.

The mysterious INT you have mentioned may be a magic byte, or a separator.

Muzozavr · Post by **Muzozavr** » Tue Aug 21, 2012 6:06 pm

I think Virtlands wants to first compress the file (so that it becomes much smaller) and only then serialize it by converting it to an ASCII string.

VirtLands · Post by **VirtLands** » Tue Aug 21, 2012 11:20 pm

tyteen4a03 wrote:Seeing how this is going, the correct term for this project would be serialization.
The mysterious INT you have mentioned may be a magic byte, or a separator.

Hmm,.. "magic byte" seems very plausible, and I think "separator", less likely.

One thing I noticed about those 4-bytes is that every time you save the level (LV6), those 4 bytes always change,
even if nothing else changes in that level. But does simply playing the level change those bytes too? I forgot to check that.
My suspicion is that it is some kind of random serial number planted there to make the file 'unique' and different from a 'copy'.
I doubt that it is a 'time-stamp', I already check that, and it doesn't seem to be a time-code.

Also, I think it's best to compress an 'identical' copy of an LV6, rather than just a 'healthy remake'.
The plus side is that if the CRC, (cyclic redundancy check) is preserved,
then this helps in later detecting duplicates.

"Duplicate-detection" software would mark them as different if even 4 bytes differ.
For practical purposes, if two levels have identical blocks of the Tiles & Objects,
and differ only in Header, texture, music, etc., then perhaps they can
be classified as "nearly identical".

Muzozavr wrote:Double post, but the first post is unreadably long enough as it is.
For efficient "binary to ASCII" conversions I've found these:
1) base91: http://base91.sourceforge.net/ (they say the overhead is 14-23%)
2) b-news: http://b-news.sourceforge.net/ for algorithm description, http://bnews-plus.sourceforge.net/ for a version that's somehow better. (they say the overhead is only 3.5, but read on the first site for its drawbacks -- I don't know if it's suitable or not)

That Base91 concept is so clever, I did not even know that existed, that someone else already thought of
-- converting binary to a displayable ASCII.
Thanks for the link; I'm not sure that overhead even matters for something as simple as a Blitz3D program. I will gradually read into it,
and see if Base91 makes this project more complex or less.

Three characters were banished from the list, which are:
- (dash, 0x2D), \ (backslash, 0x5C), ' (apostrophe, 0x27)
it looks like white spaces are also banished, need to avoid the 'space' and 'tab' too.

By the way, this translation code can be made very easily in Blitz3D (or also C); It's as simple as making a 'lookup table' (array) .
Now, why someone chose 91 elements to preserve is something to wonder about.
What if I create a 256 element code, to exactly replace each byte with something else?

I'll come back to this subject and study it later.

Just thought of this, ... if we would want to add some 'style' or 'flair'
to the TEXT CODE, then we could also use UNICODE, it's just a thought.

Muzozavr wrote: The "always there" parts of LV6 could be taken away from the file and put into the program itself, but I'm assuming that (other than the header, maybe) we cut as little as possible on this phase 1 of the plan.
Also byte-pair encoding is usually done in multiple passes, while many other algorithms are single-pass.

The "always there" parts can be "assumed", yup, that's a sharp idea.
Okayyy, I don't know what 'byte-pair' encoding is, but if you say so.

If by that you mean replacing a common 2-byte sequence with a 1-byte "stand-in", then, yeah.

Why would we need "multiple passes" ? ?

Muzozavr wrote:So since you always gave us hex mutations in the form of byte pairs, I take it that the third and fourth bytes are always zero? What happens if they aren't zero?

Same for objects, actually. You gave us hex codes as single bytes. So for objects, the last three bytes are all zero? What happens if they aren't zero? (I'll return to this with the keyphrase "assuming zeroes")

Great observation; I was being absent-minded and almost forgot about it.

It appears that most Tiles can and must fit into 2-bytes..

I don't think there have ever been any 3-byte Tiles, except for the
irrelevant few that cause MAVs. (Memory-access-violation).

Take as an example the 'water bridges': ( data is Little-Endian. )

Water Bridge #1 = 58:02 = 600 (decimal) :

Water Bridge #2 = 59:02 = 601 (decimal) :

Water Bridge #3 = 5A:02 = 602 (decimal) :

We'll have to take into account that many common Tiles are 2-bytes,
and since 600 does not fit into 1-byte, then two bytes are needed.

Which brings up the complex possibility of inventing a 'substitution table' that replaces 2-byte Tiles with 1-byte.
Oops, that wouldn't work either because there are more than 255 Tiles!.

But, maybe we can make a dynamic decision on that aspect during HuffMan time.

As for objects, there have never been any 2-byte objects. All Objects have been 0-255.
And, statistically, the most common object is the 0-object.

____________________________________________________________
INT: always set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomHouse
INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomModel
INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomTexture
INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomBackground

Muzozavr wrote:What happens if it's set to something else? If nothing (other than a MAV), then they are essentially four bits. So we can put all four INTs in a single BYTE. The names of custom houses/models/textures/backgrounds are to be kept as-is.

That's a great idea, (you're thinking like a programmer).
You just compressed 4 bytes into a nibble. I'm convinced that it is a
TRUE/FALSE field, with usual 0 or 1 settings. So, I opened up a test level
in the hex editor and replaced the 1 with a 2, ... then played it in the 'game'. Nothing out of the ordinary happens.
So, the "convert 4-bytes into a nibble idea" is great, and will use that if/when I get this program started.
If there are any other 'bytes' out there that can be turned into 'bits', then that is awesome.

Muzozavr wrote:INT: read in Timer seconds (24000 seconds = 10 hours Smile)
Keep as-is. Due to hex-editing, we can't assume it's convertible to a byte value.

In this case, since the Timer can go well beyond 255 seconds,
we can do something crazy and maybe use varints to store it.

255 seconds = (hex)FF = 1 byte = 4.25 minutes.
65535 seconds = (hex)FFFF = 2 bytes = 1092.25 minutes ≈ 18 hours

A varint is a variable length int: (base 128 varint, for example)
https://developers.google.com/protocol- ... s/encoding
http://intra.csb.ethz.ch/javadoc/metabo ... arInt.html

Muzozavr wrote:INT: read in Style, (00 - 09) = { Cave,Sand,Wood,Purple,Castle,Jade,Spooky,Garden,Aztec,Custom }
What happens if you put anything higher than 09? If nothing, then make it BYTE.
INT: read in Background, (00 - 0A) ={ Sky,Forest,Walls,Stars,Flat,Water,Lava,Warp,City,Rainbow,Custom }
For some reason, 0B is also stars. 0C and anything higher is pure black. Make it BYTE.

We can do:

(style SHL 4) OR (background)

take the style and store it as a 'high-nibble' and add the background
stored as the 'low-nibble', ... the result is a 'byte'.
( In this case OR can be interpreted as binary addition. )

Muzozavr wrote:I know, hex-editing, "we can't assume" and all... but do those two *really* need to be INT? Can anyone ever need anything above 255x255?
...followed by a 2D (INT) array of Tiles

There will be a few strange cases where people will make levels
as large as 300 in width, and like as small as maybe 2 in height.
So, I'm hesitant to store dimensions as a byte.

If we can't think of anything more clever, then place it in 3 nibbles,
no level will ever achieve a 4095 width! .

Quick bytes and nibbles (some food for thought):

1 byte = 1 byte : max = 255
1 short = 2 bytes : max = 65535
1 int = 4 bytes : max = 4294967295
1 nibble = ½ byte : maximum = 15
2 nibbles = 1 byte : max = 255
3 nibbles = 1½ bytes : max = 4095
4 nibbles = 2 bytes : max = 65535
5 nibbles = 2½ byets : max = 1048575
6 nibbles = 3 bytes : max = 16777215
7 nibbles = 3½ bytes: max = 268435455
8 nibbles = 4 bytes = 1 int : max = 4294967295
9 nibbles = 4½ bytes = 1⅛ INTS : max = 68719476735
10 nibbles = breakfast.

Muzozavr wrote:So it's a bunch of zeroes then? Wow, that's really stupid. Easiest solution is to RLE all the zeroes. Maybe I can think of something more efficient later.

I had this idea where all the strings may be 'packed' together into a
separate 'section' or 'structure',
and can convert the Blitz3D string format ([00 00 00 00]) into an ASCIIZ format.

An ASCIIZ format is simply a string terminated by a zero character (null).
So, there's no need to store the 4-byte length.
http://www.powerbasic.com/support/help/ ... trings.htm

Since all the strings shall be packed and HUFFMANed together, we
take advantage of the fact that it's all English language.

In summary, when applying the HUFFMAN effect, then do it separately
for different parts of the (LV6) level. In other words, there will
be separate "identity" and "weight" tables; The HUFF can be separately
applied to the STRINGS, and then applied to the Tiles/Objects.
Okay, it's just an idea.

The position of the data relates to the SIGN number, so
sign 1 will be stored first, followed by sign 2, followed by sign 3...

Muzozavr wrote:If one can figure out a compact way to write which sign is which, you could group all existing signs together even if it's something like 3, 4, 7, 9 and 19. Then all the zeroes are together and RLE becomes AMAZING for this portion.

Oh yeah, I'm waiting for you to come up with that solution to that dilemma, that's a brain strainer, .

because you want to clobber those zeroes.

If you clobber the zeroes, then you still have the SIGN data, but you lost their 'positions'.

-------------------------------------------------------------------------------
I don't think RLE works well in data where there is little duplication.
Something as simple as this sentence has no "runs" of duplicates.

RLE
http://en.wikipedia.org/wiki/Run-length_encoding

The Huffman Code/algorithm is lots of fun once you understand it.
It can be expanded upon, and you can go beyond the norm.
Though traditionally it is only made up of "identities" and "weights",
you can also add a few functions into it, such as

(a) : "is same as previous number" (=)
(b) : "is one more than previous number" (>)
(c) : "is one less than previous number" (<)
(d) : "is a brand new identity, (then grab it from data table) "
(e) : "" (...and some more ideas probably...)

There's also a way where it's adaptive, so that each time
the weight of a certain identity becomes zero, then the entire HUFFMAN tree changes, (and becomes more efficient).

I like to refer to the HUFFMAN parts as "Identities" and "weights",
If you look in Wikipedia, they also call them "data" and "frequencies".
It's the same, we all use our own language.
http://en.wikipedia.org/wiki/Huffman_coding

Attached here is a prototype BB source code (Blitz3D) program that I
made to study exactly this -- Huffman compression on LV6 { Jan 2012}.
This program focuses on Tiles and Objects only, and it completely skips
the header and stuff like that.
What it does is simply read the bytes, and then it generates
HUFFMAN strings from it, such as "0000", "0001", "0010", "0100".

Ultimately these binary strings will need to be written to a destination
file, but I haven't gotten that far yet.

I know, it's weird.

It's the source code; you can compile it and run it with the Blitz3d Demo.
http://www.blitzbasic.com/Products/blitz3d.php

In order to make it work, you shall need to change a few
variables in the program so that it matches your own Directory system.

Const source_dir$ = "C:\wonderlands\rtw"
LV6_in$ = source_dir + "CustomLevels\LevelStudy.LV6"

VirtLands · Post by **VirtLands** » Wed Aug 22, 2012 12:36 am

Here are explanations for the sample
LV6_Compression_v1.4 program.

The Objects in a sample level called LevelStudy.LV6 were
counted and HUFFMANed by this program.

The more numerous data have shorter "prefix codes".
In the sample LevelStudy.LV6, the most numerous Object is the "none-object" ( or "object-0" )

When there's nothing there, the lack-of-object is also recorded.

Therefore,...
weight=166 and Identity=0, and "prefix code" = "1"

The most numerous objects have the shortest prefix codes,
and the least numerous objects have the longest codes.

There were 166 instances in the Object section where there was nothing there.

You can prove this to yourself with this math:
None_Objects = Area - Real_Objects
166 = 14x14 - Real_Objects
166 = 14x14 - 30

Next,... we see that

there are 3 of Object-9, (=wooden box) : prefix code = "0000"
there are 3 of Object-10, (=metal box) : prefix code = "0001"
there are 3 of Object-11, (= / reflector) : prefix code = "0010"
and so on...

If this compressed LV6 data were to be written to a file
then it would look sort-of like this:

0-0-0-0-0-0-0-0-0-0-0-0-0-0 (14 zeroes because the width of the level=14, each digit is a bit)
0-0-0-0-0-0-0-0-0-0-0-0-0-0 (and again, for the 2nd line)
0-011000-0-01101-01101-01101-0-0001-0001-0001-0-0-0-0

...and so on...

in this case,
01101, represents a rainbow coin
011000, represents Sinky

There is only one Stinky, so Stinky's prefix-code is 6-bits long.
The price to pay for being infrequent is having a longer prefix code.

Oh, in summary, the total length in bits
of the compressed code (of the 14x14 board) = 196 bits.

Which means that to compress this 14x14 arena of Objects takes only 24.5 bytes,
= 25 bytes (with padding), which is about a 1/8 compression ratio. Can it be shrunk even more? I don't know.

Internally, in the LV6, the Objects take 14x14x4 = 784 bytes to store,
since they are stored as INTs,

So, the compression technique is a miracle, since 25 bytes is much less than 784 bytes.

That's about a 1/32 ratio.

Muzozavr · Post by **Muzozavr** » Wed Aug 22, 2012 5:48 am

The Huffman algorithm is indeed epic

but should be used last in the order.

If by that you mean replacing a common 2-byte sequence with a 1-byte "stand-in", then, yeah.

Yes, I do.

Why would we need "multiple passes" ? ?

Because it's one of the few algorithms for which multiple passes is actually useful. Most algorithms reach their minimum size the very first time. This one doesn't. After the first time, you can find the new most common byte pair and shrink it all again... of course, since you also need to build a "table" to be able to decompress it all back, it eventually does reach its minimum size... but not on it's first pass.

There will be a few strange cases where people will make levels as large as 300 in width, and like as small as maybe 2 in height.
So, I'm hesitant to store dimensions as a byte.

Then maybe in a varint? I didn't even know varints existed until you told me.

I don't think RLE works well in data where there is little duplication.
Something as simple as this sentence has no "runs" of duplicates.

I didn't plan on using it on the sign text. I merely wanted to put all the "empty" signs together and then RLE away the zeroes. Since the text data of an empty sign is 00 00 00 00.

Imagine we have 5 signs with text. 15 signs have those zeroes! Then putting together all 15, we could have 60 zeroes in a row.

That would allow you to do RLE and simply write down something like "60 00" which means "60 zeroes". Wouldn't that be epic?!

But yeah, needs a plan to write down sign positions efficiently.

*********

For ASCII conversion:

Three characters were banished from the list, which are:
- (dash, 0x2D), \ (backslash, 0x5C), ' (apostrophe, 0x27)
it looks like white spaces are also banished, need to avoid the 'space' and 'tab' too.

They took 94 most "problemless" characters and used three of them to build the base91 alphabet itself -- the separators and commands. That's why it's 91.

I think, though, that we could use most of extended ASCII, which is what the b-news encoding does, I think: sure, half of the characters will display as something different for everyone, but the ASCII code remains the same and that's what matters in the end. But I really don't know if it's as workable as I think it is.

As for overhead, it is merely a statistic that tells you how much larger the ASCII string is compared to the original binary sequence. Sure, base64 or uuencode are the easiest ways out for the ASCII conversion, but they have an unreasonably large overhead. And since you wanted compression... "no, just no".

tyteen4a03 · Post by **tyteen4a03** » Wed Aug 22, 2012 6:55 am

In reply to all messages:

A magic byte has to be a static byte. It is normally used to identify a file.

Think of byte-pair encoding as factorization - you don't get the most simplest form on your first pass (in most cases)

About 2/3 tiles inside an int: That would mean that the first 2 byte is used as tile name, while the other 2 byte is used as a tile descriptor.

It doesn't mean there is no hope though: We can shrunk down the size that is used to represent both of the worlds: 2 bytes can represent 2^16 = 65535 unique set of data. Now because tile descriptors will probably never grow to that level, we can use only 1 byte to represent it.

About the True-False idea by Muzo, the correct technical term is bitfields.

About RLE: Where are we heading again? Compress the LV6 file into a most compact form or make a new format that can turn itself back into a LV6 file?

Compression methods doesn't bring good results when all elements are pretty much different (which is like 80% of the levels ever made). Have you tried compressing these kind of levels to see what good it brings to the level?

Muzozavr · Post by **Muzozavr** » Wed Aug 22, 2012 7:20 am

Compress the LV6 file into a most compact form or make a new format that can turn itself back into a LV6 file?

As far as I understand, it's both. If you just try to do the latter, the result will be extremely huge.

Think of compressing as a stepping stone for the conversion.

VirtLands · Post by **VirtLands** » Thu Aug 23, 2012 8:36 pm

Why would we need "multiple passes" ? ?

Because it's one of the few algorithms for which multiple passes is actually useful. Most algorithms reach their minimum size the very first time. This one doesn't. After the first time, you can find the new most common byte pair and shrink it all again... of course, since you also need to build a "table" to be able to decompress it all back, it eventually does reach its minimum size... but not on it's first pass.

That's fascinating; I assume that after the 3rd iteration it will no longer
shrink the code, that around the 4th it will start adding bytes instead.

Will HUFFMAN shrink things down even after 10 generations? I don't think so, (I could be wrong.)

In the resultant (compressed) code, if you analyzed the bytes (or bits)
closely, they will appear to be "unrelated random numbers".

In the quest to shrink things down even more, I came up with yet
another strange idea (it's science-fiction,

and I don't know if it will work.)
We can "shake up the bits", till patterns emerge allowing for yet more compression. So here's how>.

Use random numbers to our benefit.
Blitz3d has a convenient built-in random number generator as such:

Rand ([low value],high value)

and the random numbers also have a seeder function:

SeedRnd A,
where A is any reasonable INT.

Now for the same seeder 'A', the Rand function will always generate the same sequence of random numbers, no matter what.
These will be 'predictable' random numbers.

some example and proof, the following code >>>

SeedRnd 13

For z = 1 To 20
Write Rand(0,255)+" "
Next

will always generate this sequence:
147, 118, 150, 35, 164, 206, 118, 52, 209, 104, 234, 36, 136, 112, 238, 92, 4, 24, 148, 249, . . . . .

So, an entire series of random numbers, (as many as you wish),
can be represented by just one 'seed'. (A)
In coding, we don't need to store the near-infinite series, we only need to store one little seed.

By near-infinite, I meant that there is a possibility that the series will repeat itself, which is fine, no harm there.

AND, here's the idea, ... take a series of 'predictable' randoms and
XOR them with the compressed code (HUFFMANed) bytes;
This may shake things up enough to allow repeating bytes or patterns to emerge again, allowing for 'more compression' .

The seeder 'A' can be a byte (or an INT) carefully chosen so that it yields the most compression possibility (after the XOR event).
And this can be applied multiple times, till finally compression happens.
------------------------------------------------------------------------------
If there are other ways to shuffle the (HUFFMANed) code bits around
in a defined pattern, (which is reversible), then that would be good too.
------------------------------------------------------------------------------

tyteen4a03 wrote: About 2/3 tiles inside an int: That would mean that the first 2 byte is used as tile name, while the other 2 byte is used as a tile descriptor.

It doesn't mean there is no hope though: We can shrunk down the size that is used to represent both of the worlds: 2 bytes can represent 2^16 = 65535 unique set of data. Now because tile descriptors will probably never grow to that level, we can use only 1 byte to represent it.

okayyyy, I think you said something very deep there, and
my mind is still wrapping around it.

But I thought that all Tiles can naturally be treated as SHORTs (2 bytes each), only the level-editor requires that they be read in as INTs.

My impression was that the 3rd and 4th bytes were always set to zero,
and that in the rare event that it was not zero that it caused a MAV.

So, since there are 65535 (numeric) Tiles, there is no need for a
65535-byte lookup table (looking into INT Tiles).

unless there are actually some usable 3-byte or 4-byte Tiles, and I forgot
which ones they were.

__ added:__
If certain 2-byte Tiles are statistically popular enough, then
they can be 'replaced' by 1 byte (by means of a look-up table),

...or maybe even replacing them with a 'nibble'. ?
Now that would be super compression, right there.

The 2nd scenario is to allow ALL 65535 (2-byte) Tiles to be replaced
by 1 byte, by creating 255 lookup tables,
Table 0 would be the list of Top-Contenders, Table 1 would be second in line, and so on.
( Each table would hold a list of 255 2-byte Tiles, -- represented by 1 byte),
... then in coding, we would insert a "switch table" command.

When members of Table 0 occur, then compression occurs, especially
because Table 0 contains the most poular Tiles.
When repeated member of any other Table occur (one after another)
then compression also occurs, because the program remembers what
Table N mode it is in, so the "switch Table" command floats, until it needs to be changed again.

And then, maybe do a 'survey' as to which Tiles are the most popular or numerous;

For example, we can assume that the following get into the Top 255 List:

EMPTY, FLOOR 1-5, WALLS 1-4, WATER, ICE 1-5,
{ALL 16 Conveyors}, { ALL 9 Bridges }, ELECTRO 1-4, { ALL 16 Cannons }
LAVA, { ALL 16 Gates }, { ALL 28 BUTTONS }, { ALL 8 Teleporters }
{ ALL 20 Signs }, { Spikes 1-4}, { Fake Walls 1-3 },
{ All 13 of Box Factory }, 3 Transporters, Trampoline,
{ All 16 PUSH-Cannons }, Sticky CUBE, 4 LINK SPHERES, 10 WARP Gates, 3D ICON, 8 SHADOWStinks, ...

That is a count of 198 so far, leaving us with about 57 empty slots yet to be filled.

The survey can be done by an analyzer program or can be done by a human,
simply choosing "what's what".

The top 255 winners get promoted to 1-byte status,
(stored into a 1-byte look-up table), while the others remain 2-byte.

I attached the old hex codes list that was made around 2008.
I think someone made a tidier version, but I don't know where it is.

Well, I'll be back, will try to get a few things done.

tyteen4a03 · Post by **tyteen4a03** » Fri Aug 24, 2012 1:07 am

Randomizing the bytes will only do more harm than good - you will need to store a table of bytes in correct order, and if it's corrupt, the world's over.

The replace-popular-tiles method is better used for serialization, which is why I am confused about the compress/serialization target.

Also, are we tolerating hex-editing? That's the thing I want to know, the extensive tiles added by hex-editing will definitely have potential of breaking the system.

Muzozavr · Post by **Muzozavr** » Fri Aug 24, 2012 10:57 am

tyteen4a03 wrote:Randomizing the bytes will only do more harm than good - you will need to store a table of bytes in correct order, and if it's corrupt, the world's over.

VirtLands wants to store the "random" bytes in a seed. So that using this seed, the "random number" algorithm would simply generate them again. However, compression works with non-randomness. The more random the data is, the harder it is to compress. So I think it simply won't do much good.

The replace-popular-tiles method is better used for serialization, which is why I am confused about the compress/serialization target

Compression AND serialization.

Also, are we tolerating hex-editing? That's the thing I want to know, the extensive tiles added by hex-editing will definitely have potential of breaking the system.

I would like to tolerate hex-editing, which is exactly why I don't like the idea of replacing the more popular tiles with just 1 byte. You'd then need additional header bytes to make the program know whether the next tile is 1 byte or 2 bytes and that would just make a mess.

Still, we really need to figure out whether usable two-byte objects or three-byte tiles even exist. That's really needed.

MyNameIsKooky · Post by **MyNameIsKooky** » Fri Aug 24, 2012 7:29 pm

Muzozavr wrote:Still, we really need to figure out whether usable two-byte objects or three-byte tiles even exist. That's really needed.

Due to the way the game reads the object/tile data, they cannot exist. The object data is 8-bit ("single bytes" like 9A, FF, etc) and the tile data is 16-bit ("double bytes" like 4E62, 0504, etc). What this means is that the object data gets read 8 bits (1 byte) at a time, and makes an output based on those bits. Likewise, the tile data gets read 16 bits (2 bytes) at a time. Trying to make a 2-byte-long object will just result in two different objects. Likewise, trying to make a 3-byte-long tile will probably end up crashing the game due to setting the entirety of the level data ahead of it 1 byte ahead and thus messing up the game trying to read the data.

Regarding Hex-Edited tiles: Unless you have some idea of a super algorithm thing, you're pretty much going to have to leave the tile data uncompressed.

VirtLands · Post by **VirtLands** » Thu Sep 06, 2012 1:56 am

I finally found some time to get back to this.

MyNameIsKooky wrote:.... What this means is that the object data gets read 8 bits (1 byte) at a time, and makes an output based on those bits. Likewise, the tile data gets read 16 bits (2 bytes) at a time. Trying to make a 2-byte-long object will just result in two different objects. Likewise, trying to make a 3-byte-long tile will probably end up crashing the game due to setting the entirety of the level data ahead of it 1 byte ahead and thus messing up the game trying to read the data.
Regarding Hex-Edited tiles: Unless you have some idea of a super algorithm thing, you're pretty much going to have to leave the tile data uncompressed.

Muzozavr wrote:Still, we really need to figure out whether usable two-byte objects or three-byte tiles even exist. That's really needed.

Interesting theories there.

Well, I took the time to study what happens when extending 2-byte Tiles into 3-byte Tiles:

I made a simple study level, {TEST.LV6}, as follows:

____

The code for water is 012C (hex) = 300 (decimal)
Viewing it in the hex editor, it is Little-Endian: 2C:01:00:00 (=an int)

All "normal" Tiles can be represented by the short "2C:01"

So, I hex-edited that int to see what happens if the higher two "00:00"
are NOT zero anymore:

I replaced 2C:01:00:00 with 2C:01:01:01 , and the results are:

A) It makes the level-editor crash when loading into it. [MAV]

B) It runs just fine in the game, and notice the difference is that
the water Tile is replaced by some invisible Tile that Stinky can walk on
(as displayed in the screenshot above.)

I tried some other random replacement values as well such as
2C:01:AA:BB, 2C:01:55:55, and the results are the same,
which is an invisible walkable Tile, (starting with "2C:01:" )
---------------------------
I have not done an exhaustive search of all 3-byte (or 4-byte) possibilities;
I kind of doubt that they all just turn into invisible walk-ables;
There must be some others that are more interesting than that.
If anyone wants to test out the possibilities you're welcome to start.

I also investigated replacing byte-objects with multi-byte objects,
and this is what happens:

A) The level-editor crashes.
B) The game turns them into zero (nothing), as if nothing's there.

which proves that objects are read as INTs, but only values 0-255 are
accepted by the game.

As an example, I turned a prism [10:00:00:00]
into [10:01:00:00], and this caused it to NOT exist in the game.

It would be great if someone makes a level editor that won't crash on the mutants.

VirtLands · Post by **VirtLands** » Thu Sep 06, 2012 2:41 am

The following is some very strange and technical math, that you're not likely to find anywhere else on the web:

This stuff it is real, there are no errors; I spent hours and hours on it.

I continued to study on the compression options, and
took the entire day to study VARints in detail,
to see if this data format is worthwhile for aiding in compression.

I invented some variations of the VARINT, (wow, that's a mouthful).

using the identical logic that describes Varints,
to also invent VarNibbles, and also VarBits of any bit-length.

So the following is very complex:
For an intro to Varints, read here: --> https://developers.google.com/protocol- ... s/encoding

I created a demonstration program to analyze different variable-length
VarBits, (including the already known VarInt).

I somewhat doubt that VarBits variables are very useful for great compression. But it is something new to learn.

VarInts, (and VarNibbles) store their data by allowing the
first 'bit' to be a continuation bit.
(Each first bit does not add value, they are strictly for continuation only. The highest value that can stored in a VarINT segment = 127.

For other VarBits types, the highest storeable value = 2^bits -1

I'll give some nice examples for those of you that are wondering:

Demonstrates using a VARNibble to store various numbers:

I created exact separation points where the number of nibbles changes;
This helps in studying whether or not VarNibbles are useful.

As an example, you can interpolate that 255 lies between
64 and 511, therefore it requires 3 VarNibble segments to store, = 2.5 bytes.

Base = 8 : 4-bit varBit = The VarNibble

0 = 0000
1 = 0001
2 = 0010
3 = 0011
4 = 0100
5 = 0101
6 = 0110

7 = 0111
8 = 1000 : 0001

63 = 1111 : 0111
64 = 0000 : 1000 : 0001

511 = 1111 : 1111 : 0111
512 = 0000 : 0000 : 1000 : 0001

4095 = 1111 : 1111 : 1111 : 0111
4096 = 0000 : 0000 : 0000 : 1000 : 0001

32767 = 1111 : 1111 : 1111 : 1111 : 0111
32768 = 0000 : 0000 : 0000 : 0000 : 1000 : 0001

262143 = 1111 : 1111 : 1111 : 1111 : 1111 : 0111
262144 = 0000 : 0000 : 0000 : 0000 : 0000 : 1000 : 0001

2097151 = 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 0111
2097152 = 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 1000 : 0001

16777215 = 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 0111
16777216 = 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 1000 : 0001

134217727 = 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 0111
134217728 = 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 1000 : 0001

1073741823 = 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 0111
1073741824 = 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 0000 : 1000 : 0001

-1 = 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 1111 : 0011

________________________________________________________________________________
Base = 128 : 8-bit varBits : = also known as a VARINT.

0 = 00000000
1 = 00000001
2 = 00000010
3 = 00000011
4 = 00000100
5 = 00000101
6 = 00000110

127 = 01111111
128 = 10000000 : 00000001

16383 = 11111111 : 01111111
16384 = 00000000 : 10000000 : 00000001

2097151 = 11111111 : 11111111 : 01111111
2097152 = 00000000 : 00000000 : 10000000 : 00000001

268435455 = 11111111 : 11111111 : 11111111 : 01111111
268435456 = 00000000 : 00000000 : 00000000 : 10000000 : 00000001

-1 = 11111111 : 11111111 : 11111111 : 11111111 : 00001111
________________________________________________________________________________
Base = 64 : 7-bit varBits : (There is no name for this one.)

0 = 0000000
1 = 0000001
2 = 0000010
3 = 0000011
4 = 0000100
5 = 0000101
6 = 0000110

63 = 0111111
64 = 1000000 : 0000001

4095 = 1111111 : 0111111
4096 = 0000000 : 1000000 : 0000001

262143 = 1111111 : 1111111 : 0111111
262144 = 0000000 : 0000000 : 1000000 : 0000001

16777215 = 1111111 : 1111111 : 1111111 : 0111111
16777216 = 0000000 : 0000000 : 0000000 : 1000000 : 0000001

1073741823 = 1111111 : 1111111 : 1111111 : 1111111 : 0111111
1073741824 = 0000000 : 0000000 : 0000000 : 0000000 : 1000000 : 0000001

-1 = 1111111 : 1111111 : 1111111 : 1111111 : 1111111 : 0000011

Muzozavr · Post by **Muzozavr** » Thu Sep 06, 2012 11:22 am

Wow. I kinda liked the idea of taking out the zeroes, but... yeah. I would even hesitate to do this with objects, though it's probably possible to do that.

As for tiles, then we just use byte-pair encoding (after all, 00 00 is still going to be the most common pair -- no one's going to hex everything! Plus common objects, like walls, floors, etc.) since it's going to be mega-useful. Probably we'll have to use the "PackBits" version, but that's OK.

For objects... maybe also byte-pair encoding??? Other than taking out extra zeroes, I *really* can't think of anything else. 00 00 is still going to be the most common byte-pair, after all. Even after the "extra" zeroes are taken out.

I really can't think of anything else right now...

As for VARNibbles I'm really not smart enough to write something proper about them -- I wrote a whole section about them as a reply, but deleted it because I realized that everything I wrote in it was totally stupid.

VirtLands · Post by **VirtLands** » Thu Sep 06, 2012 11:04 pm

Muzozavr wrote:I really can't think of anything else right now...

As for VARNibbles I'm really not smart enough to write something proper about them -- I wrote a whole section about them as a reply, but deleted it because I realized that everything I wrote in it was totally stupid.

That would be interesting to see your opinion on VarBits...
I guess this is starting to be an alien topic,

because it's so strange.

So, I did a performance test on converting various sample files
into streams of VarBits numbers
to see if there is much compression achieved.

And here are the results:

input FileName = C:\wonderlands\RTW\CustomLevels\BillyBob239.LV6
normall file size = 1897 bytes
For 2-bit Vars (base- 2), Varbits member count = 3784 : bits count = 7568 : bytes = 946
For 3-bit Vars (base- 4), Varbits member count = 2733 : bits count = 8199 : bytes = 1025
For 4-bit Vars (base- 8 ), Varbits member count = 2439 : bits count = 9756 : bytes = 1220
____________________________________________________________________________________________
input FileName = C:\wonderlands\RTW\CustomTextures\CloverBox\walltop.jpg
natural FileSize = 2548 bytes
For 8-bit Vars (base- 128), Varbits member count = 3637 : bits count = 29096 : bytes = 3637
For 9-bit Vars (base- 256), Varbits member count = 2548 : bits count = 22932 : bytes = 2867
For 10-bit Vars (base- 512), Varbits member count = 2548 : bits count = 25480 : bytes = 3185
____________________________________________________________________________________________
input FileName = C:\wonderlands\RTW\screenshot.bmp
natural FileSize = 1440054 bytes
For 8-bit Vars (base- 128), Varbits member count = 1650389 : bits count = 13203112 : bytes = 1650389
For 9-bit Vars (base- 256), Varbits member count = 1440054 : bits count = 12960486 : bytes = 1620061
____________________________________________________________________________________________
input FileName = C:\wonderlands\RTW\CustomModels\aliens\model1.3DS
natural FileSize = 327929 bytes
For 2-bit Vars (base- 2), Varbits member count = 1418140 : bits count = 2836280 : bytes = 354535
For 3-bit Vars (base- 4), Varbits member count = 805485 : bits count = 2416455 : bytes = 302057
For 4-bit Vars (base- 8 ), Varbits member count = 616292 : bits count = 2465168 : bytes = 308146
____________________________________________________________________________________________

In summary, the only file that showed an immense amount of
compression was the LV6 file { BillyBob239.LV6 } because it has a lot of zeroes in it.
Using 2-bit VarBits it got squeezed down from 1897 bytes to 946 bytes, (compressed to about 50% of original size).

The BMP and the JPG actually increased slightly in size using Varbits, (no compression achieved.)

The .3DS file shrunk a little, using 3-bit varBits (to 92% of original).

I don't think that I'll ever use any of this VarInts or VarBits stuff;
It was just something to learn.

If anyone wants to see the VarInT source code files, I've attached it here.

My next post will be on HUFFMAN STUFF only; I won't mention VarBits stuff anymore.

Muzozavr · Post by **Muzozavr** » Thu Sep 06, 2012 11:17 pm

My idea that I think I can clearly describe is that your varnibbles (and varbits in general) are one half of Huffman coding.

They're a lot like the prefix codes in Huffman, minus the whole "probability" element, which makes Huffman that much more efficient.

So that's the reason why I don't think this is as effective as it seems. We are still applying Huffman near the end, after all.

VirtLands · Post by **VirtLands** » Fri Sep 07, 2012 2:07 am

Okay, now we're talking about the HUFFMAN CODE.

There is some sample programming code on the HUFFMAN on this link:
http://rosettacode.org/wiki/Huffman_coding

Wikipedia definition:
http://en.wikipedia.org/wiki/Huffman_co ... man_coding

Don't worry about the coding though, I shall program it all in Blitz3D which is better.

The resultant compressed (HUFFMANed) bits can be preceded by the Symbol & Weight Table.
It is best to include the Symbol & Weight Table along
with the compressed bits because this allows for the recreation of the exact Huffman Tree.

You need the S&W table to decode the bits.

(Storing the Huffman Tree instead of the S&W Table is a clumsy idea
and would probably take up even more space than the S&W table.)

The S&W table also allows for additional compression by allowing
for the option of using Adaptive Huffman Coding, which means that
the Huffman Tree can rearrange itself, and change according to
changing weights or frequencies.

A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates.

Here's an idea of mine:
The symbols must be stored in the S&W table in the same order
as they appear in the source file, (not storing repeats) allowing for
the usage of a "Grab New Symbol" function. I'll explain in a moment.

For simplicity, pretend that the source is simply the following:

"ABCDABXYEEXACAAA"

then the Symbol & Weight Table will look like this:

s:w, s:w, s:w, ... ( symbols followed by weights )

A:6, B:2, C:2, D:1, X:2, Y:1, E:2

( The symbols & weights can perhaps be shortened with VarBits, or it
can be predetermined ahead of time that they are bytes or ints. )

In the special cases of Stinky, Loof, Peegue, ... where we know that the weight=1,
(because there is only one Stinky in the game),
then we can do a special case and skip the weight, because we already know that it is =1.

Example scenario: zero-object:4, Stinky, coin:5, zero-object:6, Loof, box:4, torch:1, zero-object:4, .....

So, the software will know to NOT read a weight after reading "Stinky";
That is an example of optimization and data compression.
("zero-object" simply means no object.)

The following quote basically explains how to gradually create the Huffman Tree.

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n-1 internal nodes. A Huffman tree that omits unused symbols produces the most optimal code lengths.

Type Tnode
Field P.tnode ; parent node
Field L.tnode ; left node
Field R.tnode ; right node
Field t ; = Tnode type, that is, 0=function, 1=numeric type
Field v ; symbol value
Field w ; weight (=number of times that symbol occurs)

Field pre.tnode, nex.Tnode ; to be used for sorting purposes only,.. 'previous' , 'next'

Field bit$ ; this is the Huffman bit code, described in a binary fashion such as "001100"...
End Type

A "node" type will be a structure that has connections to both
a left-node and a right-node, (=child nodes).

I'll try to explain the process of HUFFMAN Tree construction:

To begin with, Each s:w member (Symbol:Weight)
is stored in its own NODE memory. IT has a summary weight equal
to W, and it at first has no children.

.Start
Choose from the list the two lowest weights, and these shall be joined
together as children of a brand-new node, and this brand-new node
shall have an equivalent weight equal to the sum of the children's weights.

Go to .Start, and repeat the process again until
1 NODE becomes the parent of all nodes (= the Head NODE )

Note: Nodes cannot be joined together that have already become child-nodes of
a parent node;
These become excluded from the "seek the two lowest" list.

That was exhausting to explain; Hope that makes sense.

Something I invented:
For further HUFFMAN optimization we can include
some functions (in addition to regular numeric symbols).

..and just like numbers, the functions have their weights too.

{ f1:w, f2:w, f3:w, f4:w,.... }

The functions can be added to the HUFFMAN Tree, by replacing
each function with a 'symbol' and therefore each function
shall also have a "prefix-code".
It will be very complicated, especially if all 256 symbols get used up.

The functions can be as follows:

A) (=) is a duplicate of previous number
B) (n) Grab a new symbol from the S&W table
C) (+) is one more than the previous
D) (-) is one less than the previous
E) (f) is a follower of the previous

{ The functions and their weights shall be stored in an additional S&W table to be appended to the regular (numeric) S&W table.
Place order, instead of symbols can represent 'functions' since they are abstract. }

In the case of (A), we shall calculate the number of duplicate situations,
such as when a number = the previous.

In the case of (B), it's weight shall be identical to the number of unique symbols.

(C) and (D) are optional and can be included if there are significant
series such as the following:
3,4,5,6,7,8,20,25,1,2, 12,11,10,9,8,7,6... which gets converted to 3+++++,20,25,1+,12------

( The "+"s and "-"s become repetitions, and allow for compression.)

Optionally, if there is only one X, Y, Z in the source such as

aaaaaaXaaaaaaYaaaaaabbbbbZaaaa then this can
be converted to:

a=====na=====na=====b====na===

The 'n' is repeated about 3 times (or more), because it replaced X,Y,Z...!
The reason for this is that it retrieves a new symbol from the
S&W table which has data arranged in the same order as each symbol's
appearance in the source.

If a second 'X' were to appear later on, then we can not use the 'n' function to represent it;
We can only use the 'n' function when a brand new symbol is about to appear.

Another example: If there are only one each of A,B,C,D,E,F,
(or they make premiere appearances)
then "ABCboxDEF" gets converted to "nnnnBOXnnnn"

because the S&W table has "ABCDEFGHIJ..."

The "=" function is obviously very useful because ALL repetitions can be
replaced by it, even if they are not the SAME repetitions.
A repetition is a repetition even if the starting symbol is new.

If you have ideas on likely "symbol functions" that I left out, then let me know.

As each symbol's weight count diminishes (because it is used up), occasionally a symbol's weight becomes zero, and
therefore it disappears from the HUFFMAN tree, allowing the HUFFMAN tree
to re-organize itself and all remaining "prefix-codes" become shorter.

Example: Source = "aaaaaaaaaaaBBBBBBBBcccDDDEEDD"
Because there are so many "a"s and "B"s, these will mostly likely be
assigned the shortest prefix codes (such as "0", and "10" ),
then the "D"s and "E"s will have longer prefix codes.
As the "a"s and "B"s get used up, and there are eventually none left,
then the HUFFMAN Tree reorganizes itself, and ultimately the
"D"s and "E"s shall have the shortest codes.

Wouldn't it be funny if the S&W table (which is required for deciphering)
is also HUFFMANed into the stream of compressed bits, meaning
that you need the decoding table (the S&W) to decode it,
but the decoding table is also compressed and stored "inside".

That would be some awesome cryptogram, where you'd guess
at S&W entries, ... ; I don't know if that's doable.

It's kind of like storing the passcode to an encrypted Zip file
inside that Zip file.

Muzozavr · Post by **Muzozavr** » Fri Sep 07, 2012 9:17 am

Wouldn't it be funny if the S&W table (which is required for deciphering)
is also HUFFMANed into the stream of compressed bits, meaning
that you need the decoding table (the S&W) to decode it,
but the decoding table is also compressed and stored "inside".

First reaction upon waking up in the morning: "that makes sense, but still, what in the heck".

I still can't say I'm actually understanding ALL of the "tree construction" process, but your explanations make a byte more sense than the ones on Wikipedia.

VirtLands · Post by **VirtLands** » Fri Sep 21, 2012 10:46 pm

Hi again,

As a side-topic, I decided to test all permutations of 2-Byte Tile CODES in levels to see what they reveal.

For 2-byte tiles, there are 65536 Tile possibilities, (0 to 0xFFFF ).

I created a program to automatically generate these level (LV6) files,
All permutations are represented by spanning them across several levels.

It took 328 (LV6) levels at 200 (=20x10) unique blocks each to represent all 63336 Tile possibilities ( in 2-bytes per tile ).
It took 2 days of coding to invent and get the generating program to run just right.

My suspicions confirmed that the majority of the Tiles turned out to be "invisible walkables", (invisible tiles that Stinky can walk on).

The rest of them were "valuable game Tiles", with some of them being bad tiles that cause "MAVs"
(MAV=memory access violation!)

I attached the level files here (all ~328 of them), so that anyone can browse the possibilities.
For convenience, I divided it into two groups, that is 'Important Tiles' and the boring 'invisible walkables'.

( I'm only 90% certain that the 'invisible walkables' zip has no interesting stuff in it, as it is exhausting checking all 328 levels,
so, I may have missed something.

)

I also tested (a large subset of) 3-BYTE codes and 4-BYTE codes.

And the conclusion is that assigning non-zero values to 3rd or 4th byte doesn't do anything special except turn everything into
an "invisible walkable" tile, and that's not much fun.

But, potentially, this can be used to store extraneous data of some sort, since "invisible walkables" are not good for much else.

[some crazy stuff here:

]
Also, just to point it out, I think you can also store an entire level within a SIGN (of another level), by storing that level in SIGN string data.
The SIGN would be unusually large for a SIGN, but logically it should not have any adverse effect on the 'holder' level, unless you try to
display that SIGN, in which case I'm positive it would result in a MAV.

You can also store a level within the 'zeroes' of another level, by perhaps carefully choosing which 'zeroes' to replace, the downside
is potentially converting a regular level into an "invisible walkable"..

Then you can share your one level and tell your firends, "hey it's really 10 levels, it's just that I crammed the other 9 levels inside the SIGNS."
You'll need a special data extractor for that one.

---

Each level in the zip uses the same template, which is a 20x10 block of tiles surround by a normal perimeter of FLOOR 1-type, to make it easy
for Stinky to move around.
There are 10 SIGNS (on the left side) that give info about the range of values of those 20 Tiles following, (values are in hex and decimal).
You should basically ignore the 11th Sign; I didn't plan that part very well.

There are objects placed in each level to study interactions with tiles.
The name of the level.LV6 itself indicates the range of values within the
20x10 block, that's elementary.

The filenames that start with "MAV_####-####.LV6" are obviously the ones that I discovered cause MAVs, so please don't load those into game.
Later on when I have more time, I will figure out exactly which tiles cause the MAVs, because I know that it can't be all 200 tiles that do it.

Also, I might redo this project and create smaller tile groups of about 10
instead of huge blocks of 200. The reason is that in a group of 200, when one tile causes a MAV, then the whole level is useless.

Enjoy the data.

Muzozavr · Post by **Muzozavr** » Sun Sep 23, 2012 11:05 am

Since three-byte or four-byte objects are going to be incredibly rarely used, I'm still hoping for some clever way to account not only for the possibility itself (so that such levels can be compressed properly) but also for the rarity of said possibility (so that we can still save space on those zeroes)

Took a look at the "interesting" zip. Some mutations I know are either not there, or are in the MAV levels -- the "instant death" tile, the A7:03 glitch transporter, the other glitch transporter that leaves a row of invisible trampolines behind, etc...

The first "MAV" level (the one with all the bridges) actually loaded for me... for some reason. The rest didn't, though.

The gate/button level is by far the most mind-blowing, allowing for potentially incredible unfair levels -- walkable gates that become unwalkable once certain buttons are pressed? Square and round buttons that act like timers? White gates that don't open with any of the white buttons? White gates that open with an INVISIBLE button? White gates that open with a "special" COLORED button, like yellow??? That's pure awesome.

VirtLands · Post by **VirtLands** » Sun Sep 23, 2012 11:31 pm

Muzozavr wrote:Since three-byte or four-byte objects are going to be incredibly rarely used,
I'm still hoping for some clever way to account not only for the possibility itself (so that such levels can be compressed properly)
but also for the rarity of said possibility (so that we can still save space on those zeroes)

For sure, I believe that there can be several modes of complexity
built into a compressor program.
A simple scan test can determine if a level is A or B :

A: level is categorized as "NORMAL, 2-byte Tiled"
or
B: level is categorized as "UNUSUAL, using 3 or 4-byte Tiles"

In the case of B, then perhaps a sacrifice shall be made in compression efficiency by letting all those bytes become HUFFMANed.
( A large sequence of zeroes will still shrink down a lot.)

I just took a closer look at that "gate\button" level (03E8_04AF(gates & buttons).LV6) and was
delighted by the abundance of weird buttons and gates.
Including that one with the yellow buttons.

I shall get back to it tomorrow, and will also check out that A7:03 you mentioned.

By the way, the 03E8_04AF(gates & buttons).LV6
causes a MAV for me on exiting the level, but not on starting the level.

I'm starting to think that I should have researched fewer tiles per level.
Should have done about 20 per level instead of 200 per level.
Oh well, that's a project for another day.

Thanks Muzozavr for your ideas.

LV6 compression and Codes

LV6 compression and Codes

compression continues...

LV6 Reader Code

Re: LV6 Reader Code

LV6 compression continuued.

LV6 compression continuued.

quotable HUFFs

VarBits

Varbits

Files to VarBits

HUFFMAN

2-Byte Tile Codes

exotic buttons and gates