I'm totally not a programmer, so I can't implement any of the stuff that I'm going to say here. I don't even know how hard it is to implement anything below.
The "always there" parts of LV6 could be taken out of the file and put into the program itself, but I'm assuming that (other than the header, maybe) we cut as little as possible in phase 1 of the plan.
Also byte-pair encoding is usually done in multiple passes, while many other algorithms are single-pass.
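For anyone who ends up implementing this, here is a rough Python sketch of what "multiple passes" means for byte-pair encoding. Each pass replaces the most frequent adjacent byte pair with a byte value that does not occur in the data, and the passes repeat until no pair occurs at least twice. The function names and the rule-list format are made up for illustration; a real format would also have to store the rules in the file.

```python
from collections import Counter

def bpe_pass(data: bytes):
    """One pass: swap the most frequent adjacent pair for an unused byte value."""
    unused = next((b for b in range(256) if b not in data), None)
    pairs = Counter(zip(data, data[1:]))
    if unused is None or not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return data, None  # nothing worth replacing
    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(unused)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), (unused, a, b)

def bpe_compress(data: bytes):
    """Repeat passes until no pair repeats; returns packed data and the rules."""
    rules = []
    while True:
        data, rule = bpe_pass(data)
        if rule is None:
            return data, rules
        rules.append(rule)

def bpe_expand(data: bytes, rules):
    """Undo the passes in reverse order."""
    for token, a, b in reversed(rules):
        data = data.replace(bytes([token]), bytes([a, b]))
    return data
```

On a toy input like C8 00 C8 00 C8 00 01 00 this takes two replacing passes, which is exactly the extra work that single-pass schemes such as RLE avoid.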
Comments in BOLD.
VirtLands wrote:*snip*
C8 00 00 00 which is the code for WALL 1, and so on.
So since you always gave us hex mutations in the form of byte pairs, I take it that the third and fourth bytes are always zero? What happens if they aren't zero?
Same for objects, actually. You gave us hex codes as single bytes. So for objects, the last three bytes are all zero? What happens if they aren't zero? (I'll return to this with the keyphrase "assuming zeroes")
*snip*
In Summary, the data in an LV6 is stored in this order:
__________________________________________________
ID STRING: "Stinky & Loof Level File v6", (or v4, or v5)
Since it's identical for all levels (except for the version number) I suggest we remove it from the file completely and replace it with some short-short header (like "KZX") so that the program knows where to put the header back.
FileName STRING: "FRANKIE" (without the .LV6)
Keep as-is. We can't make any assumptions and the string is too short for general algorithms.
INT: (mysterious 4 bytes that I have never deciphered,
it's probably a randomly generated serial # used to make the file "unique".)
Keep as-is. They are probably needed for score-keeping. We don't want any problems.
Title STRING: "Visiting Frankie"
Most likely, keep as-is. No assumptions to be made and probably too short for general-purpose algorithms. If the title is long, maybe something can work, but I doubt it.
INT: always set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomHouse
INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomModel
INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomTexture
INT: set to either 1 or 0 (I'm not really sure what this is for.)
STRING: CustomBackground
What happens if it's set to something else? If nothing (other than a MAV), then they are essentially four bits. So we can put all four INTs in a single BYTE. The names of custom houses/models/textures/backgrounds are to be kept as-is.
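A minimal Python sketch of the "four INTs in one BYTE" idea, assuming the four values really are only ever 0 or 1. The function names and the bit order (house in bit 0, background in bit 3) are arbitrary choices for illustration. If a hex-edited level carries any other value, the sketch refuses to pack it, so a real compressor would fall back to storing the original INTs.

```python
def pack_flags(house: int, model: int, texture: int, background: int) -> int:
    """Pack four 0/1 flags into the low four bits of one byte."""
    flags = (house, model, texture, background)
    if any(f not in (0, 1) for f in flags):
        # Assumption violated: keep the original INTs for this level instead.
        raise ValueError("flag outside 0/1")
    byte = 0
    for bit, f in enumerate(flags):
        byte |= f << bit
    return byte

def unpack_flags(byte: int):
    """Recover the four flags in the same order they were packed."""
    return tuple((byte >> bit) & 1 for bit in range(4))
```

This turns 16 bytes into 1 whenever the assumption holds.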
INT: read in Timer seconds (24000 seconds = 10 hours)
Keep as-is. Due to hex-editing, we can't assume it's convertible to a byte value.
INT: read in Style, (00 - 09) = { Cave,Sand,Wood,Purple,Castle,Jade,Spooky,Garden,Aztec,Custom }
What happens if you put anything higher than 09? If nothing, then make it BYTE.
INT: read in Background, (00 - 0A) = { Sky,Forest,Walls,Stars,Flat,Water,Lava,Warp,City,Rainbow,Custom }
For some reason, 0B is also stars. 0C and anything higher is pure black. Make it BYTE.
INT: level width
INT: level height
I know, hex-editing, "we can't assume" and all... but do those two *really* need to be INT? Can anyone ever need anything above 255x255?
...followed by a 2D (INT) array of Tiles
Assuming zeroes, one could halve the size instantly by removing these extra zeroes. Then the INTs get converted to byte pairs. Then we can do a modified RLE that works with a byte pair instead of a byte. Or the array can be subjected to byte-pair encoding. I don't know what would be more efficient. We could do both, but then I don't know what order is best.
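The "drop the zeroes, then RLE on byte pairs" route could look roughly like this in Python. It assumes each tile INT is 4 bytes, little-endian (consistent with C8 00 00 00 above), and that the upper two bytes are zero; if they are not, it raises an error rather than guessing. The run format (1 count byte followed by a 2-byte value) and the function names are invented for the sketch.

```python
import struct

def tiles_to_pairs(raw: bytes):
    """Read 4-byte little-endian INTs and keep only the low 16 bits."""
    values = [v for (v,) in struct.iter_unpack("<I", raw)]
    if any(v > 0xFFFF for v in values):
        # The "assuming zeroes" shortcut does not hold for this level.
        raise ValueError("upper bytes are not zero")
    return values

def rle_pairs(values):
    """RLE over 16-bit values: each run is stored as count (1 byte) + value (2 bytes)."""
    out = bytearray()
    i = 0
    while i < len(values):
        run = 1
        while i + run < len(values) and values[i + run] == values[i] and run < 255:
            run += 1
        out += struct.pack("<BH", run, values[i])
        i += run
    return bytes(out)

def unrle_pairs(packed: bytes):
    """Expand the runs back into the list of 16-bit tile values."""
    values = []
    for run, value in struct.iter_unpack("<BH", packed):
        values.extend([value] * run)
    return values
```

Six tiles (four of one kind, two of another) go from 24 bytes to 6 here. Whether this beats byte-pair encoding on real levels would have to be measured on actual files.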
...followed by a 2D (INT) array of Objects
Assuming zeroes, one could make this 1/4 of the original size simply by removing these extra zeroes. There are still going to be a lot of zeroes left to signify a "null" object. Since object rows are also common, a standard RLE would probably be a good idea. Afterwards, though, I'm at a loss for anything sneaky. Maybe it will be time for Huffman coding then?
Followed by an array of 20 SIGN strings:
SIGN$[1], SIGN$[2], SIGN$[3], SIGN$[4]....
Remember that when a SIGN does not exist it is simply stored as [00 00 00 00]
So it's a bunch of zeroes then? Wow, that's really stupid. Easiest solution is to RLE all the zeroes. Maybe I can think of something more efficient later.
The position of the data relates to the SIGN number, so sign 1 will be stored first, followed by sign 2, followed by sign 3...
If one can figure out a compact way to write which sign is which, you could group all existing signs together even if it's something like 3, 4, 7, 9 and 19. Then all the zeroes are together and RLE becomes AMAZING for this portion.
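One possible "compact way to write which sign is which" is a 20-bit mask stored in 3 bytes, with bit N set when sign N+1 exists, followed by only the signs that are present. This Python sketch works on already-parsed sign texts and treats an empty byte string as "no sign"; how the text itself is stored in the file is left out, and the function names are made up.

```python
def pack_signs(signs):
    """signs: list of 20 byte strings, b"" meaning the sign does not exist."""
    if len(signs) != 20:
        raise ValueError("expected exactly 20 sign slots")
    mask = 0
    present = []
    for i, text in enumerate(signs):
        if text:
            mask |= 1 << i
            present.append(text)
    # 20 bits fit in 3 bytes; the present signs follow in slot order.
    return mask.to_bytes(3, "little"), present

def unpack_signs(mask_bytes: bytes, present):
    """Put each stored sign back into its original slot."""
    mask = int.from_bytes(mask_bytes, "little")
    remaining = iter(present)
    return [next(remaining) if (mask >> i) & 1 else b"" for i in range(20)]
```

With this layout the empty slots are not stored at all, so there is nothing left to RLE for this portion. A level using only signs 3, 4, 7, 9 and 19 costs 3 bytes of mask instead of 15 empty 4-byte entries.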
..and next is the MUSIC INT, which is the last data in the file.
INT: Music
Value "0" does not always default to track 6. If you play such a level before playing any others, it will be silent. Used by Mark in "The Song of Silence". So the two values have to be distinguished. Again, needs to be made BYTE.
(Also, since there are only seven distinct values, one could put this in the same byte as the four INTs for custom houses/models/textures/backgrounds. Then we'll need 7 bits out of 8 in that byte.)
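A short Python sketch of that shared byte, assuming the four custom flags have already been reduced to a 4-bit value and that the seven music values are 0 through 6 (the exact numbering is an assumption here). The layout, flags in bits 0-3 and music in bits 4-6, is arbitrary.

```python
def pack_meta(flag_bits: int, music: int) -> int:
    """Combine four flag bits and a 3-bit music value into one byte (bit 7 unused)."""
    if not 0 <= flag_bits <= 0b1111:
        raise ValueError("flag_bits must fit in 4 bits")
    if not 0 <= music <= 6:
        # Assumed range; anything else means the INT must be kept as-is.
        raise ValueError("music outside the assumed 0..6 range")
    return flag_bits | (music << 4)

def unpack_meta(byte: int):
    """Split the byte back into (flag_bits, music)."""
    return byte & 0b1111, (byte >> 4) & 0b111
```

Music value 0 stays distinct from 6, so the silent-level trick survives the round trip.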
*snip*
And afterwards, we're Huffman-coding the whole thing and then doing some conversion to an ASCII string. Right?
That's my quick rundown, anyway... I hope that someone else joins the discussion and improves this.
Additions:
1) Since our compression format is going to be quite screwy, we need some sort of a separator symbol (or string) that separates different parts of this structure. That separator has to be something that cannot appear anywhere else in the level. I propose an INT separator that looks like this:
[any-byte-that-corresponds-to-a-control-character][FF][AA][DD]
Control characters should not occur in the level's text strings, so we can be safe from the separator appearing in the level title or somewhere else like that. As for the rest of the format, as far as I understand, FFAADD is a byte triplet that can't appear anywhere in a working (non-MAVing) level. And it's easy to remember.

(And since we're doing the ASCII conversion on a later stage, the control character won't be a problem)
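A small Python sketch of how such a separator could be used. The control byte 0x1E is just one arbitrary pick for the "any control character" slot. One thing the sketch makes explicit: once sections are compressed, RLE counts and similar bytes could in principle form the same four bytes, so it checks every section before joining instead of trusting that the pattern never occurs.

```python
# Hypothetical separator: one control byte (0x1E chosen arbitrarily) + FF AA DD.
SEPARATOR = bytes([0x1E, 0xFF, 0xAA, 0xDD])

def join_sections(sections):
    """Join byte sections with the separator, refusing any section that contains it."""
    for section in sections:
        if SEPARATOR in section:
            raise ValueError("separator occurs inside a section")
    return SEPARATOR.join(sections)

def split_sections(blob: bytes):
    """Split a joined blob back into its sections."""
    return blob.split(SEPARATOR)
```

Because the first byte of the separator does not repeat inside it, a false match cannot straddle a section boundary once each section has passed the check.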
2) One could make an optional flag for lossy compression with "Lose decorative object variations". Then all walls/mushrooms/lampposts/etc would become WALL 1 and all floors would become FLOOR 1. (Exception: the invisible "wall X" is often used as part of custom models, so it stays as-is. "Deep wall" stays as-is, too, for obvious reasons) This would make byte-pair encoding much more efficient for those parts.
3) Another option is to trim away text from signs that aren't actually in the level. If you put a sign on a level, enter some text in it and then remove the sign, the text stays in the file, which, IMO, is kinda weird. Some people might want to embed information in this way, though. So this needs to be optional.
4) In "The Reader Is Warned" Mark has edited the structure of one of his signs to make the game crash as soon as the player tries to read it. We have to keep this possibility in mind. Actually, in general, we should try to explore what happens if certain format specifications are violated with hex-editing. If the result loads at all, we have to keep it in mind.
5) For RLE, I just found out about its PackBits version, which might be better for some of our situations:
http://en.wikipedia.org/wiki/PackBits
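For reference, a compact Python sketch of PackBits as it is commonly described: a header byte 0..127 means "copy the next header+1 bytes literally", a header byte 129..255 means "repeat the next byte 257-header times", and 128 is skipped. The function names are made up, and this encoder is only one of several valid ways to choose runs.

```python
def packbits_encode(data: bytes) -> bytes:
    """Encode with PackBits: runs of 2..128 equal bytes, literals of 1..128 bytes."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            out.append(257 - run)  # same as the signed header 1 - run
            out.append(data[i])
            i += run
        else:
            start = i
            i += 1
            # Extend the literal until a repeated pair starts or 128 bytes are reached.
            while i < n and i - start < 128 and not (i + 1 < n and data[i] == data[i + 1]):
                i += 1
            out.append(i - start - 1)
            out += data[start:i]
    return bytes(out)

def packbits_decode(packed: bytes) -> bytes:
    """Decode a PackBits stream back into the original bytes."""
    out = bytearray()
    i = 0
    while i < len(packed):
        header = packed[i]
        i += 1
        if header < 128:
            out += packed[i:i + header + 1]
            i += header + 1
        elif header > 128:
            out += bytes([packed[i]]) * (257 - header)
            i += 1
        # header == 128 is a no-op
    return bytes(out)
```

The advantage over plain count+value RLE is that stretches without repeats cost one header byte per up to 128 bytes instead of doubling in size. An object row like 40 nulls, two different objects, then 20 nulls packs into 7 bytes.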