Update 2 (newest)
Here’s the situation:
A foreign application is storing zlib deflated (compressed) data in this format:
78 9C BC (...data...) 00 00 FF FF
– let’s call it DATA1
If I take the original XML file and deflate it in Java or Tcl, I get:
78 9C BD (...data...) D8 9F 29 BB
– let’s call it DATA2
- The last 4 bytes of DATA2 are definitely the Adler-32 checksum, which in DATA1 is replaced by the zlib sync-flush marker (why? I have no idea).
- The 3rd byte differs by a value of 1.
- The (...data...) is equal between DATA1 and DATA2.
- Now the most interesting part: if I update DATA1 by changing the 3rd byte from BC to BD, leave the last 4 bytes untouched (so 0000FFFF) and inflate this data with new Inflater(true) (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/zip/Inflater.html#%3Cinit%3E(boolean)), I am able to decode it correctly! (This works because the Inflater in this mode requires neither the zlib header nor the zlib Adler-32 checksum.)
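For reference, the patching experiment described above can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are made up, the "foreign" data is simulated with Java's own Deflater using SYNC_FLUSH, and the two zlib header bytes are dropped before inflating, because raw mode (new Inflater(true)) expects no header.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper names; a sketch of the experiment, not a general solution.
final class LastBlockBitPatch {

    // Stand-in for the foreign application: zlib header, deflate data, sync flush.
    // The buffer is sized generously for small demo inputs only.
    static byte[] deflateWithSyncFlush(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        byte[] buf = new byte[input.length + 64];
        int n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Drops the 2-byte zlib header, sets the low bit of the first deflate byte
    // (the "BC -> BD" change) and inflates the result as a raw deflate stream.
    static byte[] patchAndInflateRaw(byte[] zlibData) {
        byte[] raw = Arrays.copyOfRange(zlibData, 2, zlibData.length);
        raw[0] |= 0x01; // mark the first block as the last block
        Inflater inflater = new Inflater(true); // true = raw deflate, no header or checksum
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // no further progress possible
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid deflate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<doc><item>a</item><item>b</item></doc>".getBytes(StandardCharsets.UTF_8);
        byte[] stored = deflateWithSyncFlush(xml);
        System.out.println(new String(patchAndInflateRaw(stored), StandardCharsets.UTF_8));
    }
}
```

The sketch only holds when all of the data sits in a single deflate block, which is the case for small inputs like the one in main.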
Questions:
- Why does changing BC to BD work? Is it safe to do in all cases? I checked a few cases and it worked each time.
- Why would any application output an incorrect (?) deflate value of BC at all?
- Why would the application start with a zlib header (78 9C), but not produce a compliant zlib structure (sync flush instead of Adler-32)? It’s not a small hobby application, but a widely used business app (I would say tens of thousands of business users).
After further analysis it seems that I have a zlib-compressed byte array that is missing the final checksum (Adler-32).
According to RFC 1950, a correct zlib stream must end with the Adler-32 checksum, but for some reason the dataset I work with has zlib bytes that lack that checksum. It always ends with 00 00 FF FF, which in the zlib format marks a sync flush. For a complete zlib object there should be an Adler-32 afterwards, but there is none.
Still it should be possible to inflate such data, right?
As mentioned earlier (in original question below), I’ve tried to pass this byte array to Java inflater (I’ve tried with one from Tcl too), with no luck. Somehow the application that produces these bytes is able to read it correctly (as also mentioned below).
How can I decompress it?
Original question, before update:
Context
There is an application (closed source) that connects to MS SQL Server and stores a compressed XML document there in a column of image type. This application – when requested – can export the document into a regular XML file on the local disk, so I have access to both the plain text XML data and the compressed one, directly in the database.
The problem
I’d like to be able to decompress any value from this column using my own code connecting to the SQL Server.
The problem is that it is some kind of weird zlib format. It does start with the typical zlib header bytes (78 9C), but I’m unable to decompress it (I used the method described at Java Decompress a string compressed with zlib deflate).
The whole data looks like 789CBC58DB72E238...7E526E7EFEA5E3D5FF0CFE030000FFFF (of course the dots mean more bytes inside – 1195 bytes in total).
What I’ve tried already
What caught my attention was the ending 0000FFFF, but even if I truncate it, the decompression still fails. I actually tried truncating every possible number of bytes from the end (in a loop, chopping off one more byte per iteration), and none of the iterations worked either.
I also compressed the original XML file into zlib bytes to see what it looks like. Apart from the 2 zlib header bytes and maybe 5-6 bytes after them, the rest of the data was different. The number of output bytes was also different (smaller), but not by much (about 1180 vs 1195 bytes).
Answer
The difference on the deflate side is that the foreign application is using Z_SYNC_FLUSH or Z_FULL_FLUSH to flush the data provided so far to the compressed stream. You are (correctly) using Z_FINISH to end the stream. In the first case you end up with a partial deflate stream that is not terminated and has no check value. Instead it just ends with an empty stored block, which results in the 00 00 ff ff bytes at the end. In the second case you end up with a complete deflate stream and a zlib trailer with the check value. In that case, there happens to be a single deflate block (the data must be relatively small), so the first block is the last block, and is marked as such with a 1 as the low bit of the first byte.
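To make the bit layout concrete, here is a small sketch that decodes the first byte of a deflate block as laid out in RFC 1951: the lowest bit is the last-block flag (BFINAL) and the next two bits are the block type (BTYPE). The class and method names are made up for illustration.

```java
// Hypothetical helper; decodes only the 3 header bits of a deflate block.
final class DeflateBlockHeader {

    // BFINAL: lowest bit of the first byte of the block.
    static boolean isFinalBlock(int firstByte) {
        return (firstByte & 0x01) != 0;
    }

    // BTYPE: the next two bits (0 = stored, 1 = fixed Huffman, 2 = dynamic Huffman).
    static int blockType(int firstByte) {
        return (firstByte >> 1) & 0x03;
    }

    public static void main(String[] args) {
        for (int b : new int[] {0xBC, 0xBD}) {
            System.out.printf("%02X: BFINAL=%d BTYPE=%d%n",
                    b, isFinalBlock(b) ? 1 : 0, blockType(b));
        }
        // prints:
        // BC: BFINAL=0 BTYPE=2
        // BD: BFINAL=1 BTYPE=2
    }
}
```

So BC and BD describe the same kind of block (dynamic Huffman) and differ only in the last-block flag, which matches the one-bit difference between DATA1 and DATA2.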
1. What you are doing is setting the last block bit on the first block. This will not work in general, since the stream may have more than one block. In that case, some other bit in the middle of the stream would need to be set.
2. I’m guessing that what you are getting is part, but not all of the compressed data. There is a flush to permit transmission of the data so far, but that would normally be followed by continued compression and more such flushed packets.
3. (Same question as #2, with the same answer.)
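Because a flush makes all data up to that point decodable, such a stream can also be read without patching any bits. The sketch below is one way to do that in Java under a few assumptions: the names are made up, the flushed data is simulated with Java's own Deflater, and the reader simply stops when the inflater makes no further progress instead of waiting for an end-of-stream marker that never comes. No Adler-32 check takes place, since the stream has no trailer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper names; a sketch for flushed but unterminated zlib data.
final class SyncFlushInflate {

    // Stand-in for the foreign application. Buffer sized for small demo inputs only.
    static byte[] deflateWithSyncFlush(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        byte[] buf = new byte[input.length + 64];
        int n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Inflates everything that is available and returns it, even though the
    // stream never reaches its final block (finished() stays false).
    static byte[] inflateAvailable(byte[] zlibData) {
        Inflater inflater = new Inflater(); // default mode: expects the 78 9C zlib header
        inflater.setInput(zlibData);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // input exhausted at the flush point; nothing more to decode
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<doc><item>a</item><item>b</item></doc>".getBytes(StandardCharsets.UTF_8);
        byte[] stored = deflateWithSyncFlush(xml);
        System.out.println(new String(inflateAvailable(stored), StandardCharsets.UTF_8));
    }
}
```

Unlike setting the last-block bit by hand, this does not depend on how many deflate blocks the data contains.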