Kitz Forum

Chat => Tech Chat => Topic started by: Weaver on November 09, 2018, 05:04:50 AM

Title: Compressors - versioning
Post by: Weaver on November 09, 2018, 05:04:50 AM
Let’s say that you create a large archive file that contains several versions of some source code. Each of the versions is nearly identical; only a few bytes have changed between them. The archive just contains the whole lot, concatenated together and then compressed hard. Now, a version control system that I once used would difference the source files and store reverse deltas: the latest version is stored in full, for very quick retrieval, and instead of storing older versions in full the system merely stores reverse deltas, computed as the reverse edit operations that must be applied to each version to get back to the one before it. The time to retrieve an old version therefore increases linearly, costing k * ( v_latest - v_requested ) + k0.
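That reverse-delta scheme can be sketched in a few lines of Python. The class and helper names here are hypothetical, and a real VCS would use a compact binary delta format rather than `difflib` opcodes; this is just a minimal sketch of the idea:

```python
import difflib

def make_reverse_delta(newer, older):
    # Ops that rebuild `older` from `newer` (a reverse delta).
    sm = difflib.SequenceMatcher(a=newer, b=older)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))          # copy newer[i1:i2]
        else:
            delta.append(("insert", older[j1:j2]))  # literal text from older
    return delta

def apply_delta(newer, delta):
    out = []
    for op in delta:
        if op[0] == "copy":
            _, i1, i2 = op
            out.append(newer[i1:i2])
        else:
            out.append(op[1])
    return "".join(out)

class ReverseDeltaStore:
    # Latest version kept in full; older versions kept only as reverse deltas.
    def __init__(self, first):
        self.latest = first
        self.deltas = []   # deltas[k] rebuilds version k from version k+1

    def commit(self, new):
        self.deltas.append(make_reverse_delta(new, self.latest))
        self.latest = new

    def get(self, v):
        # Walk back from the latest: cost grows linearly with version distance.
        text = self.latest
        for k in range(len(self.deltas) - 1, v - 1, -1):
            text = apply_delta(text, self.deltas[k])
        return text
```

Retrieving version 0 from a store holding n versions applies n deltas, which is where the linear k * ( v_latest - v_requested ) + k0 cost comes from.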

I wondered, though: is there a good compressor around that is not ‘vcs-aware’ but just accidentally happens to do a very good job if large parts of the archive are identical to other parts? So it just happens to give huge compression in the case where you are storing multiple versions.
Title: Re: Compressors - versioning
Post by: adrianw on November 10, 2018, 01:55:10 PM
xz with a LOT of memory may do the job for you.
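For what it's worth, the effect can be demonstrated with Python's `lzma` module (the same LZMA2 engine xz uses). The data here is synthetic and the figures are illustrative, not measured from any real archive; the point is that a large dictionary lets matches reach all the way back to earlier copies of a version:

```python
import lzma
import random

# Ten near-identical "versions": 100 KB of incompressible (pseudo-random)
# data with a few bytes changed at the end of each version.
rng = random.Random(42)
base = bytes(rng.randrange(256) for _ in range(100_000))
versions = [base + b"# build %d\n" % i for i in range(10)]
archive = b"".join(versions)

# Oversize the LZMA2 dictionary so back-references can span the whole archive.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9, "dict_size": 64 * 1024 * 1024}]
packed_all = lzma.compress(archive, format=lzma.FORMAT_XZ, filters=filters)
packed_one = lzma.compress(versions[0], format=lzma.FORMAT_XZ, filters=filters)

# packed_all comes out barely bigger than packed_one: nine of the ten
# versions are stored as back-references into the first.
```

On the command line the equivalent is raising the dictionary size, e.g. `xz --lzma2=preset=9e,dict=192MiB`, at the cost of proportionally more memory at both compression and decompression time.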
Title: Re: Compressors - versioning
Post by: CarlT on November 10, 2018, 03:58:10 PM
Any that are any good as that's basically how they work?

They find as large a block of data as they can and, where it's duplicated, represent the duplicated block with a pointer to the original.

They aren't as good as dedicated deduplication solutions but should do a fair job of reducing redundancy. LZ etc work in this way.
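A toy illustration of that back-pointer idea: a naive greedy LZ77-style parse in Python. This is not how a production compressor is written (real ones use hash chains for match-finding and entropy-code the tokens), but it shows duplicated blocks being replaced by (offset, length) pointers into the already-seen window:

```python
def lz_compress(data, window=1 << 16, min_match=4):
    # Greedy LZ77-style parse: emit (offset, length) back-references into the
    # window of already-seen bytes, or single literal bytes when no match.
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        # Naive O(n * window) match search: fine for a sketch, not for real use.
        for j in range(max(0, i - window), i):
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, i - j
        if best_len >= min_match:
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i:i + 1])
            i += 1
    return out

def lz_decompress(tokens):
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            off, length = t
            # Byte-by-byte copy so overlapping matches (off < length) work.
            for _ in range(length):
                out.append(out[-off])
        else:
            out.extend(t)
    return bytes(out)
```

The `window` parameter is exactly the limitation discussed below: a repeat further back than the window can never be matched, however identical it is.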
Title: Re: Compressors - versioning
Post by: Weaver on November 10, 2018, 07:56:52 PM
‘As that’s basically how they work’ - agreed, but my caution was due to a concern about a possible limitation in the size of the range over which a compressor looks, that is, how far back it can reach. Very large buffers are needed, which might rule out some. I expected some to be very effective, others less so, because this is not the kind of situation they expect to have to handle. And certainly not all compressors work as LZ does, with back-pointer compression; Huffman compressors, for example.

[Moderator edit, typo correction . . . s/nit/not/]
Title: Re: Compressors - versioning
Post by: CarlT on November 10, 2018, 11:13:47 PM
That's the same limitation with everything. Sooner or later you run out of resources to remember what you've already seen and store changes only.

Bit of an apples and oranges thing here. Compression is intended for a single pass, or for compressing a stream with a limited buffer. What you're talking about is a completely different thing: storing a baseline data set and then writing changes only.

Can get some mileage from doing both: running a deduplication algorithm then running legacy compression on the data that can't be deduplicated and the pointers.
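That two-stage idea can be sketched as fixed-size chunk deduplication followed by zlib over the surviving chunks and the pointer list. The function names are made up for illustration, and real systems typically use content-defined (variable-size) chunking so that repeats need not be chunk-aligned:

```python
import hashlib
import zlib

def dedup_then_compress(data, chunk=4096):
    # Pass 1: fixed-size chunk deduplication -- identical chunks stored once,
    # every chunk position recorded as an index into the unique-chunk list.
    index, unique, refs = {}, [], []
    for i in range(0, len(data), chunk):
        c = data[i:i + chunk]
        h = hashlib.sha256(c).digest()
        if h not in index:
            index[h] = len(unique)
            unique.append(c)
        refs.append(index[h])
    # Pass 2: conventional LZ-based compression over what's left,
    # i.e. the unique data and the pointer stream.
    blob = zlib.compress(b"".join(unique), 9)
    ptrs = zlib.compress(bytes(b for r in refs for b in r.to_bytes(4, "big")), 9)
    return blob, ptrs, [len(c) for c in unique], chunk

def undedup(blob, ptrs, lens, chunk):
    # Rebuild the unique-chunk list from the lengths, then follow the pointers.
    raw = zlib.decompress(blob)
    chunks, pos = [], 0
    for l in lens:
        chunks.append(raw[pos:pos + l])
        pos += l
    pdata = zlib.decompress(ptrs)
    refs = [int.from_bytes(pdata[i:i + 4], "big") for i in range(0, len(pdata), 4)]
    return b"".join(chunks[r] for r in refs)
```

With fixed-size chunks the dedup pass only catches repeats that land on chunk boundaries, which is why dedicated appliances go to the trouble of content-defined chunking.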

What you're talking about, though, is basically what LZ etc already kinda do. No single-run compression can write deltas, for obvious reasons. Deduplication is the best they can do and, combined with LZ or similar, it produces about the best results.

This is the kind of format used in various storage arrays: storing the deltas and then, when transferring data to a DR array, possibly running further compression on the streams. WAN optimisation kit also does similar, deduplicating data streams in real time, replacing its store of data to deduplicate against in LRU fashion, and using stream compression to reduce data that cannot be deduplicated (because it's the first pass and the remote side needs it), alongside LZ on the output after deduplication.
Title: Re: Compressors - versioning
Post by: Weaver on November 10, 2018, 11:58:56 PM
Combining the two methods makes a lot of sense: small scale and larger scale.