Abraham Lempel and Jacob Ziv are considered to be pioneers in the field of data compression techniques, this is due to two academic papers they jointly published in the late 1970s that outlined a technique to efficiently compress data without any loss of information. For this article I’ll be looking at the algorithm outlined in the first paper, which is commonly known known as LZ1.

LZ1 looks for repeated sequences of data, and records references to these duplicated chunks by storing pointers alongside the original information. These pointers are then used to rebuild the data back to its original structure. It works as follows: bytes are read in a sequence from first to last, the location of the byte currently being inspected is known as the ‘coding point’. Each byte at the coding point is compared to information preceding it, in a buffer known as a ‘window’, this window would not be expected to match the whole size of the data, but it may extend to more than a few thousand bytes. If one, or many bytes within this window match the sequence currently located at the coding point, then a pointer is saved to the compressed data. This pointer consists of an offset from the current coding point to the repeated data, and how many bytes have been matched. This value is subsequently described as the pointer’s ‘length’. When the pointer is recorded, the byte immediately after the matched sequence at the coding point is also stored. If no match was found within the window, which would always happen when the first element is read from a file, then a null pointer is saved along with the byte at the coding point is saved with it. So whether we find a match in our window or not, we always save a pointer alongside a single byte as we work our way through the data.

So take the following example, a string of ‘A’s, followed by a ‘B’, and finally a ‘C’:

AAAAAAAABC

The first ‘A’ is inspected, and the window contains no data at this point, so a null pointer is recorded alongside the byte at the coding point, i.e. the letter A. The coding point then moves to the next byte, the second ‘A’. The algorithm has seen this before, namely in the previous byte, then it looks at how many bytes that follows also match if we were to advance the coding point whilst still referring back to the offset, which at the moment has the value of one. This is where the algorithm can be quite clever and match beyond the current coding point; currently the pointer is located at the first byte, and the coding point at the second, both are incremented whilst both the bytes located at each match. In this case, the total amount of bytes that are equivalent is seven, this will be until the ‘B’ is reached. If at this point we had already read more data, and therefore the window would be larger, then we would perform the same matching operation further down the data for the entire size of the window. Since this is our only, and therefore longest match, a pointer with the offset of one and a length of seven is written to the compressed data, as is next byte at the coding point beyond the pointer length, the letter B. And now the coding point moves past the ‘B’, and is inspected the final byte. It hasn’t seen a ‘C’ before in the window, so a null-pointer is saved along with the letter C. It’s important to avoid pushing the coding point past the length of the data when storing the final pointer and byte. To illustrate this scenario, consider this example: if our piece of data was composed only of eight ‘A’s, without the ‘B’, or ‘C’, then the initial ‘A’ would be recorded with a null-pointer, as previously noted, then a pointer to the first ‘A’ with a pointer length of six, with the final ‘A’ alongside it. The reason we would give the pointer a length of six and not seven is because we would still need to record the final byte along with the pointer, and matching seven characters in our pointer would force the coding point to go beyond the length of the data because an additional byte is expected with the final pointer.

Decompressing is an extremely simple task, the coding point and window are still held in memory whilst the data is being decoded. When a pointer offset is encountered, the data at the pointer offset is copied to the current coding point for however many times have been recorded by the pointer length, after this, the byte held with the pointer is then inserted at the coding point. So for our first example, the null-pointer is found and ignored, and then the first byte, the ‘A’, is read and inserted at the coding point. The second pointer is read, and we can see that the offset refers back to the previous character, so we copy everything we’ve already inserted at our previous coding point for the next seven iterations. This will give us seven ‘A’s. Alongside the pointer, the next character is given, ‘B’, so we add that at our coding point, and advance the coding point again. Finally, a null-pointer is discovered along with the final byte, a ‘C’, so we ignore the pointer and add the byte at the coding point. After this, our original data structure has successfully been rebuilt.

Adding pointers to the compressed data incurs a cost as well as a benefit. If the pointer is too large when stored in the compressed data, then it would become less useful. For this reason the size of the window and the length of bytes are also limited. For the code examples that follow, I’ve chosen to store the pointers a 16-bit byte, twelve bits specify a pointer offset, and four for the length. This results in a window of 4096 bytes, and a maximum pointer length of fifteen bytes. After experimenting with various combinations of size and length, a 16-bit pointer with these dimensions appears to be a beneficial trade-off. For most text files it’s unlikely that a sequence longer than fifteen bytes would be repeated - although this would be more suitable for machine readable files with long portions of repeated data. A window of 4096 bytes also gives a good chance to find repeated data, anything smaller than this appears to negatively impact the size of the compressed data. Having a window that is excessively large also increases the time it takes to compress the data, as ultimately more data is compared. There are also memory constraints to consider too if the compression and decompression routines only store the window when performing its task.

In the code below, the compression routine accepts the following arguments: a pointer to a sequence of bytes for compression, the size of the uncompressed data, and lastly, a pointer to the location in memory to store the output. For decompressing, only a pointer to the compressed data, and a pointer to where the uncompressed data will be held are needed. The reason why the uncompressed data size is not required in the latter routine is because when the compressed data is initially written, the final uncompressed length is recorded at the start as a 32-bit unsigned integer, so the decompression routine knows when to stop decoding data by comparing the final size of the data to the current value of the coding point.

#include <stdint.h>

uint32_t lz77_compress (uint8_t *uncompressed_text, uint32_t uncompressed_size, uint8_t *compressed_text)
{
    uint8_t pointer_length, temp_pointer_length;
    uint16_t pointer_pos, temp_pointer_pos, output_pointer;
    uint32_t compressed_pointer, output_size, coding_pos, output_lookahead_ref, look_behind, look_ahead;
    
    *((uint32_t *) compressed_text) = uncompressed_size;
    compressed_pointer = output_size = 4;
    
    for(coding_pos = 0; coding_pos < uncompressed_size; ++coding_pos)
    {
        pointer_pos = 0;
        pointer_length = 0;
        for(temp_pointer_pos = 1; (temp_pointer_pos < 4096) && (temp_pointer_pos <= coding_pos); ++temp_pointer_pos)
        {
            look_behind = coding_pos - temp_pointer_pos;
            look_ahead = coding_pos;
            for(temp_pointer_length = 0; uncompressed_text[look_ahead++] == uncompressed_text[look_behind++]; ++temp_pointer_length)
                if(temp_pointer_length == 15)
                    break;
            if(temp_pointer_length > pointer_length)
            {
                pointer_pos = temp_pointer_pos;
                pointer_length = temp_pointer_length;
                if(pointer_length == 15)
                    break;
            }
        }
        coding_pos += pointer_length;
        if(pointer_length && (coding_pos == uncompressed_size))
        {
            output_pointer = (pointer_pos << 4) | (pointer_length - 1);
            output_lookahead_ref = coding_pos - 1;
        }
        else
        {
            output_pointer = (pointer_pos << 4) | pointer_length;
            output_lookahead_ref = coding_pos;
        }
        *((uint32_t *) (compressed_text + compressed_pointer)) = output_pointer;
        compressed_pointer += 2;
        *(compressed_text + compressed_pointer++) = *(uncompressed_text + output_lookahead_ref);
        output_size += 3;
    }
    
    return output_size;
}

uint32_t lz77_decompress (uint8_t *compressed_text, uint8_t *uncompressed_text)
{
    uint8_t pointer_length;
    uint16_t input_pointer, pointer_pos;
    uint32_t compressed_pointer, coding_pos, pointer_offset, uncompressed_size;
    
    uncompressed_size = *((uint32_t *) compressed_text);
    compressed_pointer = 4;
    
    for(coding_pos = 0; coding_pos < uncompressed_size; ++coding_pos)
    {
        input_pointer = *((uint32_t *) (compressed_text + compressed_pointer));
        compressed_pointer += 2;
        pointer_pos = input_pointer >> 4;
        pointer_length = input_pointer & 15;
        if(pointer_pos)
            for(pointer_offset = coding_pos - pointer_pos; pointer_length > 0; --pointer_length)
                uncompressed_text[coding_pos++] = uncompressed_text[pointer_offset++];
        *(uncompressed_text + coding_pos) = *(compressed_text + compressed_pointer++);
    }
    
    return coding_pos;
}

I have also included the following routines to load and save files locally, as well as an enttry-point function to demonstrate the program compressing its own source file, then output it to a file, and then finally decompress it as a copy of the original. For this to work correctly all the quoted code must be held in a file named ‘lz.c’, although hopefully this will be clear from reading the code. When passing memory to the compression routine, this code only allocates 640KB, which should be enough for anyone. Obviously this should be increased when dealing with larger files.

For reference, the original size of the uncompressed code is 5111 bytes, compressed this becomes 1834 bytes - a total saving of 3277, or 64% of the original.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

long fsize (FILE *in)
{
    long pos, length;
    pos = ftell(in);
    fseek(in, 0L, SEEK_END);
    length = ftell(in);
    fseek(in, pos, SEEK_SET);
    return length;
}

uint32_t file_lz77_compress (char *filename_in, char *filename_out)
{
    FILE *in, *out;
    uint8_t *uncompressed_text, *compressed_text;
    uint32_t uncompressed_size, compressed_size;
    
    in = fopen(filename_in, "r");
    if(in == NULL)
        return 0;
    uncompressed_size = fsize(in);
    uncompressed_text = malloc(uncompressed_size);
    if((uncompressed_size != fread(uncompressed_text, 1, uncompressed_size, in)))
        return 0;
    fclose(in);
    
    compressed_text = malloc(655360);
    
    compressed_size = lz77_compress(uncompressed_text, uncompressed_size, compressed_text);
    
    out = fopen(filename_out, "w");
    if(out == NULL)
        return 0;
    if((compressed_size != fwrite(compressed_text, 1, compressed_size, out)))
        return 0;
    fclose(out);
    
    return compressed_size;
}

uint32_t file_lz77_decompress (char *filename_in, char *filename_out)
{
    FILE *in, *out;
    uint32_t compressed_size, uncompressed_size;
    uint8_t *compressed_text, *uncompressed_text;
    
    in = fopen(filename_in, "r");
    if(in == NULL)
        return 0;
    compressed_size = fsize(in);
    compressed_text = malloc(compressed_size);
    if(fread(compressed_text, 1, compressed_size, in) != compressed_size)
        return 0;
    fclose(in);
    
    uncompressed_size = *((uint32_t *) compressed_text);
    uncompressed_text = malloc(uncompressed_size);
    
    if(lz77_decompress(compressed_text, uncompressed_text) != uncompressed_size)
        return 0;
        
    out = fopen(filename_out, "w");
    if(out == NULL)
        return 0;
    if(fwrite(uncompressed_text, 1, uncompressed_size, out) != uncompressed_size)
        return 0;
    fclose(out);
    
    return uncompressed_size;
}
    
int main (int argc, char const *argv[])
{
    FILE *in;
    in = fopen("./lz.c", "r");
    if(in == NULL)
        return 0;
    printf("Original size: %ld\n", fsize(in));
    fclose(in);
    printf("Compressed: %u\n", file_lz77_compress("./lz.c", "lz.c.z77"));
    printf("Decompressed: %u", file_lz77_decompress("./lz.c.z77", "lz-2.c"));
    return 0;
}