What is the essence of the information compression process? The concept of data compression

ARCHIVERS

Compression of information is the process of transforming the information stored in a file so as to reduce its redundancy. The purpose of this process is to reduce the amount of space the data occupies.

An archive file is a specially created file containing one or more compressed files.

Compression ratio: Kc = Vc / Vo × 100%

where Kc is the compression ratio, Vc is the size of the compressed file, and Vo is the original size of the file.
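For example (the numbers here are made up purely for illustration): if a 200 KB text file is packed into a 50 KB archive, then Kc = 50 / 200 × 100% = 25%.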

The compression ratio depends on:

1) the program used - the archiver,

2) compression method,

3) the type of the source file: text, graphic, video, sound, etc.

Programs that pack and unpack files are called archivers. The most common are ARJ, ZIP, and RAR. The extension of an archive file coincides with the name of the archiver used to create it.

Archivers also allow you to create self-extracting archives: to unpack them, you do not need to run the archiving program, since they contain the unpacking code themselves. Such archives are called SFX (SelF-eXtracting) archives, and their files have the extension *.EXE.


Information compression principles

1. There are repeating characters in any text. Instead of writing out every repetition, it is possible to store a single character together with the number of repetitions. This algorithm is even more effective for graphic files: a screen image usually contains many repeating dots of the same color. The PCX graphic file format is based on this principle of information compression. Modern archivers detect not only repeating characters, but also repeating strings of characters and individual words.

2. If a text does not use all the characters of the PC alphabet, then fewer than 8 bits (one byte) can be used to encode each character. This principle was used in the telegraph apparatus, where only uppercase Russian letters are transmitted: 5 bits are enough to represent them, which makes it possible to fit three characters into two bytes.

3. The next principle exploits the fact that letters occur in a text with different frequencies. For example, in this text the space is the most common character, and the letters "a" and "i" occur very often. These common characters can be represented by short combinations of bits, while the remaining characters are encoded with longer sequences.

4. Physically, the PC allocates disk space for files in clusters, blocks of 4 KB; it is impossible to allocate less than one cluster. For example, a file of 8193 bytes (8 KB and 1 byte) will physically occupy three clusters, that is 12 KB, or 12,288 bytes. Combining a group of files into a single archive allows you to save on these leftovers; when packing many small files, the savings are considerable.

For example, three files of, say, 1 KB, 2 KB and 3 KB stored separately occupy three clusters, that is 12 KB, so 6 KB goes unused, which is 100% of the content of the files. Packed into a single 6 KB archive, they occupy two clusters (8 KB), and only 2 KB, or 33%, remains unused.


Zip archiver

Packing: pkzip [switches] <archive name> [file paths]

Switches: -rp archive subdirectories as well, preserving the directory structure

-s PWD protect the archive with the password PWD

-a add files to the archive

-m move files to the archive

-v view the contents of the archive

If all files in a directory are being archived, be sure to specify the mask *.*

Unpacking: pkunzip [switches] <archive name> [file names]

Switches: -d unpack with subdirectories, preserving the directory structure

-s PWD archive password (PWD)
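As an illustration only (built from the switches listed above; the archive and file names are made up, and the exact switch spelling may differ between PKZIP versions), a typical session might look like this:

pkzip -a -rp work.zip *.*      (pack the current directory together with its subdirectories)
pkzip -a -sPWD work.zip *.txt  (add the text files, protecting the archive with the password PWD)
pkzip -v work.zip              (view the contents of the archive)
pkunzip -d work.zip            (unpack the archive, restoring the directory structure)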


Arj archiver

arj <command> [switches] <archive name> [file names]

For the arj archiver, a single executable file performs both the packing and the unpacking operations.

Commands: a add files to the archive

e unpack without preserving the directory structure

x unpack, preserving the directory structure

l view the contents of the archive

m move files to the archive

d delete files from the archive

Switches: -r pack subdirectories as well, preserving the directory structure

-v vol split the archive into volumes of size vol (if vol is specified);

for standard floppies (360, 720, 1200, 1440) the size is given in kilobytes, for non-standard floppies it is given in bytes

-v must also be specified when unpacking a multivolume archive

-g PWD protect the archive with the password PWD
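Again purely as an illustration built from the commands and switches listed above (the archive names are made up):

arj a -r work.arj *.*       (pack the current directory together with its subdirectories)
arj a -v1440 big.arj *.*    (pack into volumes sized for 1.44 MB floppies)
arj l work.arj              (view the contents of the archive)
arj x -v big.arj            (unpack a multivolume archive, preserving the directory structure)
arj e work.arj              (unpack without preserving the directory structure)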



Nowadays, many users wonder how information compression is carried out in order to save free space on the hard drive, since it is one of the most effective ways of using the available space on any drive. Users who face a lack of free space often have to delete data to free up the necessary room, while more advanced users compress their data to reduce its volume.

However, many do not even know what the process of compressing information is called, not to mention which algorithms are used and what the application of each of them gives.

Should you compress data?

Data compression is quite important today and is useful to almost any user. Of course, almost anyone can now buy advanced storage devices that provide a large amount of free space and are served by high-speed data transmission channels.

However, it has to be understood that the amount of data to be transferred grows as well. If literally ten years ago 700 MB was considered the standard size for an ordinary film, today films in HD quality can occupy several tens of gigabytes, not to mention how much space is taken up by high-quality films in Blu-ray format.

When is data compression necessary?

Of course, you should not expect that the process of compressing information will bring you much benefit, but there are a number of situations in which some methods of compressing information are extremely useful and even necessary:

  • Transfer of certain documents via email. This is especially true for those situations when you need to transfer information in large volumes using various mobile devices.
  • Compression is often used when publishing data on various web sites, when it is necessary to save traffic;
  • Save free space on your hard drive when there is no way to replace or add new storage media. In particular, the most common situation is when there are certain restrictions on the available budget, but there is not enough free disk space.

Of course, in addition to the above, there are still a huge number of different situations in which the process of compressing information may be required in order to reduce its volume, but these are by far the most common.

How can data be compressed?

Today, there is a wide variety of methods for compressing information, but they all fall into two main groups: lossy compression and lossless compression.

The second group of methods is relevant when the data must be recovered with bit-exact accuracy. This approach is the only acceptable one when, for example, a text document is compressed.

At the same time, in some situations the most accurate recovery of the compressed data is not required, and it is possible to use algorithms that compress the information with certain losses. The advantage of lossy compression is that it is much easier to implement and provides a higher degree of archiving.

Lossy compression

Lossy compression provides an order of magnitude better compression while maintaining sufficient quality of the information. In most cases such algorithms are used to compress analog data, such as images or sound. In these situations the unpacked files can differ noticeably from the original information, yet they are practically indistinguishable to the human eye or ear.

Lossless compression

Lossless compression algorithms provide exact data recovery, eliminating any loss in the compressed files. However, it has to be understood that in this case file compression is not as efficient.

Generic methods

Among other things, there are a number of universal methods for compressing information in order to reduce the space it occupies. In general, they are built on three main approaches:

  • Stream transformation. New incoming uncompressed information is described in terms of the data already processed; no probabilities are calculated, and the characters are encoded based solely on the data that has already undergone processing.
  • Statistical compression. These methods are divided into two subcategories, adaptive and block. In the adaptive variant the probabilities for new data are computed from the information already processed during encoding; the adaptive variants of the Shannon-Fano and Huffman algorithms belong here. In the block variant the statistics of each block of data are calculated separately and appended to the compressed block itself.
  • Block transformation. The incoming information is divided into blocks, and each block is then transformed as a whole. Some methods, especially those based on permuting the contents of a block, may not by themselves lead to any significant reduction in the amount of data. However, after such processing the structure of the data improves considerably, so that subsequent compression by other algorithms is much easier and faster.

Compression for backups

One of the most important components of a backup is the device to which the information the user needs is copied. The more data you move, the larger the device you will need. However, if you compress the data, the problem of insufficient free space is unlikely to remain relevant for you.

Why is this needed?

Compressing information significantly reduces the time needed to copy the required files and at the same time saves free space on the drive. In other words, with compression the information is copied much more compactly and quickly, and you can save the money that would otherwise have gone into buying a larger drive. Among other things, compression also reduces the time required to transfer the data to a server or copy it over the network.

Compression of data for backup can be carried out into one or several files - in this case, everything will depend on which program you use and what information you compress.

When choosing a utility, be sure to look at how much the chosen program can compress data. This depends on the type of information: the compression efficiency for text documents can be more than 90%, while for other kinds of data it may be no more than 5%.

Good day.
Today I want to touch on the topic of lossless data compression. Despite the fact that there were already articles on Habré devoted to some algorithms, I wanted to talk about this in a little more detail.
I will try to give both a mathematical description and a description in the usual form, so that everyone can find something interesting for themselves.

In this article, I will touch on the fundamentals of compression and the main types of algorithms.

Compression. Is it necessary in our time?

Of course, yes. Of course, we all understand that now we have access to both high-volume storage media and high-speed data transmission channels. However, at the same time, the volumes of transmitted information are growing. If a few years ago we watched 700-megabyte films that fit on one disc, today films in HD quality can take tens of gigabytes.
Of course, there is not much benefit to compressing everything and everyone. But there are still situations in which compression is extremely useful, if not necessary.

  • Sending documents by e-mail (especially large volumes of documents using mobile devices)
  • When publishing documents on websites, the need to save traffic
  • Save disk space when replacing or adding storage is difficult. For example, this happens in cases where it is not easy to knock out a budget for capital expenditures, and there is not enough disk space.

Of course, you can think of many more situations in which compression will be useful, but these few examples are enough for us.

All compression methods can be divided into two broad groups: lossy compression and lossless compression. Lossless compression is used when information needs to be restored with bit precision. This approach is the only one possible when compressing, for example, text data.
In some cases, however, accurate information recovery is not required and algorithms that implement lossy compression are allowed, which, in contrast to lossless compression, is usually easier to implement and provides a higher degree of archiving.

So, let's move on to considering lossless compression algorithms.

Versatile lossless compression methods

In general, there are three basic options on which compression algorithms are built.
The first group of methods is stream transformation. It describes new incoming uncompressed data in terms of the data already processed. No probabilities are calculated; characters are encoded only on the basis of the data that has already been processed, as, for example, in the LZ methods (named after Abraham Lempel and Jacob Ziv). In this case, the second and further occurrences of a substring already known to the encoder are replaced with references to its first occurrence.
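A toy illustration (not taken from any particular implementation): in an LZ77-style scheme the string "ABCABCABCD" could be stored as the literals "ABC", a back-reference meaning "go back 3 characters and copy 6", and the literal "D"; on decoding, the overlapping copy re-creates the repeated "ABCABC".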

The second group consists of statistical compression methods. In turn, these methods are divided into adaptive (or stream) methods and block methods.
In the first (adaptive) version, the probabilities for new data are calculated using data that has already been processed during encoding. These methods include adaptive variants of the Huffman and Shannon-Fano algorithms.
In the second (block) case, the statistics of each data block are calculated separately and appended to the compressed block itself. These include the static versions of the Huffman, Shannon-Fano, and arithmetic coding methods.

The third group consists of so-called block transformation methods. The incoming data is split into blocks, which are then transformed as a whole. Some methods, especially those based on permuting the contents of a block, may not lead to a significant (or even any) reduction in the amount of data. However, after such processing the structure of the data improves considerably, and subsequent compression by other algorithms is more successful and fast.

General principles on which data compression is based

All data compression methods are based on a simple logical principle. If we imagine that the most common elements are encoded with shorter codes, and less common ones with longer codes, then less space will be required to store all data than if all elements were represented by codes of the same length.
The exact relationship between element frequencies and optimal code lengths is described in Shannon's source coding theorem, which defines the limit of lossless compression in terms of the Shannon entropy of the source.

A bit of math
If the probability of occurrence of an element s_i is p(s_i), then it is most advantageous to represent this element with −log2 p(s_i) bits. If during encoding the length of every element can be brought to −log2 p(s_i) bits, then the length of the entire encoded sequence will be minimal over all possible encoding methods. Moreover, if the probability distribution F = {p(s_i)} of all elements is fixed and the probabilities of the elements are mutually independent, then the average code length can be calculated as

H(F) = − Σ_i p(s_i) · log2 p(s_i).

This value is called the entropy of the probability distribution F, or the entropy of the source at the given moment in time.
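A minimal Pascal sketch of this calculation (an illustration only, with no claim to efficiency) could look like this:

  function Entropy(const P: array of double): double;
  var
    i: integer;
    H: double;
  begin
    H := 0;
    for i := 0 to High(P) do
      if P[i] > 0 then
        H := H - P[i] * (ln(P[i]) / ln(2));  { log2 p = ln p / ln 2 }
    Entropy := H;
  end;

Applying it to the probabilities 0.5, 0.24, 0.15 and 0.11 used in the Huffman example below gives about 1.755 bits per symbol.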
However, the probability of occurrence of an element is usually not independent; on the contrary, it depends on certain factors. In this case, for each new encoded element s_i the probability distribution F takes some value F_k, that is, for each element F = F_k and H = H_k.

In other words, we can say that the source is in state k, which corresponds to a certain set of probabilities p_k(s_i) for all elements s_i.

Therefore, taking this amendment into account, we can express the average code length as

H = Σ_k P_k · H_k,

where P_k is the probability of finding the source in state k.

So, at this stage, we know that compression is based on replacing frequently occurring elements with short codes, and vice versa, and we also know how to determine the average length of codes. But what is code, coding, and how does it happen?

Memoryless encoding

Memoryless codes are the simplest codes on the basis of which data compression can be performed. In a memoryless code, each character of the encoded data vector is replaced with a code word from a prefix set of binary sequences or words.
In my opinion, this is not the most understandable definition. Let's consider this topic in a little more detail.

Let some alphabet be given, consisting of a certain (finite) number of letters. We will call each finite sequence of characters from this alphabet, A = a_1 a_2 ... a_n, a word, and the number n the length of this word.

Let another alphabet also be given. Similarly, let us denote a word in this alphabet by B.

Let us also consider the sets of all non-empty words over the first and over the second alphabet.

Let a mapping F also be given, which assigns to each word A over the first alphabet some word B = F(A) over the second. Then the word B will be called the code of the word A, and the transition from the original word to its code will be called encoding.

Since a word can also consist of one letter, we can identify the correspondence between the letters of the first alphabet and the corresponding words from the second:
a_1 <-> B_1
a_2 <-> B_2
...
a_n <-> B_n

This correspondence is called a scheme and is denoted by Σ.
In this case, the words B_1, B_2, ..., B_n are called elementary codes, and encoding with their help is called alphabetic coding. Of course, most of us have come across this kind of coding, even without knowing everything described above.

So, we have defined the concepts of alphabet, word, code, and encoding. Now let us introduce the concept of a prefix.

Let the word B have the form B = B′B″. Then B′ is called the beginning, or prefix, of the word B, and B″ is its end. This is a fairly simple definition, but note that for any word B, both the empty word Λ and the word B itself can be considered a beginning as well as an end of B.

So, we have come close to the definition of memoryless codes. The last definition we need is the prefix set. A scheme Σ has the prefix property if for any 1 ≤ i, j ≤ n, i ≠ j, the word B_i is not a prefix of the word B_j.
Simply put, a prefix set is a finite set in which no element is a prefix (or beginning) of any other element. A simple example of such a set is an ordinary alphabet.

So, we figured out the basic definitions. So how does the actual encoding without memory happen?
It takes place in three stages.

  1. The alphabet Ψ of the symbols of the original message is compiled, and its symbols are sorted in descending order of their probability of occurrence in the message.
  2. Each symbol a_i from the alphabet Ψ is associated with a certain word B_i from the prefix set Ω.
  3. Each symbol is encoded, and the codes are concatenated into one data stream, which is the result of the compression.

One of the canonical algorithms that illustrates this method is the Huffman algorithm.

Huffman algorithm

The Huffman algorithm uses the frequencies of occurrence of identical bytes in the input data block, assigning shorter bit strings to frequently occurring fragments and vice versa. The resulting code is minimally redundant. Let us consider the case when, regardless of the input stream, the alphabet of the output stream consists of only two characters, zero and one.

First of all, when coding with the Huffman algorithm, we need to build a scheme Σ. This is done as follows:

  1. All letters of the input alphabet are ordered in descending order of their probabilities. All words over the alphabet of the output stream (that is, what we will encode with) are initially considered empty (recall that the alphabet of the output stream consists only of the characters {0, 1}).
  2. The two symbols a_{j-1} and a_j of the input stream that have the lowest probabilities of occurrence are combined into one "pseudo-symbol" with probability p equal to the sum of the probabilities of the symbols that make it up. Then we prepend 0 to the word B_{j-1} and 1 to the word B_j, which will subsequently be the codes of the symbols a_{j-1} and a_j, respectively.
  3. We remove these symbols from the alphabet of the original message and add the generated pseudo-symbol to this alphabet (of course, it must be inserted into the alphabet at the right place, taking its probability into account).
Steps 2 and 3 are repeated until only one pseudo-symbol remains in the alphabet, containing all the original symbols. Since at each step, for each affected symbol, the corresponding word B_i changes (a one or a zero is prepended), after this procedure completes each original symbol a_i of the alphabet will have its own code B_i.

For a better illustration, consider a small example.
Suppose we have an alphabet consisting of only four symbols: {a_1, a_2, a_3, a_4}. Let us also assume that the probabilities of occurrence of these symbols are, respectively, p_1 = 0.5, p_2 = 0.24, p_3 = 0.15, p_4 = 0.11 (the sum of all the probabilities is obviously equal to one).

So, let's build a scheme for a given alphabet.

  1. We combine the two symbols with the lowest probabilities (0.11 and 0.15) into the pseudo-symbol p′ with probability 0.26.
  2. We combine the two symbols with the lowest probabilities (0.24 and 0.26) into the pseudo-symbol p″ with probability 0.5.
  3. Remove the concatenated characters, and insert the resulting pseudo-character into the alphabet.
  4. Finally, we combine the remaining two symbols to get the top of the tree.

If you make an illustration of this process, you get something like this:
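Schematically, the merging can be drawn like this (probabilities in parentheses; the branch labels are the bits prepended at each merge, consistent with the codes listed below):

                   (1.00)
                 0/      \1
           a1 (0.50)    p″ (0.50)
                       0/       \1
                  p′ (0.26)   a2 (0.24)
                 0/      \1
            a3 (0.15)  a4 (0.11)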


As you can see, each time we merge, we assign the codes 0 and 1 to the merged characters.
This way, when the tree is built, we can easily get the code for each character. In our case, the codes will look like this:

a_1 = 0
a_2 = 11
a_3 = 100
a_4 = 101

Since none of these codes is a prefix of any other (that is, we got the notorious prefix set), we can uniquely identify each code in the output stream.
So, we have achieved that the most frequent character is encoded by the shortest code, and vice versa.
If we assume that initially one byte was used to store each character, then we can calculate how much we managed to reduce the data.

Let us assume that at the input we had a string of 1000 characters, in which the character a 1 occurred 500 times, a 2 - 240, a 3 - 150, and a 4 - 110 times.

Initially, the given string occupied 8000 bits. After encoding, we get a string of length 500·1 + 240·2 + 150·3 + 110·3 = 1760 bits. So we managed to compress the data by a factor of 4.54, spending on average 1.76 bits to encode each character of the stream.

Let me remind you that, according to Shannon, the average code length is H = −Σ p_i · log2 p_i. Substituting our probabilities into this formula, we get an average code length of 1.75496602732291, which is very, very close to our result.
However, it should be borne in mind that in addition to the data itself, we need to store the encoding table, which will slightly increase the final size of the encoded data. Obviously, in different cases, different variations of the algorithm can be used - for example, sometimes it is more efficient to use a predetermined table of probabilities, and sometimes it is necessary to compile it dynamically, by going through the compressed data.
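To make the procedure entirely concrete, here is a minimal Free Pascal sketch of the merge scheme described above, written for this four-symbol example (an illustration only, not a general-purpose implementation; ties are broken so that it prints the same codes as above, i.e. a1 -> 0, a2 -> 11, a3 -> 100, a4 -> 101):

  program HuffmanSketch;
  { Repeatedly merges the two least probable (pseudo-)symbols and prepends a bit
    to the codes of all original symbols contained in each of them. }
  const
    NSym = 4;
  var
    Prob:  array[1..NSym] of double = (0.5, 0.24, 0.15, 0.11);
    Code:  array[1..NSym] of string;
    Group: array[1..NSym] of integer;  { pseudo-symbol each original symbol currently belongs to }
    GProb: array[1..NSym] of double;   { probability of each pseudo-symbol; 0 means already merged }
    Alive, i, least, second: integer;

  procedure TwoSmallest(var least, second: integer);
  { least = remaining pseudo-symbol with the smallest probability, second = the next smallest;
    on ties the later index is treated as the smaller one }
  var k: integer;
  begin
    least := 0; second := 0;
    for k := 1 to NSym do
      if GProb[k] > 0 then
        if (least = 0) or (GProb[k] <= GProb[least]) then
        begin
          second := least; least := k;
        end
        else if (second = 0) or (GProb[k] <= GProb[second]) then
          second := k;
  end;

  begin
    for i := 1 to NSym do
    begin
      Code[i] := ''; Group[i] := i; GProb[i] := Prob[i];
    end;
    Alive := NSym;
    while Alive > 1 do
    begin
      TwoSmallest(least, second);
      for i := 1 to NSym do              { 1 goes to the less probable group, 0 to the other }
        if Group[i] = least then Code[i] := '1' + Code[i]
        else if Group[i] = second then Code[i] := '0' + Code[i];
      for i := 1 to NSym do              { fold the merged group into the surviving one }
        if Group[i] = least then Group[i] := second;
      GProb[second] := GProb[second] + GProb[least];
      GProb[least] := 0;
      Alive := Alive - 1;
    end;
    for i := 1 to NSym do
      writeln('a', i, ' -> ', Code[i]);
  end.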

Conclusion

So, in this article I tried to describe the general principles on which lossless compression is based, and also considered one of the canonical algorithms, Huffman coding.
If the Habr community likes the article, I will gladly write a sequel, since there are still many interesting things related to lossless compression: both classical algorithms and preliminary data transformations (for example, the Burrows-Wheeler transform), and, of course, specific algorithms for compressing audio, video and images (the most interesting topic, in my opinion).



Data compression methods


My scientific advisor and I are preparing a small monograph on image processing. I decided to submit the chapter on image compression algorithms to the judgment of the Habr community. Since it is hard to fit a whole chapter within one post, I decided to split it into three posts:
1. Methods of data compression;
2. Lossless compression of images;
3. Compression of images with loss.
Below you can read the first post in the series.

At the moment, there are a large number of lossless compression algorithms, which can be conditionally divided into two large groups:
1. Stream and dictionary algorithms. This group includes algorithms of the RLE (run-length encoding), LZ*, etc. families. A feature of all algorithms in this group is that the encoding uses not information about the frequencies of characters in the message, but information about sequences encountered earlier.
2. Algorithms for statistical (entropy) compression. This group of algorithms compresses information by exploiting the unevenness of the frequencies with which different characters occur in the message. Algorithms in this group include arithmetic and prefix coding algorithms (using Shannon-Fano, Huffman, or secant trees).
Information transformation algorithms can be distinguished into a separate group. The algorithms of this group do not directly compress information, but their application greatly simplifies further compression using stream, dictionary, and entropy algorithms.

Streaming and dictionary algorithms

Run length coding

Run-Length Encoding (RLE) is one of the simplest and most common data compression algorithms. In this algorithm, a sequence of repeated characters is replaced by a character and the number of times it is repeated.
For example, the string "AAAAA", which requires 5 bytes to store (assuming that a byte is allocated for storing one character), can be replaced with "5A", consisting of two bytes. Obviously, the longer the repetition series, the more efficient this algorithm is.

The main disadvantage of this algorithm is its extremely low efficiency on sequences of non-repeating characters. For example, if we consider the sequence "ABABAB" (6 bytes), then after applying the RLE algorithm it will turn into "1A1B1A1B1A1B" (12 bytes). There are various methods to deal with the problem of non-repeating characters.

The simplest method is the following modification: the byte encoding the number of repeats must store information not only about the number of repeats, but also about their presence. If the first bit is 1, then the next 7 bits indicate the number of repetitions of the corresponding character, and if the first bit is 0, then the next 7 bits indicate the number of characters that must be taken without repetition. If you encode "ABABAB" using this modification, we get "-6ABABAB" (7 bytes). Obviously, the proposed technique can significantly improve the efficiency of the RLE algorithm on non-repeating sequences of characters. The implementation of the proposed approach is shown in Listing 1:

  type
    TRLEEncodedString = array of byte;

  function RLEEncode(InMsg: ShortString): TRLEEncodedString;
  var
    MatchFl: boolean;
    MatchCount: shortint;
    EncodedString: TRLEEncodedString;
    N, i: word;
  begin
    N := 0;
    SetLength(EncodedString, 2 * length(InMsg));
    while length(InMsg) >= 1 do
    begin
      { does the message start with a run of identical characters? }
      MatchFl := (length(InMsg) > 1) and (InMsg[1] = InMsg[2]);
      MatchCount := 1;
      while (MatchCount <= 126) and (MatchCount < length(InMsg))
            and ((InMsg[MatchCount] = InMsg[MatchCount + 1]) = MatchFl) do
        MatchCount := MatchCount + 1;
      if MatchFl then
      begin
        { a run: store 128 + run length, then the repeated character }
        N := N + 2;
        EncodedString[N - 2] := MatchCount + 128;
        EncodedString[N - 1] := ord(InMsg[1]);
      end
      else
      begin
        { a literal block: store 128 - block length, then the characters themselves }
        if MatchCount <> length(InMsg) then
          MatchCount := MatchCount - 1;
        N := N + 1 + MatchCount;
        EncodedString[N - 1 - MatchCount] := -MatchCount + 128;
        for i := 1 to MatchCount do
          EncodedString[N - 1 - MatchCount + i] := ord(InMsg[i]);
      end;
      delete(InMsg, 1, MatchCount);
    end;
    SetLength(EncodedString, N);
    RLEEncode := EncodedString;
  end;

Decoding a compressed message is very simple and boils down to a single pass over the compressed message, see Listing 2:
  type
    TRLEEncodedString = array of byte;

  function RLEDecode(InMsg: TRLEEncodedString): ShortString;
  var
    RepeatCount: shortint;
    i, j: word;
    OutMsg: ShortString;
  begin
    OutMsg := '';
    i := 0;
    while i < length(InMsg) do
    begin
      RepeatCount := InMsg[i] - 128;
      i := i + 1;
      if RepeatCount < 0 then
      begin
        { a literal block: copy the next |RepeatCount| characters as they are }
        RepeatCount := abs(RepeatCount);
        for j := i to i + RepeatCount - 1 do
          OutMsg := OutMsg + chr(InMsg[j]);
        i := i + RepeatCount;
      end
      else
      begin
        { a run: output the next character RepeatCount times }
        for j := 1 to RepeatCount do
          OutMsg := OutMsg + chr(InMsg[i]);
        i := i + 1;
      end;
    end;
    RLEDecode := OutMsg;
  end;

The second method for increasing the efficiency of the RLE algorithm is the use of information transformation algorithms that do not compress the data directly, but bring it to a form that is more convenient for compression. As an example of such an algorithm, we will consider the Burrows-Wheeler transform (BWT), a permutation named after its inventors. This permutation does not change the characters themselves, but only changes their order in the string; after the permutation, repeated substrings are gathered into dense groups, which are compressed much better by the RLE algorithm. The direct BWT transform consists of the following steps:
1. Adding a special end-of-line character to the original string, which is not found anywhere else;
2. Getting all cyclic permutations of the original string;
3. Sorting the received lines in lexicographic order;
4. Returning the last column of the resulting matrix.
The implementation of this algorithm is shown in Listing 3.
  const
    EOMsg = '|';

  function BWTEncode(InMsg: ShortString): ShortString;
  var
    OutMsg: ShortString;
    ShiftTable: array of ShortString;   { all cyclic shifts of the input string }
    LastChar: ANSIChar;
    N, i: word;
  begin
    InMsg := InMsg + EOMsg;
    N := length(InMsg);
    SetLength(ShiftTable, N + 1);
    ShiftTable[1] := InMsg;
    for i := 2 to N do
    begin
      { each next row is the previous one rotated right by one character }
      LastChar := InMsg[N];
      InMsg := LastChar + copy(InMsg, 1, N - 1);
      ShiftTable[i] := InMsg;
    end;
    Sort(ShiftTable);   { Sort is assumed to order rows 1..N lexicographically }
    OutMsg := '';
    for i := 1 to N do
      OutMsg := OutMsg + ShiftTable[i][N];
    BWTEncode := OutMsg;
  end;

The easiest way to explain this transformation is with a specific example. Let us take the string "ANANAS" and agree that the end-of-line character will be the symbol "|". All cyclic permutations of this string and the result of sorting them lexicographically are shown in Table 1.

Table 1

  Cyclic permutations    Sorted permutations
  ANANAS|                ANANAS|
  NANAS|A                ANAS|AN
  ANAS|AN                AS|ANAN
  NAS|ANA                NANAS|A
  AS|ANAN                NAS|ANA
  S|ANANA                S|ANANA
  |ANANAS                |ANANAS

That is, the result of the direct transform (the last column of the sorted rows, read from top to bottom) is the string "|NNAAAS". It is easy to see that this string is compressed by the RLE algorithm much better than the original one, since it contains long runs of repeating letters.
A similar effect can be achieved with the help of other transformations, but the advantage of the BWT transform is that it is reversible, however, the inverse transform is more difficult than the direct one. In order to restore the original string, you need to do the following:
1. Create an empty n*n matrix, where n is the number of characters in the encoded message;
2. Fill the rightmost empty column with the encoded message;
3. Sort the table rows in lexicographic order;
4. Repeat steps 2-3 as long as there are empty columns left;
5. Return the row that ends with the end-of-line character.

At first glance, the reverse transformation is straightforward, and one of the options is shown in Listing 4.

  const
    EOMsg = '|';

  function BWTDecode(InMsg: ShortString): ShortString;
  var
    OutMsg: ShortString;
    ShiftTable: array of ShortString;
    N, i, j: word;
  begin
    OutMsg := '';
    N := length(InMsg);
    SetLength(ShiftTable, N + 1);
    for i := 0 to N do
      ShiftTable[i] := '';
    for i := 1 to N do
    begin
      { prepend the encoded string as a new first column, then sort the rows }
      for j := 1 to N do
        ShiftTable[j] := InMsg[j] + ShiftTable[j];
      Sort(ShiftTable);   { the same lexicographic sort as in the encoder }
    end;
    { the original string is the row that ends with the end-of-line character }
    for i := 1 to N do
      if ShiftTable[i][N] = EOMsg then
        OutMsg := ShiftTable[i];
    delete(OutMsg, N, 1);
    BWTDecode := OutMsg;
  end;

But in practice, the efficiency depends on the selected sorting algorithm. Trivial algorithms with quadratic complexity will obviously have an extremely negative impact on performance, so it is recommended to use efficient algorithms.

After the sorting on the seventh (final) iteration, it remains to select from the table the row ending with the "|" character. It is easy to see that this is the only such row. Thus, we have examined the BWT transform on a specific example.

Summing up, we can say that the main advantage of the RLE group of algorithms is the simplicity and speed of operation (including the speed of decoding), and the main disadvantage is inefficiency on non-repeating character sets. The use of special permutations increases the efficiency of the algorithm, but also greatly increases the running time (especially decoding).

Dictionary compression (LZ algorithms)

The group of dictionary algorithms, unlike the algorithms of the RLE group, encodes not the number of repetitions of characters, but the previously encountered sequences of characters. During the operation of the algorithms under consideration, a table is dynamically created with a list of previously encountered sequences and their corresponding codes. This table is often called a dictionary, and the corresponding group of algorithms is called dictionary.

The simplest version of the dictionary algorithm is described below:
1. Initialize the dictionary with all the characters that appear in the input string;
2. Find in the dictionary the longest sequence S that matches the beginning of the encoded message;
3. Output the code of the found sequence and remove the sequence from the beginning of the encoded message;
4. If the end of the message has not been reached, read the next character c, add the sequence Sc to the dictionary, and go to step 2. Otherwise, exit.

For example, the newly initialized dictionary for the phrase "CUCKOOKUCHONKUPILACAPUCHON" is shown in Table 3:

During the compression process, the dictionary will be supplemented with the sequences encountered in the message. The process of replenishing the dictionary is shown in Table 4.

When describing the algorithm, the description of the situation when the dictionary is filled completely was deliberately omitted. Depending on the variant of the algorithm, different behavior is possible: full or partial emptying of the dictionary, stopping the filling of the dictionary, or expanding the dictionary with a corresponding increase in the code width. Each of these approaches has certain disadvantages. For example, stopping the completion of the dictionary can lead to a situation when the dictionary stores sequences that occur at the beginning of the compressed string, but do not occur later. At the same time, clearing the dictionary can remove frequent sequences. Most of the used implementations, when filling out the dictionary, begin to track the compression ratio, and when it drops below a certain level, the dictionary is rebuilt. Next, we will consider the simplest implementation that stops replenishing the dictionary when it is full.

First, let us define the dictionary as a record that stores not only the substrings found, but also the number of substrings stored in the dictionary:
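(The record definition below is a sketch consistent with the listings that follow; the field names match the code that uses them, and TEncodedString, used by the encoder and decoder, is assumed to be an array of bytes, which is sufficient because MAX_DICT_LENGTH is 256.)

  type
    TDictionary = record
      Words: array of ShortString;   { previously encountered subsequences; a word's code is its index }
      WordCount: integer;            { how many words are currently stored }
    end;
    TEncodedString = array of byte;  { the encoder's output: a sequence of dictionary codes }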

Previously encountered subsequences are stored in the Words array, and their codes are the numbers of subsequences in this array.
We also define the functions for searching in the dictionary and adding to the dictionary:

  const
    MAX_DICT_LENGTH = 256;

  { returns the code (index) of str in the dictionary, or -1 if it is absent }
  function FindInDict(D: TDictionary; str: ShortString): integer;
  var
    r: integer;
    i: integer;
    fl: boolean;
  begin
    r := -1;
    fl := false;
    if D.WordCount > 0 then
    begin
      i := D.WordCount;
      while (not fl) and (i > 0) do
      begin
        i := i - 1;
        fl := D.Words[i] = str;
      end;
    end;
    if fl then
      r := i;
    FindInDict := r;
  end;

  { appends str to the dictionary while there is still room }
  procedure AddToDict(var D: TDictionary; str: ShortString);
  begin
    if D.WordCount < MAX_DICT_LENGTH then
    begin
      D.WordCount := D.WordCount + 1;
      SetLength(D.Words, D.WordCount);
      D.Words[D.WordCount - 1] := str;
    end;
  end;

Using these functions, the coding process according to the described algorithm can be implemented as follows:
  function LZWEncode(InMsg: ShortString): TEncodedString;
  var
    OutMsg: TEncodedString;
    tmpstr: ShortString;
    D: TDictionary;
    N: byte;
  begin
    SetLength(OutMsg, length(InMsg));
    N := 0;
    { InitDict is assumed to preload the dictionary with the single-character words
      of the source alphabet, in exactly the same way for the encoder and the decoder }
    InitDict(D);
    while length(InMsg) > 0 do
    begin
      { grow tmpstr while it is still found in the dictionary }
      tmpstr := InMsg[1];
      while (FindInDict(D, tmpstr) >= 0) and (length(InMsg) > length(tmpstr)) do
        tmpstr := tmpstr + InMsg[length(tmpstr) + 1];
      if FindInDict(D, tmpstr) < 0 then
        delete(tmpstr, length(tmpstr), 1);
      { output the code of the longest match and remember the match plus one more character }
      OutMsg[N] := FindInDict(D, tmpstr);
      N := N + 1;
      delete(InMsg, 1, length(tmpstr));
      if length(InMsg) > 0 then
        AddToDict(D, tmpstr + InMsg[1]);
    end;
    SetLength(OutMsg, N);
    LZWEncode := OutMsg;
  end;

The result of the encoding is the sequence of word numbers (codes) in the dictionary.
The decoding process reduces to a direct decoding of the codes; there is no need to transmit the dictionary that was built, it is enough that during decoding the dictionary is initialized in the same way as during encoding. The dictionary is then fully restored directly during the decoding process by concatenating the previous subsequence with the current character.

The only problem can arise in the following situation: when it is necessary to decode a subsequence that is not yet in the dictionary. It is easy to see that this is possible only when it is necessary to extract the substring that should be added at the current step. This means that the substring matches the pattern cSc, i.e. it starts and ends with the same character, where cS is the substring added at the previous step. This is the only situation in which it is necessary to decode a string that has not yet been added. Taking this into account, we can suggest the following variant of decoding a compressed string:

  function LZWDecode(InMsg: TEncodedString): ShortString;
  var
    D: TDictionary;
    OutMsg, tmpstr: ShortString;
    i: byte;
  begin
    OutMsg := '';
    tmpstr := '';
    InitDict(D);   { the dictionary must be initialized exactly as in the encoder }
    for i := 0 to length(InMsg) - 1 do
    begin
      if InMsg[i] >= D.WordCount then
        { the special cSc case: the code refers to the word being built right now }
        tmpstr := D.Words[InMsg[i - 1]] + D.Words[InMsg[i - 1]][1]
      else
        tmpstr := D.Words[InMsg[i]];
      OutMsg := OutMsg + tmpstr;
      if i > 0 then
        AddToDict(D, D.Words[InMsg[i - 1]] + tmpstr[1]);
    end;
    LZWDecode := OutMsg;
  end;

The advantages of dictionary algorithms include their higher compression efficiency compared to RLE. Nevertheless, it should be understood that the real use of these algorithms is associated with some implementation difficulties.

Entropy coding

Encoding with Shannon-Fano trees

The Shannon-Fano algorithm is one of the first compression algorithms ever developed. It is based on the idea of representing more frequent characters with shorter codes. Moreover, the codes obtained by the Shannon-Fano algorithm have the prefix property: no code is the beginning of any other code. The prefix property guarantees that the encoding is one-to-one. The algorithm for constructing Shannon-Fano codes is presented below:
1. Divide the alphabet into two parts, the total probabilities of the symbols in which are as close to each other as possible.
2. Add 0 to the prefix code of the first part of characters, add 1 to the prefix code of the second part of characters.
3. For each part (in which there are at least two characters), recursively perform steps 1-3.
Despite its comparative simplicity, the Shannon-Fano algorithm is not without drawbacks, the most significant of which is that the encoding is not optimal. Although the partitioning at each step is optimal, the algorithm does not guarantee an optimal result as a whole. Consider, for example, the following string: "AAAABVGDEZH". The corresponding Shannon-Fano tree and the codes derived from it are shown in Fig. 1.

Without encoding, the message occupies 40 bits (assuming that each character is encoded with 4 bits), while with the Shannon-Fano algorithm it takes 4·2 + 2 + 4 + 4 + 3 + 3 + 3 = 27 bits. The message size has decreased by 32.5%, but it will be shown below that this result can be improved significantly.
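For readers who want to experiment, here is a minimal Free Pascal sketch of this partitioning (an illustration only: the split point is chosen greedily, which is one simple way of approximating the "closest halves" rule of step 1). For the frequencies of "AAAABVGDEZH" (4, 1, 1, 1, 1, 1, 1) it also produces codes totalling 27 bits.

  program ShannonFanoSketch;
  const
    NSym = 7;   { 'A' occurring 4 times plus six letters occurring once each }
  var
    Freq: array[1..NSym] of integer = (4, 1, 1, 1, 1, 1, 1);  { must be sorted in descending order }
    Code: array[1..NSym] of string;
    i, TotalBits: integer;

  procedure Build(Lo, Hi: integer);
  var
    Total, Acc, Mid, i: integer;
  begin
    if Lo >= Hi then exit;
    Total := 0;
    for i := Lo to Hi do Total := Total + Freq[i];
    { greedy split: grow the left part while its weight stays at or below half of the total }
    Mid := Lo;
    Acc := Freq[Lo];
    while (Mid + 1 < Hi) and (2 * (Acc + Freq[Mid + 1]) <= Total) do
    begin
      Mid := Mid + 1;
      Acc := Acc + Freq[Mid];
    end;
    for i := Lo to Mid do Code[i] := Code[i] + '0';       { the left part gets 0 }
    for i := Mid + 1 to Hi do Code[i] := Code[i] + '1';   { the right part gets 1 }
    Build(Lo, Mid);
    Build(Mid + 1, Hi);
  end;

  begin
    for i := 1 to NSym do Code[i] := '';
    Build(1, NSym);
    TotalBits := 0;
    for i := 1 to NSym do
    begin
      TotalBits := TotalBits + Freq[i] * length(Code[i]);
      writeln('symbol ', i, ' (frequency ', Freq[i], ') -> ', Code[i]);
    end;
    writeln('total: ', TotalBits, ' bits');
  end.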

Encoding with Huffman Trees

The Huffman coding algorithm, developed a few years after the Shannon-Fano algorithm, also possesses the prefix property and, in addition, has provably minimal redundancy, which is precisely why it is so widespread. To obtain Huffman codes, the following algorithm is used:
1. All symbols of the alphabet are represented as free nodes, while the weight of the node is proportional to the frequency of the symbol in the message;
2. From the set of free nodes, two nodes with the minimum weight are selected and a new (parent) node with a weight equal to the sum of the weights of the selected nodes is created;
3. The selected nodes are removed from the free list, and the parent node created on their basis is added to this list;
4. Steps 2-3 are repeated as long as there is more than one node in the free list;
5. Based on the constructed tree, a prefix code is assigned to each character of the alphabet;
6. The message is encoded with the received codes.

Consider the same example as in the case of the Shannon-Fano algorithm. The Huffman tree and the codes obtained for the "AAAABVGDEZH" message are shown in Fig. 2:

It is easy to calculate that the size of the encoded message will be 26 bits, which is less than in the Shannon-Fano algorithm. Separately, it should be noted that due to the popularity of the Huffman algorithm, at the moment there are many options for Huffman coding, including adaptive coding, which does not require the transmission of symbol frequencies.
Among the disadvantages of the Huffman algorithm, a significant part are problems associated with the complexity of implementation. Using real variables to store the symbol frequencies entails a loss of precision, so in practice integer variables are often used instead; but since the weights of parent nodes grow continually, sooner or later an overflow occurs. Thus, despite the simplicity of the algorithm, implementing it correctly can still cause some difficulties, especially for large alphabets.

Encoding with secant function trees

Encoding using secant functions is an algorithm developed by the authors that makes it possible to obtain prefix codes. The algorithm is based on the idea of building a tree, each node of which contains a secant function. To describe the algorithm in more detail, several definitions need to be introduced.
A word is an ordered sequence of m bits (the number m is called the word length).
A secant literal is a pair of the form (bit position, bit value). For example, the literal (4, 1) means that bit 4 of the word must be equal to 1. If the condition of a literal holds, the literal is considered true, otherwise it is false.
A k-bit secant is a set of k literals. If all literals are true, then the secant function itself is true, otherwise it is false.

The tree is built so that each node divides the alphabet into parts that are as equal as possible. Fig. 3 shows an example of a secant tree:

The tree of secant functions in the general case does not guarantee optimal coding, but it provides an extremely high speed of work due to the simplicity of the operation in the nodes.

Arithmetic coding

Arithmetic coding is one of the most effective ways of compressing information. Unlike the Huffman algorithm, arithmetic coding makes it possible to encode messages with an entropy of less than 1 bit per symbol. Since most arithmetic coding algorithms are protected by patents, only the basic ideas are described below.
Suppose the alphabet in use contains N symbols a_1, ..., a_N with frequencies p_1, ..., p_N, respectively. Then the arithmetic coding algorithm looks as follows:
Take the half-interval [0, 1) as the working half-interval