Member Login
Username:Password:
or Sign up here
Discover

DATA COMPRESSION


In computer science and information theory, 'data compression' or 'source coding' is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp." One popular instance of compression with which many computer users are familiar is the ZIP file format, which, as well as providing compression, acts as an archiver, storing many files in a single output file.
As with any communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.
Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be viewed (or heard), and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it's being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.

Contents
Lossless vs. Lossy Compression
Applications
Theory
Comparative
See also
Data compression topics
Compression algorithms
Lossless data compression
Lossy data compression
Example implementations
Corpora
References
External links

Lossless vs. Lossy Compression


''Lossless'' compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is possible because most real-world data has ''statistical redundancy''. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
Another kind of compression, called ''lossy data compression'', is possible if some loss of fidelity is acceptable. For example, a person viewing a picture or television video scene might not notice if some of its finest details are removed or not represented perfectly (i.e. may not even notice compression artifacts). Similarly, two clips of audio may be perceived as the same to a listener even though one is missing details found in the other. Lossy data compression algorithms introduce relatively minor differences and represent the picture, video, or audio using fewer bits.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, which for example always removes the last byte of a file, will always compress a file up to the point where it is empty.
An example of lossless vs. lossy compression is the following string:
:888883333333
This string, written an uncompressed form. could have been writen as:
:8[5]3[7].
Interpreted as, "5 eights, 7 threes", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using
:83
instead, the original data is lost, at the benefit of a smaller filesize.

Applications


The above is a very simple example of run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. It is often used to optimize disk space on office computers, or better use the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).
For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.
Lossy image compression is used in digital cameras, greatly increasing their storage capacities while hardly degrading picture quality at all. Similarly, DVDs use the lossy MPEG-2 codec for video compression.
In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline than "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony for example, while audio compression is used for CD ripping and is decoded by audio players.

Theory


The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data compression "has one of the simplest and most elegant design theories in all of engineering". Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.
Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.
The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX).
A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format.
The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG, and the document-compression standard DjVu. The text ''entry'' system, Dasher, is an inverse-arithmetic-coder.
Matt Mahoney, one of the 3 founders of the Hutter Prize, claims that "Compression is Equivalent to General Intelligence" [1].

Comparative


Independent comparison of different methods of data compression (Results & Softwares, in French. Airelle, 2007). Numbers in parenthesis are the rank of the method of compression for the category of file specified above.

★ Text files, such as .htm or .txt, can be hard compressed.

★ Some files are already compressed (e.g. .mp3 or .zip), so the compression rate of such files is poor. Due to the addition of header data, often there are diminishing returns in such compression, causing the file to actually be slightly larger upon storage.

★ To be more representative of the performance, the global score (/20) is calculated with a non-parametric formula after the sum of the ranks (1 to 20) for each of the 20 tested methods.



























Comparison of different methods of data compression
Files
★ .avi

★ .dll

★ .doc

★ .exe

★ .gif

★ .htm

★ .jpg

★ .mp3

★ .mpg

★ .pdf

★ .txt

★ .wav

★ .zip
NotationTOTAL
Number
of
files
1626138242467944298368119 674
Initial
size
5,261,1525,254,2205,254,6565,254,0565,246,2095,261,1875,246,1165,250,4325,257,7205,257,8765,253,4365,256,0245,262,680 68,315,764
7z4,524,067 (2)1,543,179 (3)147,690 (3)3,910,541 (3)4 620 354 (1)341,996 (4)4,770,061 (4)5,053,813 (2)4,879,067 (5)4,258,863 (3)1,270,884 (3)3,670,225 (5)5,226,742 (14)16/2044,217,482
arj4,696,659 (9)2,160,530 (15)1,018,050 (17)4,130,505 (11)4,702,449 (12)898,370 (17)4,803,740 (11)5,108,093 (17)4,910,699 (16)4,606,736 (15)1,875,329 (16)4,450,535 (12)5,223,905 (13)6.1/2048,585,600
bh4,703,291 (12)2,156,986 (12)1,010,284 (15)4,128,594 (9)4,693,021 (9)889,650 (15)4,806,914 (13)5,105,811 (13)4,904,209 (11)4,601,545 (13)1,848,972 (13)4,451,648 (15)5,201,639 (4)7.5/2048,502,564
bz24,720,926 (18)2,095,832 (7)573,721 (5)4,273,885 (18)4,896,084 (18)645,243 (5)4,743,918 (2)5,069,593 (4)4,888,293 (7)4,444,829 (5)1,531,448 (6)3,771,508 (7)5,238,677 (16)11.7/2046,893,957
bza4,639,340 (6)2,166,940 (17)987,806 (11)4,231,254 (17)4,878,327 (17)783,188 (8)4,787,973 (7)5,076,189 (5)4,873,810 (2)4,618,970 (17)1,516,326 (5)3,770,938 (6)5,227,572 (15)9.8/2047,558,633
cab4,701,113 (11)2,148,386 (10)893,796 (7)4,127,044 (8)4,678,810 (5)842,129 (10)4,798,500 (8)5,099,787 (8)4,900,314 (10)4,584,969 (8)1,846,233 (12)4,451,857 (18)5,201,717 (5)10.8/2048,274,655
gza4,703,371 (13)2,157,116 (13)1,001,990 (13)4,126,436 (7)4,693,136 (10)874,444 (12)4,803,739 (10)5,105,765 (12)4,904,249 (12)4,597,720 (11)1,840,188 (11)4,451,638 (14)5,201,436 (3)9.2/2048,461,228
j4,678,506 (8)1,914,777 (5)703,722 (6)4,057,445 (5)4,681,437 (6)691,916 (6)4,805,059 (12)5,092,070 (7)4,898,847 (8)4,326,394 (4)1,629,228 (8)3,594,954 (4)5,215,150 (12)13/2046,289,505
jar4,704,088 (14)2,158,273 (14)1,017,205 (16)4,129,816 (10)4,705,456 (13)893,622 (16)4,809,136 (16)5,107,254 (15)4,904,615 (13)4,603,367 (14)1,849,394 (14)4,451,718 (16)5,202,611 (8)6.2/2048,536,555
lha4,711,090 (16)2,215,476 (18)1,020,194 (18)4,204,071 (15)4,830,501 (15)913,845 (18)4,918,792 (19)5,206,933 (19)5,066,716 (19)4,802,049 (19)1,895,771 (17)4,447,253 (10)5,263,136 (18)6.7/2049,495,827
lzh4,711,090 (16)2,215,476 (18)1,066,340 (19)4,143,461 (14)4,819,157 (14)971,166 (19)4,816,349 (18)5,107,584 (16)4,924,974 (18)4,635,416 (18)1,945,961 (19)4,449,756 (11)5,212,837 (11)5.3/2049,019,567
pkz4,899,083 (20)2,354,373 (20)1,173,097 (20)4,401,289 (20)5,120,590 (19)1,018,250 (20)5,162,114 (20)5,253,006 (20)5,203,747 (20)5,076,577 (20)2,084,290 (20)5,027,854 (20)5,264,213 (19)0.2/2052,038,483
rar4,634,009 (5)1,693,150 (4)173,313 (4)3,948,241 (4)4,639,881 (4)318,269 (3)4,780,095 (6)5,081 085 (6)4,887,973 (6)4,258,775 (2)1,318,381 (4)2,657,731 (3)5,202,579 (7)15.5/2043,593,482
rk4,589,894 (3)1,474,339 (2)132,629 (1)3,866,814 (1)4,628,017 (3)257,588 (1)4,434,701 (1)5,017,545 (1)4,787,286 (1)4,498,992 (6)1,168,720 (1)1,659,771 (1)5,183,337 (1)18.2/2041,699,633
rs4,625,725 (4)2,137,145 (9)937,954 (10)4,221,864 (16)4,850,493 (16)768,711 (7)4,776,635 (5)5,066,886 (3)4,878,852 (3)4,612,537 (16)1,560,879 (7)3,804,335 (8)5,240,116 (17)10.7/2047,482,132
sqx4,662,560 (7)2,078,866 (6)991,992 (12)4,105,933 (6)4,699,518 (11)878,469 (14)4,808,697 (15)5,102,452 (10)4,908,341 (14)4,590,245 (10)1,836,245 (9)4,415,575 (9)5,208,275 (10)9.8/2048,287,168
tgz4,707,481 (15)2,165,409 (16)907,006 (8)4,133,949 (12)4,684,949 (7)861,638 (11)4,807,701 (14)5,105,913 (14)4,909,789 (15)4,588,822 (9)1,853,650 (15)4,451,792 (17)5,202,392 (6)7.8/2048,380,491
uha4,498,275 (1)1,474,005 (1)136,880 (2)3,879,360 (2)4,625,014 (2)284,363 (2)4,760,572 (3)5,104,837 (11)4,879,047 (4)4,237,400 (1)1,233,812 (2)2,435,124 (2)5,187,408 (2)17.3/2044,736,097
yz14,814,935 (19)2,128,899 (8)924,706 (9)4,279,162 (19)4,686,669 (8)804,198 (9)4,810,966 (17)5,124,596 (18)4,922,886 (17)4,568,274 (7)1,901,300 (18)4,561,179 (19)5,207,874 (9)6.4/2048,735,644
zip4,701,064 (10)2,155,923 (11)1,009,814 (14)4,135,619 (13)5,270,565 (20)877,679 (13)4,799,508 (9)5,101,205 (9)4,898,961 (9)4,599,883 (12)1,839,080 (10)4,450,719 (13)5,264,564 (20)7.5/2049,104,584
Intermediate
compressed
size
4,701,0892,152,155962,8804,130,1604,696,327851,8844,803,7405,103,6454,902,2624,593,9831,839,6344,448,5055,210,556 48,519,559
Intermediate
compression
rate
10.6 %59.0 %81.7 %21.4 %10.5 %83.8 %8.4 %2.8 %6.8 %12.6 %65.0 %15.4 %1.0 % 29.0 %

P.Table(P.T.): PAQ8 (kgb archiver is Windows GUI of old PAQ7) is much better than this all, but for the copyright of the table it can't be copied.
Globally, the three best methods tested are rk, rar and 7z. WinRK and WinRar are commercial software, 7-zip is free, open source (LGPL licence) and can be used with Linux.

See also


Data compression topics



Algorithmic complexity theory

Information entropy

Self-extracting archive

Image compression

Speech compression

Video compression

Multimedia compression

Minimum description length

Minimum message length (two-part lossless compression designed for inference)

List of archive formats

List of file archivers

Comparison of file archivers

List of Unix programs

Free file format

HTTP compression

Reverse Delta

Magic compression algorithm

Compression algorithms

Lossless data compression


run-length encoding

dictionary coders


LZ77 & LZ78


LZW

Burrows-Wheeler transform

prediction by partial matching (also known as PPM)

context mixing

Dynamic Markov Compression (DMC)

entropy encoding


Huffman coding (simple entropy coding; commonly used as the final stage of compression)


Adaptive Huffman coding


arithmetic coding (more advanced)



Shannon-Fano coding



range encoding (same as arithmetic coding, but looked at in a slightly different way)


T-code, A variant of Huffman code


Golomb coding (simple entropy coding for infinite input data with a geometric distribution)


universal codes (entropy coding for infinite input data with an arbitrary distribution)



Elias gamma coding



Fibonacci coding
Lossy data compression


discrete cosine transform

fractal compression


fractal transform

wavelet compression

vector quantization

linear predictive coding

Distributed Source Coding Using Syndromes, for correlated data

Modulo-N code for correlated data

A-law Compander

Mu-law Compander
Example implementations


DEFLATE (a combination of LZ77 and Huffman coding) – used by ZIP, gzip and PNG files

LZMA used by 7-Zip and, to a lesser extent, StuffitX

LZO (very fast LZ variation, speed oriented)

LZX (an LZ77 family compression algorithm)

Unix ''compress'' utility (the .Z file format), and GIF use LZW

★ Unix ''pack'' utility (the .z file format) used Huffman coding

bzip2 (a combination of the Burrows-Wheeler transform and Huffman coding)

PAQ (very high compression based on context mixing, but extremely slow; competing in the top of the highest compression competitions)

JPEG (image compression using a discrete cosine transform, then quantization, then Huffman coding)

MPEG (audio and video compression standards family in wide use, using DCT and motion-compensated prediction for video)


MP3 (a part of the MPEG-1 standard for sound and music compression, using subbanding and MDCT, perceptual modeling, quantization, and Huffman coding)


AAC (part of the MPEG-2 and MPEG-4 audio coding specifications, using MDCT, perceptual modeling, quantization, and Huffman coding)

Vorbis (DCT based AAC-alike audio codec, designed with a focus on avoiding patent encumbrance)

JPEG 2000 (image compression using wavelets, then quantization, then entropy coding)

TTA (uses linear predictive coding for lossless audio compression)

FLAC (linear predictive coding for lossless audio compression)
Corpora

Data collections, commonly used for comparing compression algorithms.

Canterbury Corpus

Calgary Corpus

References


1. Rationale for a Large Text Compression Benchmark

External links



Data Compression Benchmarks and Tests

Data Compression Tutorial

Compression Comparison Guide on various settings

Large Data Compression Benchmarks and Tests

Almost complete portraits of Data Compression inventors

Data Compression - Systematisation by T.Strutz

Lossless Data Compression by Greg Goebel

How Stuff Works - File Compression

Ultimate Command Line Compressors

The Data Compression News Blog

Practical Compressor Test (Compares speed and efficiency for commonly used compression programs)

The Monthly Data Compression Newsletter

Compressed File Types and File Extensions

Image and Video Compression Learning Tool (VcDemo)

This article provided by Wikipedia. To edit the contents of this article, click here for original source.