ENTROPY ENCODING
In information theory, an 'entropy encoding' is a lossless data compression scheme that assigns codes to symbols so that code lengths match the probabilities of the symbols. Typically, an entropy encoder compresses data by replacing fixed-length input codes with variable-length codewords, where the length of each codeword is approximately proportional to the negative logarithm of the symbol's probability. The most common symbols therefore use the shortest codes.
According to Shannon's source coding theorem, the optimal code length for a symbol is $-\log_b P$, where ''b'' is the number of symbols used to make output codes (2 for binary codes) and ''P'' is the probability of the input symbol.
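As a small worked illustration of the formula above, the following sketch computes the ideal code length for each symbol and the resulting average length (the entropy). The four-symbol distribution and the function names are made up for this example, and every probability is assumed to be greater than zero.

```python
import math

def optimal_lengths(probs, b=2):
    """Ideal code length -log_b(P) for each symbol (assumes every P > 0)."""
    return {s: -math.log(p) / math.log(b) for s, p in probs.items()}

def entropy(probs, b=2):
    """Average ideal code length per symbol, in base-b digits."""
    return sum(-p * math.log(p) / math.log(b) for p in probs.values() if p > 0)

# Hypothetical source with four symbols.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = optimal_lengths(probs)   # about 1, 2, 3 and 3 bits
average = entropy(probs)           # about 1.75 bits per symbol
```

A fixed-length code for four symbols would need 2 bits per symbol, so for this distribution an ideal entropy code saves about 0.25 bits per symbol on average.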
Two of the most common entropy encoding techniques are Huffman coding and arithmetic coding.
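To make the first of these techniques concrete, here is a minimal sketch of binary Huffman coding. It repeatedly merges the two least probable entries and prefixes their codewords with 0 and 1. The function name and the sample weights are assumptions for illustration, and ties are broken by insertion order, which is only one of several valid choices.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a binary Huffman code; returns {symbol: bitstring}."""
    if not freqs:
        return {}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs at least one bit.
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # least probable subtree
        w2, _, c2 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

codes = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
code_lengths = {s: len(c) for s, c in codes.items()}
```

For this distribution the codeword lengths come out as 1, 2, 3 and 3 bits, which equal the negative base-2 logarithms of the probabilities. When the probabilities are not powers of 1/2, Huffman code lengths only approximate the ideal lengths, whereas arithmetic coding can get closer.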
If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful.
These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).
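A few of the static codes named above can be sketched in a handful of lines. The bit conventions here are assumptions: unary coding is written as ones terminated by a zero (the opposite convention is also common), and the Rice code is treated as the Golomb code whose parameter is a power of two, 2**k.

```python
def unary(n):
    """Unary code for n >= 0: n ones followed by a terminating zero."""
    if n < 0:
        raise ValueError("unary coding needs n >= 0")
    return "1" * n + "0"

def elias_gamma(n):
    """Elias gamma code for n >= 1: (len - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma coding needs n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def rice(n, k):
    """Rice code for n >= 0: unary quotient n // 2**k, then k-bit remainder."""
    q = n >> k
    r = n & ((1 << k) - 1)
    remainder = format(r, "b").zfill(k) if k > 0 else ""
    return unary(q) + remainder
```

With these conventions, `elias_gamma(5)` gives `00101` and `rice(9, 2)` gives `11001`. Setting `k = 0` reduces the Rice code to plain unary coding. None of these codes needs a probability table, which is why they suit streams whose rough statistics are known in advance.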
Entropy as a measure of similarity
Besides using entropy encoding as a way to compress (and losslessly recover) digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.
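The classification idea can be sketched without building a full compressor, by computing the number of bits an ideal entropy coder would spend under each class model. The sketch below assumes a very crude order-0 byte model with add-one smoothing. The function names, class labels, and training strings are hypothetical, and a practical system would use a much richer model.

```python
import math
from collections import Counter

def train(samples):
    """Order-0 byte model: a smoothed probability for each byte value."""
    counts = Counter()
    for sample in samples:
        counts.update(sample)          # iterating bytes yields ints 0..255
    total = sum(counts.values()) + 256  # add-one smoothing over 256 values
    return [(counts[b] + 1) / total for b in range(256)]

def code_length_bits(model, data):
    """Bits an ideal entropy coder driven by `model` would spend on `data`."""
    return sum(-math.log2(model[b]) for b in data)

def classify(models, data):
    """Pick the class whose model compresses `data` best (fewest bits)."""
    return min(models, key=lambda name: code_length_bits(models[name], data))

# Hypothetical two-class example.
models = {
    "digits": train([b"0123456789" * 20]),
    "letters": train([b"abcdefghij" * 20]),
}
label = classify(models, b"31415926")
```

Here the unknown sample costs far fewer bits under the model trained on digits, so it is assigned to that class. The smallest code length plays the role of the highest compression in the description above.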
External links
★ Online textbook: Information Theory, Inference, and Learning Algorithms, by David MacKay. Gives an accessible introduction to Shannon theory and data compression, including Huffman coding and arithmetic coding.
★ Spam Filtering using Statistical Data Compression Models, by Andrej Bratko, Gordon V. Cormack, Bogdan Filipic, Thomas R. Lynam and Blaz Zupan, Journal of Machine Learning Research, Vol 7 (Dec), 2006.
----
''An earlier (open content) version of the above article was posted on PlanetMath.''
This article is provided by Wikipedia.