Do lossless compression algorithms reduce entropy?












According to Wikipedia:




Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc.




So entropy is a measure of the amount of information contained in a message. Entropy coders are used to losslessly compress such a message to the minimum number of bits needed to represent it (its entropy). This makes it look to me as if a perfect entropy encoder would be all that is needed to losslessly compress a message as much as possible.



Many compression algorithms, however, use steps before entropy coding that supposedly reduce the entropy of the message.



According to the German Wikipedia:




Entropiekodierer werden häufig mit anderen Kodierern kombiniert. Dabei dienen vorgeschaltete Verfahren dazu, die Entropie der Daten zu verringern.




In English:




Entropy coders are frequently combined with other encoders. Previous steps serve to reduce the entropy of the data.




E.g., bzip2 uses the Burrows-Wheeler transform followed by a move-to-front transform before applying entropy coding (Huffman coding in this case).



Do these steps really reduce the entropy of the message, which would imply reducing the amount of information contained in the message? This seems contradictory to me, since that would mean that information was lost during compression, preventing lossless decompression. Or do they merely transform the message to improve the efficiency of the entropy coding algorithm? Or does entropy not correspond directly to the amount of information in the message?










information-theory data-compression entropy






asked Jan 10 at 8:50









robert

• Could be a way to estimate the entropy though. – pipe, Jan 10 at 13:25




6 Answers






A lot of casual descriptions of entropy are confusing in this way because entropy is not quite as neat and tidy a measure as sometimes presented. In particular, the standard definition of Shannon entropy stipulates that it only applies when, as Wikipedia puts it, "information due to independent events is additive."



In other words, independent events must be statistically independent. If they aren't, then you have to find a representation of the data that defines events in ways that make them truly independent. Otherwise, you will overestimate the entropy.



To put it yet another way, Shannon entropy only applies to true probability distributions, and not to random processes in general. For concrete examples of processes that don't fit the assumptions of Shannon entropy, consider...



Markov processes



A Markov process generates a series of events in which the most recent event is sampled from a distribution that depends on one or more previous events. Obviously a huge number of real-world phenomena are better modeled as Markov processes than as discrete, independent probability distributions. For example: the text you're reading right now!



The naively calculated Shannon entropy rate of a Markov process will always be greater than or equal to the true entropy rate of the process. To get the true entropy of the process, you need to take into account the statistical dependence between events. In simple cases, the formula for that looks like this:



$$ H(\mathcal{S}) = - \sum_i p_i \sum_j p_i(j) \log p_i(j) $$

This can also be represented like so:

$$ H(Y) = - \sum_{ij} \mu_i P_{ij} \log P_{ij} $$

Again quoting Wikipedia, here "$\mu_i$ is the asymptotic distribution of the chain" -- that is, the overall probability that a given event will occur over a long horizon.



This is all a complicated way of saying that even when you can calculate the overall probability of a given event, certain sequences of events are more likely than others to be generated by a Markov process. So for example, the following three strings of English words are increasingly less likely:




  • They ran to the tree

  • The tree ran to they

  • Tree the they to ran


But Shannon entropy will assess all three strings as equally likely. The Markov process entropy takes the difference into account, and as a result, it assigns a lower entropy rate to the process.
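As a rough illustration of that difference (a minimal Python sketch, not part of the original answer; the symbol frequencies stand in for the asymptotic distribution $\mu_i$, and the transition probabilities are estimated from observed pairs):

    from collections import Counter
    from math import log2

    def naive_entropy_rate(seq):
        # Order-0 Shannon entropy: treats every symbol as an independent draw.
        n = len(seq)
        return -sum(c / n * log2(c / n) for c in Counter(seq).values())

    def markov_entropy_rate(seq):
        # First-order estimate of H(Y) = -sum_ij mu_i P_ij log P_ij,
        # with mu_i taken from symbol frequencies and P_ij from observed transitions.
        n = len(seq)
        mu = {s: c / n for s, c in Counter(seq).items()}
        transitions = Counter(zip(seq, seq[1:]))
        outgoing = Counter(seq[:-1])
        return -sum(mu[i] * (c / outgoing[i]) * log2(c / outgoing[i])
                    for (i, j), c in transitions.items())

    text = "they ran to the tree " * 50
    print(naive_entropy_rate(text))   # ignores ordering entirely
    print(markov_entropy_rate(text))  # exploits the dependence; comes out lower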



Entropy rates are model-dependent



If you zoom way out, here's the big picture: the entropy rate of a given sequence of events from an unknown source is model-dependent. You'll assign a different entropy rate to a particular series of events depending on how you model the process that generated them.



And very frequently, your model of the process isn't going to be quite correct. This isn't a simple or easily solved problem. In fact, in general, it is impossible to assign a true entropy rate to a sufficiently long and complex sequence of events if you don't know what the true underlying process is. This is a central result in algorithmic information theory.



What it means in practice is that given an unknown source of sequences of events, different models will yield different entropies, and it's impossible to know which is correct in the long run -- although the one that assigns the lowest entropy is probably the best.






• Thank you very much! This explains perfectly what the mistake in my reasoning was. – robert, Jan 10 at 19:38

• Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines. – piiperi, Jan 12 at 19:24

• @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.) – senderle, Jan 12 at 20:03

• @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length. – piiperi, Jan 12 at 21:04





















No: if the algorithm is lossless, no step in the compression sequence can reduce its entropy - otherwise it would not be able to be decompressed/decoded. However, the additional entropy may be stored in 'out-of-band' information - such as the list that needs to be maintained in order to decode the move-to-front transform.
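To make the move-to-front point concrete, here is a minimal Python sketch (not part of the answer; the alphabet choice is arbitrary): encoder and decoder only have to agree on the initial table, and the transform is fully invertible.

    def mtf_encode(message, alphabet):
        # Emit each symbol's current position in the table, then move it to the front.
        table = list(alphabet)
        output = []
        for symbol in message:
            index = table.index(symbol)
            output.append(index)
            table.insert(0, table.pop(index))
        return output

    def mtf_decode(indices, alphabet):
        # Reverse the process using nothing but the agreed-upon initial table.
        table = list(alphabet)
        output = []
        for index in indices:
            symbol = table[index]
            output.append(symbol)
            table.insert(0, table.pop(index))
        return "".join(output)

    alphabet = "abcdefghijklmnopqrstuvwxyz "
    message = "banana bandana"
    encoded = mtf_encode(message, alphabet)
    assert mtf_decode(encoded, alphabet) == message
    print(encoded)  # repeated symbols become small indices, which entropy-code well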






• So are the additional steps used in compression algorithms before entropy coding just used to allow the entropy coder to come closer to entropy? Does an entropy coder not come close to entropy on its own when applied to an arbitrary message? – robert, Jan 10 at 12:51

• Indeed, it doesn't (well, depending on the exact meaning of "close"). – Grimy, Jan 10 at 12:55

• The additional steps allow the entropy encoder to maintain the entropy of the original message while reducing superfluous information more effectively than if it were to be applied on its own. Whether you apply the pre-processing or not, entropy will be preserved, but compression would be less effective (you'd end up with a less efficient encoding). – Luke Schwartzkopff, Jan 10 at 20:55

• No, the move-to-front transform does not output a separate list that must be transferred to the decoder. Unless you mean the initial list. – immibis, Jan 10 at 23:21

• Aah, you're right, that wasn't the best example :) – Luke Schwartzkopff, Jan 11 at 8:22



















Entropy coders do not compress the message to the minimum number of bits needed to represent it. I know it's tempting to think that, but it's not what they do. They aren't magic and they can't achieve that.



Instead, they do something a bit less magical -- but still useful. Suppose for the moment that we knew that each character of the message was chosen independently from some distribution. Then it would be possible to build a lossless compression algorithm that optimally compresses the messages. These algorithms are called entropy encoders.



Now real messages usually don't have that independence property. For instance, if you see a Q, it's likely that the next letter is a U. And so on. It is still possible to apply an entropy encoder algorithm to a real message, where each character isn't chosen independently of the rest. The algorithm will still be lossless, it can still be used for compression, and in practice, it will still often shorten the length of the message. However, it doesn't shorten it to the minimum possible length. It doesn't compress the message to something whose length is equal to the entropy of the message; it compresses it less than that.
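A small Python sketch of that gap (my own illustration, not from the answer): for a message whose characters are individually balanced but strongly dependent, the order-0 entropy bound is far above what a coder that models the dependence achieves.

    import zlib
    from collections import Counter
    from math import log2

    message = b"ab" * 5000   # each byte depends heavily on the previous one

    # Best case for a memoryless (order-0) entropy coder: n * H0 bits.
    n = len(message)
    h0 = -sum(c / n * log2(c / n) for c in Counter(message).values())
    print("order-0 bound:", round(h0 * n / 8), "bytes")   # ~1250 bytes

    # A compressor that exploits the dependence (LZ77 + Huffman) does far better.
    print("zlib output:  ", len(zlib.compress(message, 9)), "bytes")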



Once you realize this property of entropy encoders, then the paradox evaporates.



In general, any lossless step never reduces the entropy of the message. However, it might put the message into a form where some other compression algorithm is more effective, so it might still be useful (on average) in practice.






They reduce the apparent entropy inherent in the structure of the original message. Or, in other words, they tune the message to make use of the strengths of the next stages of compression.

One simple example would be replacing the name in the end tags of XML with a special symbol. You can perfectly recreate the original XML from that, but the compressor doesn't have to include the full name again in that place.

A more real-world example is PNG compression. Its entropy compressor is DEFLATE, which is a combination of Lempel-Ziv and Huffman. This means that it works best with values and patterns that repeat often. Most adjacent pixels tend to be similar colors. So each row is assigned a filter which turns the original pixel values into a differential encoding. This way the values that end up encoded by DEFLATE are mostly close to 0. In the extreme case this will turn a smooth gradient from all different values into a single value throughout the row, which the LZ portion of DEFLATE makes very quick work of.
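A rough Python sketch of that filtering effect (my own illustration; it mimics PNG's "Sub" filter on a smooth, photograph-like signal rather than using a real PNG encoder):

    import zlib
    import random

    random.seed(0)

    # A "smooth" signal: a random walk with tiny steps, so neighbouring values are
    # close together even though the absolute values wander over the whole byte range.
    values, v = [], 128
    for _ in range(1 << 16):
        v = (v + random.choice((-2, -1, 0, 1, 2))) & 0xFF
        values.append(v)
    raw = bytes(values)

    # PNG-style "Sub" filter (simplified): store each byte minus its left neighbour.
    filtered = bytes((values[i] - values[i - 1]) & 0xFF for i in range(1, len(values)))

    print(len(zlib.compress(raw, 9)))       # many distinct values, few repeats
    print(len(zlib.compress(filtered, 9)))  # only five distinct values, much smaller

In a real PNG the filter is chosen per scanline, but the effect is the same: DEFLATE sees long runs of small, repetitive values instead of a slowly changing ramp.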






• Does that mean apparent entropy is different from the actual information content of a message? How is that related to the actual entropy of the message? – robert, Jan 10 at 10:04

• With "apparent entropy" I mean the entropy that the entropy encoder can compress down to. Different encoders will have different patterns they look for. Huffman does best when the same few symbols are reused often, Lempel-Ziv does best when chunks are repeated, etc. – ratchet freak, Jan 10 at 11:55

• But the Lempel-Ziv algorithms are not entropy coding algorithms, right? What I do not understand is why they are used before entropy coders in e.g. LZMA, when the entropy coder on its own could supposedly already compress the message down to its minimum. – robert, Jan 10 at 12:48

• @kutschkem Does this mean entropy is not an absolute measure of the information content of a message but is relative to what is defined as a symbol (e.g. a single character is considered a symbol vs. 1 bit being considered a symbol)? I think that would explain where my assumptions were wrong. – robert, Jan 10 at 13:03

• @robert ... There is a tradeoff though, which is the "out-of-band" information Luke mentions in his answer, which is generally added by those steps (lookup tables to be able to decode the encoded information). So it makes no sense to define the whole content as one symbol, and encode it as 0, because somewhere the information has to be stored what this 0 encodes. – kutschkem, Jan 10 at 13:19



















The word "Entropy" is often used a bit loosely, to refer to two different things:

• The "total amount of information" in a message or system

• The information "density", or how tightly the information is packed.

The OP's quote of the Wikipedia entry https://en.wikipedia.org/wiki/Entropy_(information_theory) refers to the first:

Shannon's entropy measures the information contained in a message

But (at least when I'm writing this) the same article starts with:

Information entropy is the average rate at which information is produced by a stochastic source of data.

So one is an amount and one is a rate (similar to distance vs. speed). These are sometimes called "extensive" and "intensive" properties (see https://en.wikipedia.org/wiki/Intensive_and_extensive_properties#Extensive_properties).

A classic example of the distinction is Paul Revere's famous lantern signal: "one if by land, and two if by sea". 1 bit of total information (if we ignore the "none if I haven't gotten to North Church yet" case). If Paul added another set of lanterns in each window of the building, that would be "redundant": no more information, so the same "total" or "extensive" entropy; but much more message length, so much lower "intensive" entropy.

If he starts that way but changes to use only one set of lanterns, that's "lossless compression" like in the OP's question. The "extensive" entropy is the same, but the "intensive" entropy is different: because the number of lanterns in the 2nd window is highly correlated with how many you've seen in the first, the redundant message is more predictable, or less random, so has much lower intensive entropy.

There are two other important things to remember:

• First, we typically don't know the "true" entropy of a system in either sense. A naive bystander doesn't know whether "3 lanterns" would be a different message, or whether signals in different windows are redundant or not. If Paul makes his ride a habit, we can count and see whether the windows always match each other. But maybe we just haven't watched long enough to see the rare (and probably important!) exceptions.

• Second, it matters how you measure. Consider trying to estimate how much is communicated by each successive letter of text (that's a rate, so "intensive" entropy, also sometimes called "relative entropy"):

  • If you just notice that people send text around in 8-bit units, your first "estimate" might be 8 bits per letter.

  • If you count the number of distinct letters being used, you'd estimate log2(26), or 4.7 bits per letter (a bit higher if you consider spaces, case, etc.).

  • If you consider that "e" is a better bet for "next letter" than "z", you'll measure letter frequencies and get around 4.14 (see http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm).

  • If you count letter pairs, you'll pick up on patterns like "qu", "th", etc., and get around 3.56.

  • If you count sequences of up to around 5 letters, you'll get still lower values, and as a bonus you can pretty reliably distinguish what human language the text is in.

  • If you're as hard-core and clever as N. G. Burton and J. C. R. Licklider in "Long-Range Constraints in the Statistical Structure of Printed English" (American Journal of Psychology 68 (1955)), you can get up to sequences of 10,000 letters in a row, and find yet another entropy value.

But of course, messages can (and do) have many patterns that aren't modelled by such n-gram methods, so the "true" entropy is lower still.
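The n-gram estimates above are easy to reproduce; here is a minimal Python sketch (my own, with a toy inline sample, so the absolute numbers won't match the published English figures and longer n-grams are increasingly undersampled):

    from collections import Counter
    from math import log2

    def per_letter_entropy(text, n):
        # Block-entropy estimate: H(n-gram distribution) / n bits per letter.
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        total = len(grams)
        counts = Counter(grams)
        return -sum(c / total * log2(c / total) for c in counts.values()) / n

    sample = ("it was the best of times it was the worst of times "
              "it was the age of wisdom it was the age of foolishness ")

    for n in (1, 2, 3, 4):
        print(n, round(per_letter_entropy(sample, n), 2))  # estimate drops as n grows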



If you model a theoretical infinite source with a perfectly random Zipfian distribution of tokens, you can calculate the extensive and intensive entropy it would have, which turns out to depend only on the number of possible distinct tokens. Graphs of what each type of entropy looks like as that number increases are at http://www.derose.net/steve/writings/dissertation/Diss.0.html. The two behave quite differently:

• total (extensive): http://www.derose.net/steve/writings/dissertation/Graph13.png

• relative (intensive): http://www.derose.net/steve/writings/dissertation/Graph14.png

Hope that helps or is at least interesting...






I suspect the wording in the German Wikipedia is mistaken. Compressors increase the entropy. That is to say, not the overall entropy, but the entropy per bit: the information density. E.g., some run-length encoding or dictionary scheme is applied to condense the data. Now the same information is packed into fewer bits, so each bit carries more information. The subsequent Huffman coding does a little more of the same; it's just another layer of compression.
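That density view is easy to check directly; a small Python sketch (my own illustration, using zlib rather than a bare run-length or Huffman coder): the compressed stream is shorter, and the order-0 entropy of its bytes is much higher, i.e. the same information is packed more densely.

    import zlib
    from collections import Counter
    from math import log2

    def bits_per_byte(data):
        # Order-0 entropy of the byte histogram: how densely the bytes are used.
        n = len(data)
        return -sum(c / n * log2(c / n) for c in Counter(data).values())

    message = ", ".join(str(i * i) for i in range(30000)).encode()
    packed = zlib.compress(message, 9)

    print(len(message), round(bits_per_byte(message), 2))  # long, well under 8 bits/byte
    print(len(packed), round(bits_per_byte(packed), 2))    # much shorter, close to 8 bits/byte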






      share|cite|improve this answer









      $endgroup$













        Your Answer





        StackExchange.ifUsing("editor", function () {
        return StackExchange.using("mathjaxEditing", function () {
        StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
        StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
        });
        });
        }, "mathjax-editing");

        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "419"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: false,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f102647%2fdo-lossless-compression-algorithms-reduce-entropy%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        6 Answers
        6






        active

        oldest

        votes








        6 Answers
        6






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        37












        $begingroup$

        A lot of casual descriptions of entropy are confusing in this way because entropy is not quite as neat and tidy a measure as sometimes presented. In particular, the standard definition of Shannon entropy stipulates that it only applies when, as Wikipedia puts it, "information due to independent events is additive."



        In other words, independent events must be statistically independent. If they aren't, then you have to find a representation of the data that defines events in ways that make them truly independent. Otherwise, you will overestimate the entropy.



        To put it yet another way, Shannon entropy only applies to true probability distributions, and not to random processes in general. For concrete examples of processes that don't fit the assumptions of Shannon entropy, consider...



        Markov processes



        A Markov process generates a series of events in which the most recent event is sampled from a distribution that depends on one or more previous events. Obviously a huge number of real-world phenomena are better modeled as Markov processes than as discrete, independent probability distributions. For example: the text you're reading right now!



        The naively calculated Shannon entropy rate of a Markov process will always be greater than or equal to the true entropy rate of the process. To get the true entropy of the process, you need to take into account the statistical dependence between events. In simple cases, the formula for that looks like this:



        $$ H(mathcal{S}) = - sum_i p_i sum_j p_i (j) log p_i (j) $$



        This can also be represented like so:



        $$ H(Y) = - sum_{ij} mu_i P_{ij} log P_{ij} $$



        Again quoting Wikipedia, here "$mu_i$ is the asymptotic distribution of the chain" -- that is, the overall probability that a given event will occur over a long horizon.



        This is all a complicated way of saying that even when you can calculate the overall probability of a given event, certain sequences of events are more likely than others to be generated by a Markov process. So for example, the following three strings of English words are increasingly less likely:




        • They ran to the tree

        • The tree ran to they

        • Tree the they to ran


        But Shannon entropy will assess all three strings as equally likely. The Markov process entropy takes the difference into account, and as a result, it assigns a lower entropy rate to the process.



        Entropy rates are model-dependent



        If you zoom way out, here's the big picture: the entropy rate of a given sequence of events from an unknown source is model-dependent. You'll assign a different entropy rate to a particular series of events depending on how you model the process that generated them.



        And very frequently, your model of the process isn't going to be quite correct. This isn't a simple or easy to solve problem. In fact, in general, it is impossible to assign a true entropy rate to a sufficiently long and complex sequence of events if you don't know what the true underlying process is. This is a central result in algorithmic information theory.



        What it means in practice is that given an unknown source of sequences of events, different models will yield different entropies, and it's impossible to know which is correct in the long run -- although the one that assigns the lowest entropy is probably the best.






        share|cite|improve this answer











        $endgroup$









        • 2




          $begingroup$
          Thank you very much! This explains perfectly what the mistake in my reasoning was.
          $endgroup$
          – robert
          Jan 10 at 19:38










        • $begingroup$
          Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines.
          $endgroup$
          – piiperi
          Jan 12 at 19:24












        • $begingroup$
          @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.)
          $endgroup$
          – senderle
          Jan 12 at 20:03










        • $begingroup$
          @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length.
          $endgroup$
          – piiperi
          Jan 12 at 21:04


















        37












        $begingroup$

        A lot of casual descriptions of entropy are confusing in this way because entropy is not quite as neat and tidy a measure as sometimes presented. In particular, the standard definition of Shannon entropy stipulates that it only applies when, as Wikipedia puts it, "information due to independent events is additive."



        In other words, independent events must be statistically independent. If they aren't, then you have to find a representation of the data that defines events in ways that make them truly independent. Otherwise, you will overestimate the entropy.



        To put it yet another way, Shannon entropy only applies to true probability distributions, and not to random processes in general. For concrete examples of processes that don't fit the assumptions of Shannon entropy, consider...



        Markov processes



        A Markov process generates a series of events in which the most recent event is sampled from a distribution that depends on one or more previous events. Obviously a huge number of real-world phenomena are better modeled as Markov processes than as discrete, independent probability distributions. For example: the text you're reading right now!



        The naively calculated Shannon entropy rate of a Markov process will always be greater than or equal to the true entropy rate of the process. To get the true entropy of the process, you need to take into account the statistical dependence between events. In simple cases, the formula for that looks like this:



        $$ H(mathcal{S}) = - sum_i p_i sum_j p_i (j) log p_i (j) $$



        This can also be represented like so:



        $$ H(Y) = - sum_{ij} mu_i P_{ij} log P_{ij} $$



        Again quoting Wikipedia, here "$mu_i$ is the asymptotic distribution of the chain" -- that is, the overall probability that a given event will occur over a long horizon.



        This is all a complicated way of saying that even when you can calculate the overall probability of a given event, certain sequences of events are more likely than others to be generated by a Markov process. So for example, the following three strings of English words are increasingly less likely:




        • They ran to the tree

        • The tree ran to they

        • Tree the they to ran


        But Shannon entropy will assess all three strings as equally likely. The Markov process entropy takes the difference into account, and as a result, it assigns a lower entropy rate to the process.



        Entropy rates are model-dependent



        If you zoom way out, here's the big picture: the entropy rate of a given sequence of events from an unknown source is model-dependent. You'll assign a different entropy rate to a particular series of events depending on how you model the process that generated them.



        And very frequently, your model of the process isn't going to be quite correct. This isn't a simple or easy to solve problem. In fact, in general, it is impossible to assign a true entropy rate to a sufficiently long and complex sequence of events if you don't know what the true underlying process is. This is a central result in algorithmic information theory.



        What it means in practice is that given an unknown source of sequences of events, different models will yield different entropies, and it's impossible to know which is correct in the long run -- although the one that assigns the lowest entropy is probably the best.






        share|cite|improve this answer











        $endgroup$









        • 2




          $begingroup$
          Thank you very much! This explains perfectly what the mistake in my reasoning was.
          $endgroup$
          – robert
          Jan 10 at 19:38










        • $begingroup$
          Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines.
          $endgroup$
          – piiperi
          Jan 12 at 19:24












        • $begingroup$
          @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.)
          $endgroup$
          – senderle
          Jan 12 at 20:03










        • $begingroup$
          @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length.
          $endgroup$
          – piiperi
          Jan 12 at 21:04
















        37












        37








        37





        $begingroup$

        A lot of casual descriptions of entropy are confusing in this way because entropy is not quite as neat and tidy a measure as sometimes presented. In particular, the standard definition of Shannon entropy stipulates that it only applies when, as Wikipedia puts it, "information due to independent events is additive."



        In other words, independent events must be statistically independent. If they aren't, then you have to find a representation of the data that defines events in ways that make them truly independent. Otherwise, you will overestimate the entropy.



        To put it yet another way, Shannon entropy only applies to true probability distributions, and not to random processes in general. For concrete examples of processes that don't fit the assumptions of Shannon entropy, consider...



        Markov processes



        A Markov process generates a series of events in which the most recent event is sampled from a distribution that depends on one or more previous events. Obviously a huge number of real-world phenomena are better modeled as Markov processes than as discrete, independent probability distributions. For example: the text you're reading right now!



        The naively calculated Shannon entropy rate of a Markov process will always be greater than or equal to the true entropy rate of the process. To get the true entropy of the process, you need to take into account the statistical dependence between events. In simple cases, the formula for that looks like this:



        $$ H(mathcal{S}) = - sum_i p_i sum_j p_i (j) log p_i (j) $$



        This can also be represented like so:



        $$ H(Y) = - sum_{ij} mu_i P_{ij} log P_{ij} $$



        Again quoting Wikipedia, here "$mu_i$ is the asymptotic distribution of the chain" -- that is, the overall probability that a given event will occur over a long horizon.



        This is all a complicated way of saying that even when you can calculate the overall probability of a given event, certain sequences of events are more likely than others to be generated by a Markov process. So for example, the following three strings of English words are increasingly less likely:




        • They ran to the tree

        • The tree ran to they

        • Tree the they to ran


        But Shannon entropy will assess all three strings as equally likely. The Markov process entropy takes the difference into account, and as a result, it assigns a lower entropy rate to the process.



        Entropy rates are model-dependent



        If you zoom way out, here's the big picture: the entropy rate of a given sequence of events from an unknown source is model-dependent. You'll assign a different entropy rate to a particular series of events depending on how you model the process that generated them.



        And very frequently, your model of the process isn't going to be quite correct. This isn't a simple or easy to solve problem. In fact, in general, it is impossible to assign a true entropy rate to a sufficiently long and complex sequence of events if you don't know what the true underlying process is. This is a central result in algorithmic information theory.



        What it means in practice is that given an unknown source of sequences of events, different models will yield different entropies, and it's impossible to know which is correct in the long run -- although the one that assigns the lowest entropy is probably the best.






        share|cite|improve this answer











        $endgroup$



        A lot of casual descriptions of entropy are confusing in this way because entropy is not quite as neat and tidy a measure as sometimes presented. In particular, the standard definition of Shannon entropy stipulates that it only applies when, as Wikipedia puts it, "information due to independent events is additive."



        In other words, independent events must be statistically independent. If they aren't, then you have to find a representation of the data that defines events in ways that make them truly independent. Otherwise, you will overestimate the entropy.



        To put it yet another way, Shannon entropy only applies to true probability distributions, and not to random processes in general. For concrete examples of processes that don't fit the assumptions of Shannon entropy, consider...



        Markov processes



        A Markov process generates a series of events in which the most recent event is sampled from a distribution that depends on one or more previous events. Obviously a huge number of real-world phenomena are better modeled as Markov processes than as discrete, independent probability distributions. For example: the text you're reading right now!



        The naively calculated Shannon entropy rate of a Markov process will always be greater than or equal to the true entropy rate of the process. To get the true entropy of the process, you need to take into account the statistical dependence between events. In simple cases, the formula for that looks like this:



        $$ H(mathcal{S}) = - sum_i p_i sum_j p_i (j) log p_i (j) $$



        This can also be represented like so:



        $$ H(Y) = - sum_{ij} mu_i P_{ij} log P_{ij} $$



        Again quoting Wikipedia, here "$mu_i$ is the asymptotic distribution of the chain" -- that is, the overall probability that a given event will occur over a long horizon.



        This is all a complicated way of saying that even when you can calculate the overall probability of a given event, certain sequences of events are more likely than others to be generated by a Markov process. So for example, the following three strings of English words are increasingly less likely:




        • They ran to the tree

        • The tree ran to they

        • Tree the they to ran


        But Shannon entropy will assess all three strings as equally likely. The Markov process entropy takes the difference into account, and as a result, it assigns a lower entropy rate to the process.



        Entropy rates are model-dependent



        If you zoom way out, here's the big picture: the entropy rate of a given sequence of events from an unknown source is model-dependent. You'll assign a different entropy rate to a particular series of events depending on how you model the process that generated them.



        And very frequently, your model of the process isn't going to be quite correct. This isn't a simple or easy to solve problem. In fact, in general, it is impossible to assign a true entropy rate to a sufficiently long and complex sequence of events if you don't know what the true underlying process is. This is a central result in algorithmic information theory.



        What it means in practice is that given an unknown source of sequences of events, different models will yield different entropies, and it's impossible to know which is correct in the long run -- although the one that assigns the lowest entropy is probably the best.







        share|cite|improve this answer














        share|cite|improve this answer



        share|cite|improve this answer








        edited Jan 11 at 16:53

























        answered Jan 10 at 16:06









        senderlesenderle

        54637




        54637








        • 2




          $begingroup$
          Thank you very much! This explains perfectly what the mistake in my reasoning was.
          $endgroup$
          – robert
          Jan 10 at 19:38










        • $begingroup$
          Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines.
          $endgroup$
          – piiperi
          Jan 12 at 19:24












        • $begingroup$
          @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.)
          $endgroup$
          – senderle
          Jan 12 at 20:03










        • $begingroup$
          @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length.
          $endgroup$
          – piiperi
          Jan 12 at 21:04
















        • 2




          $begingroup$
          Thank you very much! This explains perfectly what the mistake in my reasoning was.
          $endgroup$
          – robert
          Jan 10 at 19:38










        • $begingroup$
          Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines.
          $endgroup$
          – piiperi
          Jan 12 at 19:24












        • $begingroup$
          @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.)
          $endgroup$
          – senderle
          Jan 12 at 20:03










        • $begingroup$
          @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length.
          $endgroup$
          – piiperi
          Jan 12 at 21:04










        2




        2




        $begingroup$
        Thank you very much! This explains perfectly what the mistake in my reasoning was.
        $endgroup$
        – robert
        Jan 10 at 19:38




        $begingroup$
        Thank you very much! This explains perfectly what the mistake in my reasoning was.
        $endgroup$
        – robert
        Jan 10 at 19:38












        $begingroup$
        Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines.
        $endgroup$
        – piiperi
        Jan 12 at 19:24






        $begingroup$
        Your answer would be even better if it had data, image and audio decompressors as examples of modeled processes. In e.g. LZ data compression, the model assumes a machine (decoder) that takes as input commands like (D,L): “copy to output L contiguous symbols from offset D relative to the current output position”, or (c): “copy symbol c to the current output position”. The LZ encoder transforms its input symbol stream to the decoder's command language, and the command symbol stream has a different entropy (and length) than the encoded stream. Other types of compression have different machines.
        $endgroup$
        – piiperi
        Jan 12 at 19:24














        $begingroup$
        @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.)
        $endgroup$
        – senderle
        Jan 12 at 20:03




        $begingroup$
        @piiperi that does sound helpful—I don't know any of those details though. (I'm coming at the question from a machine learning standpoint.)
        $endgroup$
        – senderle
        Jan 12 at 20:03












        $begingroup$
        @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length.
        $endgroup$
        – piiperi
        Jan 12 at 21:04






        $begingroup$
        @senderle I meant expanding the chapter "Entropy rates are model-dependent" with some concrete process examples. You talk about a process that generates events, and data, image, video, audio etc. compressors' processing components can be seen as such processes. A pure entropy coder is the final step of a data compression pipeline. None of the pipeline steps really "reduce entropy". Instead, each of them creates instructions for a machine that can reproduce the original symbol stream. And each instruction stream has a different entropy and often a different (i.e. shorter) length.
        $endgroup$
        – piiperi
        Jan 12 at 21:04













        11












        $begingroup$

        No, if the algorithm is lossless no steps in the compression sequence can reduce its entropy - otherwise it would not be able to be decompressed/decoded. However, the additional entropy may be stored in 'out-of-band' information - such as the list that needs to be maintained in order to decode the move-to-front transform.






        share|cite|improve this answer









        $endgroup$













        • $begingroup$
          So are the additional steps used in compression algorithms before entropy coding just used to allow the entropy coder to come closer to entropy? Does an entropy coder not come close to entropy on its own when applied to an arbitrary message?
          $endgroup$
          – robert
          Jan 10 at 12:51










        • $begingroup$
          Indeed, it doesn't (well, depending on the exact meaning of "close").
          $endgroup$
          – Grimy
          Jan 10 at 12:55












        • $begingroup$
          The additional steps allow the entropy encoder to maintain the entropy of the original message while reducing superfluous information more effectively than if it were to be applied on its own. Whether you apply the pre-processing or not, entropy will be preserved, but compression would be less effective (you'd end up with a less efficient encoding).
          $endgroup$
          – Luke Schwartzkopff
          Jan 10 at 20:55










        • $begingroup$
          No, the move-to-front transform does not output a separate list that must be transferred to the decoder. Unless you mean the initial list.
          $endgroup$
          – immibis
          Jan 10 at 23:21










        • $begingroup$
          Aah, you're right, that wasn't the best example :)
          $endgroup$
          – Luke Schwartzkopff
          Jan 11 at 8:22



























6

Entropy coders do not compress the message to the minimum number of bits needed to represent it. I know it's tempting to think that, but it's not what they do. They aren't magic and they can't achieve that.

Instead, they do something a bit less magical, but still useful. Suppose for the moment that we knew that each character of the message was chosen independently from some distribution. Then it would be possible to build a lossless compression algorithm that optimally compresses the messages. These algorithms are called entropy encoders.

Now real messages usually don't have that independence property. For instance, if you see a Q, it's likely that the next letter is a U. And so on. It is still possible to apply an entropy encoder algorithm to a real message, where each character isn't chosen independently of the rest. The algorithm will still be lossless, it can still be used for compression, and in practice, it will still often shorten the length of the message. However, it doesn't shorten it to the minimum possible length. It doesn't compress the message to something whose length is equal to the entropy of the message; it compresses it to something longer than that.

Once you realize this property of entropy encoders, the paradox evaporates.

In general, any lossless step never reduces the entropy of the message. However, it might put the message into a form where some other compression algorithm is more effective, so it might still be useful (on average) in practice.

answered Jan 11 at 0:11 by D.W.
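To make that gap concrete, here is a small sketch (an editorial illustration, not code from this answer). The message below alternates "qu" endlessly, so each character is almost completely determined by the previous one; a per-symbol entropy coder sees two equally frequent symbols and cannot beat about 1 bit per character, while zlib (standing in here for "some other compression algorithm" that models repetition) exploits the dependency.

    import zlib
    from collections import Counter
    from math import log2

    msg = b"qu" * 5000                      # 10,000 bytes, but very little "new" information

    counts = Counter(msg)
    n = len(msg)
    zero_order = -sum((c / n) * log2(c / n) for c in counts.values())   # bits/symbol under an i.i.d. model

    print(f"zero-order entropy:  {zero_order:.2f} bits per symbol")     # 1.00
    print(f"per-symbol coding: ~{n * zero_order / 8:.0f} bytes")        # ~1250 bytes: the best an ideal symbol-by-symbol coder can do
    print(f"zlib output:        {len(zlib.compress(msg))} bytes")       # far smaller, because it models the repetition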





























5

They reduce the apparent entropy inherent in the structure of the original message. Or in other words, they tune the message to make use of the strengths of the next stages of compression.

One simple example would be replacing the name in the end tags of XML with a special symbol. You can perfectly recreate the original XML from that, but the compressor doesn't have to include the full name again in that place.

A more real-world example is PNG compression. Its entropy compressor is DEFLATE, which is a combination of Lempel-Ziv and Huffman. This means that it works best with values and patterns that repeat often. Most adjacent pixels tend to be similar colors, so each row is assigned a filter which turns the original pixel values into a differential encoding. This way the values that end up encoded by DEFLATE are mostly close to 0. In the extreme case, this will turn a smooth gradient from all different values into a single value throughout the row, which the LZ portion of DEFLATE makes very quick work of.

edited Jan 10 at 10:13, answered Jan 10 at 10:01 by ratchet freak
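A rough sketch of that filtering idea (an editorial illustration of PNG's "Sub" filter, not libpng code): delta-encode a gradient row and compare how well zlib, whose format is DEFLATE, handles the raw and filtered versions.

    import zlib

    row = bytes(range(256))   # a smooth 0..255 gradient: every byte differs from its neighbours
    # PNG's "Sub" filter idea: store each byte as its difference from the previous byte (mod 256)
    sub = bytes((row[i] - (row[i - 1] if i else 0)) & 0xFF for i in range(len(row)))

    print(len(zlib.compress(row)))   # raw gradient: little repetition for DEFLATE to exploit
    print(len(zlib.compress(sub)))   # filtered row: a long run of 0x01, compresses far better

    # The filter is lossless: a running sum (mod 256) recovers the original row exactly.
    recon, acc = bytearray(), 0
    for d in sub:
        acc = (acc + d) & 0xFF
        recon.append(acc)
    assert bytes(recon) == row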



















• Does that mean apparent entropy is different from the actual information content of a message? How is that related to the actual entropy of the message?
  – robert, Jan 10 at 10:04

• With "apparent entropy" I mean the entropy that the entropy encoder can compress down to. Different encoders will have different patterns they look for. Huffman does best when the same few symbols are used often, Lempel-Ziv does best when chunks are repeated, etc.
  – ratchet freak, Jan 10 at 11:55

• But the Lempel-Ziv algorithms are not entropy coding algorithms, right? What I do not understand is why they are used before entropy coders in e.g. LZMA, when the entropy coder on its own could supposedly already compress the message down to its minimum.
  – robert, Jan 10 at 12:48

• @kutschkem Does this mean entropy is not an absolute measure of the information content of a message but is relative to what is defined as a symbol (e.g. a single character is considered a symbol vs. 1 bit being considered a symbol)? I think that would explain where my assumptions were wrong.
  – robert, Jan 10 at 13:03

• @robert ... There is a tradeoff though, which is the "out-of-band" information Luke mentions in his answer, which is generally added by those steps (lookup tables to be able to decode the encoded information). So it makes no sense to define the whole content as one symbol and encode it as 0, because somewhere the information has to be stored what this 0 encodes.
  – kutschkem, Jan 10 at 13:19
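The point raised in the last two comments, that an empirical entropy figure depends on what you treat as a symbol, can be seen in a short sketch (an editorial illustration; the data string is made up). The same data scored with a per-byte zero-order model and with a per-bit zero-order model gives quite different totals, because each choice of alphabet ignores different structure.

    from collections import Counter
    from math import log2

    def zero_order_total_bits(symbols):
        counts = Counter(symbols)
        n = len(symbols)
        per_symbol = -sum((c / n) * log2(c / n) for c in counts.values())
        return per_symbol * n                                   # total bits under this zero-order model

    data = b"ABABABABABAB" * 100                                # 1200 bytes of a trivially predictable pattern

    as_bytes = list(data)                                       # symbol = one byte
    as_bits = [(b >> i) & 1 for b in data for i in range(8)]    # symbol = one bit

    print(f"per-byte model: {zero_order_total_bits(as_bytes):.0f} bits")   # 1200
    print(f"per-bit  model: {zero_order_total_bits(as_bits):.0f} bits")    # much larger
    # Neither figure is "the" information content; both models ignore the obvious ABAB... structure.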



























2

The word "entropy" is often used a bit loosely, to refer to two different things:

• The "total amount of information" in a message or system
• The information "density", or how tightly the information is packed.

OP's quote of the Wikipedia entry https://en.wikipedia.org/wiki/Entropy_(information_theory) refers to the first:

"Shannon's entropy measures the information contained in a message"

But (at least when I'm writing this) the same article starts with:

"Information entropy is the average rate at which information is produced by a stochastic source of data."

So one is an amount and one is a rate (similar to distance vs. speed). These are sometimes called "extensive" and "intensive" properties (see https://en.wikipedia.org/wiki/Intensive_and_extensive_properties#Extensive_properties).

A classic example of the distinction is Paul Revere's famous lantern signal: "one if by land, and two if by sea". 1 bit of total information (if we ignore the "none if I haven't gotten to North Church yet" case). If Paul added another set of lanterns in each window of the building, that would be redundant: no more information, so the same "total" or "extensive" entropy, but much more message length, so much lower "intensive" entropy.

If he starts that way but changes to use only one set of lanterns, that's "lossless compression" like in OP's question. The "extensive" entropy is the same, but the "intensive" entropy is different: because the number of lanterns in the 2nd window is highly correlated with how many you've seen in the first, the redundant message is more predictable, or less random, so it has much lower intensive entropy.

There are two other important things to remember:

• First, we typically don't know the "true" entropy of a system in either sense. A naive bystander doesn't know whether "3 lanterns" would be a different message, or whether signals in different windows are redundant or not. If Paul makes his ride a habit, we can count and see whether the windows always match each other. But maybe we just haven't watched long enough to see the rare (and probably important!) exceptions.

• Second, it matters how you measure. Consider trying to estimate how much is communicated by each successive letter of text (that's a rate, so "intensive" entropy, also sometimes called "relative entropy"):

  • If you just notice that people send text around in 8-bit units, your first "estimate" might be 8 bits per letter.
  • If you count the number of distinct letters being used, you'd estimate log2(26), or 4.7 bits per letter (a bit higher if you consider spaces, case, etc.).
  • If you consider that "e" is a better bet for "next letter" than "z", you'll measure letter frequencies and get around 4.14 (see http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm).
  • If you count letter pairs you'll pick up on patterns like "qu", "th", etc., and get around 3.56.
  • If you count sequences of up to around 5 letters, you'll get still lower values, and as a bonus you can pretty reliably distinguish what human language the text is in.
  • If you're as hard-core and clever as N. G. Burton and J. C. R. Licklider in "Long-Range Constraints in the Statistical Structure of Printed English" (American Journal of Psychology 68 (1955)), you can get up to sequences of 10,000 letters in a row, and find yet another entropy value.

But of course, messages can (and do) have many patterns that aren't modelled by such n-gram methods, so the "true" entropy is lower still.

If you model a theoretical infinite source with a perfectly random Zipfian distribution of tokens, you can calculate the extensive and intensive entropy it would have, which turns out to depend only on the number of possible distinct tokens. Graphs of what each type of entropy looks like as that number increases are in http://www.derose.net/steve/writings/dissertation/Diss.0.html. The two behave quite differently:

• total (extensive): http://www.derose.net/steve/writings/dissertation/Graph13.png
• relative (intensive): http://www.derose.net/steve/writings/dissertation/Graph14.png

Hope that helps or is at least interesting...

answered Jan 12 at 17:17 by TextGeek
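The shrinking estimates in that list can be reproduced in miniature (an editorial sketch, not from the answer; the toy corpus below is far too small and repetitive to give the published English figures, but the direction of the effect is the same): a unigram model gives one bits-per-letter figure, and conditioning on the previous letter gives a lower one.

    from collections import Counter
    from math import log2

    text = ("the quick brown fox jumps over the lazy dog " * 200).lower()
    text = "".join(ch for ch in text if ch.isalpha() or ch == " ")

    n = len(text)
    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))

    # Zero-order (unigram) estimate: H(X) = -sum p(x) log2 p(x)
    h1 = -sum((c / n) * log2(c / n) for c in uni.values())

    # First-order estimate: H(X | previous) = H(previous, X) - H(previous), approximated from pair counts
    m = sum(bi.values())
    h_pair = -sum((c / m) * log2(c / m) for c in bi.values())
    h2 = h_pair - h1

    print(f"unigram estimate: {h1:.2f} bits per letter")
    print(f"bigram  estimate: {h2:.2f} bits per letter")   # lower, because successive letters are not independent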





























1

I suspect the wording in the German Wikipedia is mistaken. Compressors increase the entropy. That is to say, not the overall entropy, but the entropy per bit: the information density. E.g. some run-length encoding and dictionary scheme is applied to condense the data. Now the same information is packed into fewer bits, so each bit carries more information. The subsequent Huffman coding does a little more of the same; it's just another layer of compression.

answered Jan 12 at 5:49 by Kaz
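A tiny sketch of that density argument (an editorial illustration; the string and the ad-hoc run-length format are made up): after run-length encoding, the same information sits in far fewer bytes, and the empirical entropy per byte of the encoded form is higher than that of the raw form.

    from collections import Counter
    from itertools import groupby
    from math import log2

    def bits_per_byte(s: str) -> float:
        counts = Counter(s)
        n = len(s)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    raw = "aaaaaaaaaabbbbbbbbbbcccccccccc"                          # 30 bytes, very repetitive
    rle = "".join(f"{len(list(g))}{ch}" for ch, g in groupby(raw))  # "10a10b10c", a toy run-length format

    print(len(raw), round(bits_per_byte(raw), 2))   # 30 bytes at ~1.58 bits/byte
    print(len(rle), round(bits_per_byte(rle), 2))   # 9 bytes at ~2.11 bits/byte: denser, same information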





