Least squares failure as a classifier

I was reading Pattern Recognition and Machine Learning by Christopher Bishop, and in chapter 4.1.3 (page 186), on the failure of least-squares classification, I stumbled on this phrase:

"The failure of least squares should not surprise us when we recall that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution."

However, I cannot understand this. What does least squares have to do with a conditional distribution? Why are we talking about a conditional distribution at all, and how can it relate to a Gaussian?

I would be so grateful if you could help me.
Tags: statistics, statistical-inference, conditional-expectation, machine-learning, maximum-likelihood






asked Jan 6 at 19:30 by Hoda Fakharzadeh

1 Answer






Suppose the relationship between the feature vectors $\mathbf x_i$ and the target variables $y_i$ is modelled as

$$y_i = f(\mathbf x_i) + \epsilon,$$

where the function $f$ represents the "true model" and $\epsilon \sim \mathcal N(0, \sigma^2)$ is Gaussian noise.

Then the log-likelihood for the dataset is
$$ \log P(y_1, \dots, y_N \mid \mathbf x_1, \dots, \mathbf x_N) = -\frac{1}{2\sigma^2} \sum_{i=1}^N \bigl(y_i - f(\mathbf x_i)\bigr)^2 - \frac{N}{2} \log(2\pi\sigma^2).$$

Treating $\sigma^2$ as a constant, this log-likelihood is, up to an additive constant, just $-\frac{1}{2\sigma^2}$ times the least-squares loss function

$$ L(y_1, \dots, y_N \mid \mathbf x_1, \dots, \mathbf x_N) = \sum_{i=1}^N \bigl(y_i - f(\mathbf x_i)\bigr)^2.$$

So maximising the log-likelihood (under the assumption that the noise is Gaussian) is equivalent to minimising the least-squares loss function.
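
To make this equivalence concrete, here is a minimal numerical sketch (my own illustrative code, not from the book; the synthetic data, the fixed `sigma`, and the `neg_log_likelihood` helper are all assumptions made for the demonstration). It fits the same linear model once with ordinary least squares and once by directly maximising the Gaussian log-likelihood above, and the two estimates agree up to optimiser tolerance.

```python
# Minimal sketch (illustrative, not from Bishop): maximising the Gaussian
# log-likelihood gives the same parameters as minimising squared error.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, sigma = 200, 0.5

X = np.column_stack([np.ones(N), rng.normal(size=N)])   # bias + one feature
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(scale=sigma, size=N)         # y_i = f(x_i) + eps

# 1) Ordinary least squares: minimises sum_i (y_i - f(x_i))^2.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximum likelihood under y_i | x_i ~ N(f(x_i), sigma^2), sigma held fixed.
def neg_log_likelihood(w):
    resid = y - X @ w
    return resid @ resid / (2 * sigma**2) + N / 2 * np.log(2 * np.pi * sigma**2)

w_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print("least squares     :", w_ls)
print("maximum likelihood:", w_ml)   # same values up to optimiser tolerance
```

The constant term $-\frac{N}{2}\log(2\pi\sigma^2)$ does not depend on the model parameters, which is exactly why the two fits coincide.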



The point that Bishop is making here is that, for classification problems, this Gaussian noise model is not very sensible. For one thing, the targets $y_i$ are always $0$ or $1$ in classification, but the Gaussian noise model can give you fractional fitted values, and even values that are negative or greater than one. A small numerical sketch of this is given below.
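
Here is a similarly hedged sketch of the failure mode, using synthetic one-dimensional two-class data and plain NumPy (the data, the `lstsq_fit` helper and the particular numbers are illustrative, not Bishop's example). It shows two symptoms: the fitted values stray outside $[0, 1]$, and class-$1$ points far from the boundary, although already correctly classified, drag the least-squares decision boundary towards the class-$1$ cluster.

```python
# Sketch of the failure mode (illustrative data, not Bishop's example):
# regressing binary 0/1 targets directly with least squares.
import numpy as np

rng = np.random.default_rng(1)

# Two 1-D Gaussian classes, targets coded as 0 and 1.
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(+1.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])

def lstsq_fit(x, t):
    """Least-squares fit of t ~ w0 + w1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

w = lstsq_fit(x, t)
pred = w[0] + w[1] * x
print("fitted value range:", pred.min(), pred.max())  # typically < 0 and > 1

# Decision boundary = point where the fitted line crosses 0.5.
print("boundary, clean data:", (0.5 - w[0]) / w[1])   # near 0, as expected

# Add class-1 points far to the right.  They are already on the correct side
# of the boundary, yet they dominate the squared error and tilt the line,
# moving the boundary towards the class-1 cluster and misclassifying
# class-1 points near the overlap region.
x2 = np.concatenate([x, rng.normal(8.0, 1.0, 30)])
t2 = np.concatenate([t, np.ones(30)])
w2 = lstsq_fit(x2, t2)
print("boundary, with far-away class-1 points:", (0.5 - w2[0]) / w2[1])
```

A logistic model, which assumes the appropriate Bernoulli conditional distribution instead of a Gaussian, keeps its outputs in $(0, 1)$ and is far less affected by such well-classified but distant points.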






answered Jan 6 at 21:39 by Kenny Wong

• Thank you for your answer. I think I understand it now, but the book also talks about a conditional Gaussian. How is it conditional?
  – Hoda Fakharzadeh
  Jan 6 at 22:21

• @HodaFakharzadeh I suppose you can say $P(y_i \mid \mathbf x_i) = \mathcal N(y_i \mid f(\mathbf x_i), \sigma^2)$.
  – Kenny Wong
  Jan 6 at 22:22