Is it correct to say that neural networks are an alternative way of performing Maximum Likelihood Estimation? If not, why? [duplicate]



This question already has answers here:

  • Can we use MLE to estimate Neural Network weights? (2 answers)
  • What is the difference between Maximum Likelihood Estimation & Gradient Descent? (2 answers)

We often say that minimizing the cross-entropy error is the same as maximizing the likelihood. So can we say that NNs are just an alternative way of performing Maximum Likelihood Estimation? If not, why?



marked as duplicate by kjetil b halvorsen, Robert Long, Siong Thye Goh, mkt, shimao 2 days ago


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
























      neural-networks maximum-likelihood






      asked Apr 11 at 17:17 by aca06























          3 Answers






























          $begingroup$

          There seems to be a misunderstanding about what is actually being asked. There are two questions that the OP might want to ask:



          1. Given some other fixed, parametrized model class that is formulated probabilistically, can we somehow use NNs to directly optimize the likelihood of its parameters? As @Cliff AB posted, this seems strange and unnatural to me: NNs are there for approximating functions. However, I strongly believe that this was not the question.


          2. Given a concrete dataset consisting of 'real' answers $y^{(i)}$ and real $d$-dimensional data vectors $x^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_d)$, and given a fixed NN architecture, we can minimize the cross-entropy loss to find the best parameters. Question: is this the same as maximizing the likelihood of some probabilistic model? (This is the question in the post linked in the comments by @Sycorax.)


          Since the answer in the linked thread is also somewhat lacking in insight, let me try to answer that again. Consider the following very simple neural network with just one node and a sigmoid activation function (and no bias term), i.e. the weights $w = (w_1, \dots, w_d)$ are the parameters and the function is
          $$f_w(x) = \sigma\left(\sum_{j=1}^d w_j x_j\right)$$
          The cross-entropy loss function for a label $y \in \{0, 1\}$ and a prediction $\hat y \in (0, 1)$ is
          $$l(y, \hat y) = -[y \log(\hat y) + (1-y) \log(1-\hat y)]$$



          So given the dataset $y^{(i)}, x^{(i)}$ as above, we form
          $$\sum_{i=1}^n l(y^{(i)}, f_w(x^{(i)}))$$
          and minimize that in order to find the parameters $w$ for the neural network. Let us put that aside for a moment and go for a completely different model.
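The one-node network and the summed loss above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the toy dataset are hypothetical, and the argument order `cross_entropy(y, y_hat)` follows the sum above (label first, prediction second).

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def f_w(w, x):
    """One-node network without bias: sigma(sum_j w_j * x_j)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def cross_entropy(y, y_hat):
    """Per-example loss for a label y in {0, 1} and a prediction y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def total_loss(w, xs, ys):
    """Sum of the per-example losses over the whole dataset."""
    return sum(cross_entropy(y, f_w(w, x)) for x, y in zip(xs, ys))

# Hypothetical toy data: three 2-dimensional inputs with binary labels.
xs = [(1.0, 2.0), (-1.0, 0.5), (0.0, -1.0)]
ys = [1, 0, 0]

# With w = 0 every prediction is 0.5, so the loss is 3 * log(2).
print(total_loss((0.0, 0.0), xs, ys))
```

Training the network means searching for the `w` that makes `total_loss` as small as possible.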



          We assume that there are random variables $(X^{(i)}, Y^{(i)})_{i=1,\dots,n}$ such that the pairs $(X^{(i)}, Y^{(i)})$ are i.i.d. and such that
          $$P[Y^{(i)}=1 \mid X^{(i)}=x^{(i)}] = f_w(x^{(i)})$$
          where again $\theta = w = (w_1, \dots, w_d)$ are the parameters of the model. Let us set up the likelihood: put $Y = (Y^{(1)}, \dots, Y^{(n)})$ and $X = (X^{(1)}, \dots, X^{(n)})$ and $y = (y^{(1)}, \dots, y^{(n)})$ and $x = (x^{(1)}, \dots, x^{(n)})$. Since the $Z^{(i)} = (X^{(i)}, Y^{(i)})$ are independent,
          \begin{align*}
          P[Y=y \mid X=x] &= \prod_{i=1}^n P[Y^{(i)}=y^{(i)} \mid X^{(i)}=x^{(i)}] \\
          &= \prod_{i : y^{(i)}=1} P[Y^{(i)}=1 \mid X^{(i)}=x^{(i)}] \prod_{i : y^{(i)}=0} \left(1 - P[Y^{(i)}=1 \mid X^{(i)}=x^{(i)}]\right) \\
          &= \prod_{i : y^{(i)}=1} f_w(x^{(i)}) \prod_{i : y^{(i)}=0} \left(1 - f_w(x^{(i)})\right) \\
          &= \prod_{i=1}^n \left(f_w(x^{(i)})\right)^{y^{(i)}} \left(1 - f_w(x^{(i)})\right)^{1 - y^{(i)}}
          \end{align*}

          So this is the likelihood. We need to maximize it, which most likely means computing gradients of that expression with respect to $w$. Uuuh, there is an ugly product in front... The product rule $(fg)' = f'g + fg'$ does not look very appealing. Hence we use the following (usual) trick: instead of maximizing the likelihood itself, we take its log and maximize that, which gives the same maximizer because the logarithm is strictly increasing. For technical reasons we actually compute $-\log(\text{likelihood})$ and minimize that. So let us compute $-\log(\text{likelihood})$: using $\log(ab) = \log(a) + \log(b)$ and $\log(a^b) = b\log(a)$ we obtain



          \begin{align*}
          -\log(\text{likelihood}) &= -\log \left( \prod_{i=1}^n \left(f_w(x^{(i)})\right)^{y^{(i)}} \left(1 - f_w(x^{(i)})\right)^{1 - y^{(i)}} \right) \\
          &= - \sum_{i=1}^n \left[ y^{(i)} \log(f_w(x^{(i)})) + (1-y^{(i)}) \log(1-f_w(x^{(i)})) \right]
          \end{align*}



          and if you now compare carefully to the NN model above you will see that this is nothing other than $\sum_{i=1}^n l(y^{(i)}, f_w(x^{(i)}))$.
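The identity can also be checked numerically. The sketch below is an illustration, not part of the derivation: the dataset and the weight vector are made-up values, and the helper names are hypothetical. It computes the likelihood as a product of Bernoulli probabilities and compares its negative log with the summed cross-entropy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """One-node network f_w(x) = sigma(sum_j w_j * x_j)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def likelihood(w, xs, ys):
    """Product over i of f_w(x_i)^y_i * (1 - f_w(x_i))^(1 - y_i)."""
    prob = 1.0
    for x, y in zip(xs, ys):
        p = predict(w, x)
        prob *= p ** y * (1.0 - p) ** (1 - y)
    return prob

def summed_cross_entropy(w, xs, ys):
    """Sum over i of the cross-entropy between label y_i and f_w(x_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical toy data and an arbitrary weight vector.
xs = [(1.0, 2.0), (-1.0, 0.5), (0.0, -1.0), (2.0, -0.5)]
ys = [1, 0, 0, 1]
w = (0.3, -0.7)

nll = -math.log(likelihood(w, xs, ys))
loss = summed_cross_entropy(w, xs, ys)
print(nll, loss)  # the two numbers agree up to floating-point rounding
```

In practice one works with the sum rather than the product, since a product of many probabilities underflows quickly.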



          So yes, in this case these two concepts (maximizing the likelihood of a probabilistic model and minimizing the loss function with respect to the model parameters) actually coincide. This is a more general pattern that occurs with other models as well. Whenever a loss has such a probabilistic interpretation, the connection is (possibly up to additive constants that do not depend on the parameters)



          $$-\log(\text{likelihood}) = \text{loss function}$$
          and
          $$e^{-\text{loss function}} = \text{likelihood}$$



          In that sense, statistics and machine learning are the same thing, just reformulated in a quirky way. Another example is linear regression: there, too, a precise probabilistic model sits behind the loss (with Gaussian noise, the negative log-likelihood is the squared-error loss up to a scale factor and an additive constant); see for example Likelihood in Linear Regression.
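For the linear regression case, a small numerical sketch may help. It assumes a model $y = a x + b + \varepsilon$ with Gaussian noise of known standard deviation `sigma`; the data, the candidate parameters, and the function names are hypothetical. It checks that the Gaussian negative log-likelihood equals the sum of squared errors divided by $2\sigma^2$ plus a constant that does not depend on $(a, b)$, so both objectives have the same minimizer.

```python
import math

def sse(a, b, xs, ys):
    """Sum of squared errors of the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(a, b, sigma, xs, ys):
    """-log of prod_i N(y_i | a*x_i + b, sigma^2)."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = y - (a * x + b)
        total += 0.5 * math.log(2 * math.pi * sigma ** 2) + r ** 2 / (2 * sigma ** 2)
    return total

# Hypothetical toy data and an assumed, fixed noise level.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]
sigma = 0.5

# Additive constant that does not depend on (a, b).
const = len(xs) * 0.5 * math.log(2 * math.pi * sigma ** 2)

for a, b in [(1.0, 0.0), (0.8, 0.2), (1.2, -0.3)]:
    lhs = gaussian_nll(a, b, sigma, xs, ys)
    rhs = sse(a, b, xs, ys) / (2 * sigma ** 2) + const
    print(a, b, lhs, rhs)  # lhs and rhs agree for every (a, b)
```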



          Note that it may be pretty hard to find a natural probabilistic interpretation of a given model. For example, in the case of SVMs the probabilistic description seems to be Gaussian processes: see here.



          The case above, however, was simple, and what I have basically shown you is logistic regression (because a NN with one node and a sigmoid output function is exactly logistic regression!). It may be a lot harder to interpret more complicated architectures (such as CNNs) as probabilistic models.















          $endgroup$












          • Great! Thank you. That's what I was looking for. Yes, I agree that my question was probably not well formulated. – aca06, Apr 12 at 12:03































          In abstract terms, neural networks are models or, if you prefer, functions with unknown parameters, where we try to learn the parameters by minimizing a loss function (not just cross-entropy; there are many other possibilities). Minimizing a loss is in most cases equivalent to maximizing some likelihood function, but as discussed in this thread, it's not that simple.



          You cannot say that they are equivalent, because minimizing a loss, or maximizing a likelihood, is a method of finding the parameters, while a neural network is the function defined in terms of those parameters.








          • I'm trying to parse the distinction that you draw in the second paragraph. If I understand correctly, you would approve of a statement such as "My neural network model maximizes a certain log-likelihood" but not the statement "Neural networks and maximum likelihood estimators are the same concept." Is this a fair assessment? – Sycorax, Apr 11 at 20:02

          • @Sycorax Yes, that is correct. If it is unclear and you have an idea for a better rephrasing, feel free to suggest an edit. – Tim, Apr 11 at 20:03

          • What if we instead compare gradient descent and MLE? It seems to me that they are just two methods for finding the best parameters. – aca06, Apr 11 at 20:13

          • @aca06 Gradient descent is an optimization algorithm; MLE is a method of estimating parameters. You can use gradient descent to find the minimum of the negative (log-)likelihood function (or gradient ascent to maximize the likelihood). – Tim, Apr 11 at 20:17
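To illustrate the distinction drawn in the last comment, here is a minimal sketch in which the estimation criterion is the negative log-likelihood of a one-node sigmoid model and gradient descent is merely the tool used to minimize it. The data, learning rate, and step count are made-up values chosen for the example; the gradient used is the standard one for this model, $\sum_i (p_i - y_i)\, x^{(i)}$ with $p_i$ the predicted probability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, xs, ys):
    """Negative log-likelihood: the criterion that defines the estimate."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def nll_gradient(w, xs, ys):
    """Gradient of the NLL with respect to w: sum_i (p_i - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad

def gradient_descent(xs, ys, lr=0.1, steps=200):
    """The optimization algorithm: one of many ways to minimize the NLL."""
    w = [0.0] * len(xs[0])
    history = [nll(w, xs, ys)]
    for _ in range(steps):
        g = nll_gradient(w, xs, ys)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
        history.append(nll(w, xs, ys))
    return w, history

# Hypothetical toy data, not linearly separable, so the MLE is finite.
xs = [(1.0, 0.5), (2.0, -1.0), (-1.0, 1.5), (-0.5, -2.0), (1.5, 1.0), (-2.0, 0.5)]
ys = [1, 1, 0, 0, 0, 1]

w_hat, history = gradient_descent(xs, ys)
print(w_hat, history[0], history[-1])  # the NLL decreases from start to end
```

Swapping `gradient_descent` for another optimizer (Newton's method, for instance) would change how the estimate is found, but the target would still be the maximum likelihood estimate.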






























          These are fairly orthogonal topics.



          Neural networks are a type of model, typically with a very large number of parameters. Maximum Likelihood Estimation is a very common method for estimating the parameters of a given model from data. Typically, a model lets you compute a likelihood function from the data and a set of parameter values. Since we don't know the actual parameter values, one way of estimating them is to use the values that maximize the likelihood. Neural networks are our model; maximum likelihood estimation is one method for estimating the parameters of our model.



          One slightly technical note is that Maximum Likelihood Estimation is often not exactly what is used for neural networks. That is, many of the regularization methods in use imply that we are not actually maximizing a likelihood function. These include:



          (1) Penalized maximum likelihood. This one is a bit of a cop-out, since it doesn't take much effort to think of a penalized likelihood as just a different objective (i.e., a likelihood with priors) that one is maximizing.



          (2) Random dropout. Especially in many of the newer architectures, the outputs of randomly chosen units are set to 0 during training. This procedure is more clearly outside the realm of maximum likelihood estimation.



          (3) Early stopping. It's not the most popular method, but one way to prevent overfitting is simply to stop the optimization algorithm before it converges. Again, this is technically not maximum likelihood estimation; it's really just an ad hoc solution to overfitting.



          (4) Bayesian methods, probably the most common alternative to Maximum Likelihood Estimation in the statistics world, are also used for estimating the parameter values of a neural network. However, this is often too computationally intensive for large networks.
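Item (1) in the answer above can be made concrete with a small sketch. It assumes the one-node sigmoid model used earlier in this thread, an L2 penalty with a hypothetical strength `lam`, and independent Gaussian priors $w_j \sim N(0, 1/\lambda)$; none of these choices come from the answer itself. The check shows that the penalized loss differs from the negative log of likelihood times prior only by a constant, so minimizing one is the same as maximizing the other.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, xs, ys):
    """Negative log-likelihood of the one-node sigmoid model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def penalized_loss(w, xs, ys, lam):
    """Cross-entropy plus an L2 (weight decay) penalty."""
    return nll(w, xs, ys) + 0.5 * lam * sum(wj ** 2 for wj in w)

def neg_log_joint(w, xs, ys, lam):
    """-log[likelihood * prior] with independent priors w_j ~ N(0, 1/lam)."""
    neg_log_prior = sum(
        0.5 * lam * wj ** 2 + 0.5 * math.log(2 * math.pi / lam) for wj in w
    )
    return nll(w, xs, ys) + neg_log_prior

# Hypothetical toy data and penalty strength.
xs = [(1.0, 2.0), (-1.0, 0.5), (0.0, -1.0), (2.0, -0.5)]
ys = [1, 0, 0, 1]
lam = 2.0

# Constant from the prior's normalization; it does not depend on w.
const = len(xs[0]) * 0.5 * math.log(2 * math.pi / lam)

for w in [(0.0, 0.0), (0.3, -0.7), (-1.5, 2.0)]:
    gap = penalized_loss(w, xs, ys, lam) - (neg_log_joint(w, xs, ys, lam) - const)
    print(w, gap)  # the gap is zero up to floating-point rounding
```

Dropout and early stopping have no equally direct counterpart of this kind, which is why the answer treats them as falling outside plain maximum likelihood.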



















            $begingroup$

            There seems to be a misunderstanding concerning the actual question behind. There are two questions that OP possibly wants to ask:



            1. Given a fixed other parametrized model class that are formulated in a probabilistic way, can we somehow use NNs to very concretely optimize the Likelihood of the parameters? Then as @Cliff AB posted: This seems strange and unnatural for me. NNs are there for approximizing functions. However, I strongly believe that this was not the question.


            2. Given a concrete dataset consisting of 'real' answers $y^(i)$ and real $d$-dimensional data vectors $x^(i) = (x^(i)_1, ..., x^(i)_d)$, and given a fixed architecture of a NN, we can use the cross entropy function in order to find the best parameters. Question: Is this the same as maximizing the likelihood of some probabilistic model (this is the question in the post linked in the comments by @Sycorax).


            Since the answer in the linked thread is also somewhat missing insight let me try to answer that again. We are going to consider the following very simple neural network with just one node and sigmoid activation function (and no bias term), i.e. the weights $w = (w_1, ..., w_d)$ are the parameters and the function is:
            $$f_w(x) = sigmaleft(sum_j=1^d w_j x_jright)$$
            The cross entropy loss function is
            $$l(haty, y) = -[y log(haty) + (1-y) log(1-haty)] $$



            So given the dataset $y^(i), x^(i)$ as above, we form
            $$sum_i=1^n l(y^(i), f_w(x^(i)))$$
            and minimize that in order to find the parameters $w$ for the neural network. Let us put that aside for a moment and go for a completely different model.



            We assume that there are random variables $(X^{(i)}, Y^{(i)})_{i=1,...,n}$ such that the $(X^{(i)}, Y^{(i)})$ are i.i.d. and such that
            $$P[Y^{(i)}=1|X^{(i)}=x^{(i)}] = f_w(x^{(i)})$$
            where again, $\theta=w=(w_1,...,w_d)$ are the parameters of the model. Let us set up the likelihood: Put $Y = (Y^{(1)}, ..., Y^{(n)})$ and $X = (X^{(1)}, ..., X^{(n)})$ and $y = (y^{(1)}, ..., y^{(n)})$ and $x = (x^{(1)}, ..., x^{(n)})$. Since the $Z^{(i)} = (X^{(i)}, Y^{(i)})$ are independent,
            \begin{align*}
            P[Y=y|X=x] &= \prod_{i=1}^n P[Y^{(i)}=y^{(i)}|X^{(i)}=x^{(i)}] \\
            &= \prod_{i : y^{(i)}=1} P[Y^{(i)}=1|X^{(i)}=x^{(i)}] \prod_{i : y^{(i)}=0} (1 - P[Y^{(i)}=1|X^{(i)}=x^{(i)}]) \\
            &= \prod_{i : y^{(i)}=1} f_w(x^{(i)}) \prod_{i : y^{(i)}=0} (1 - f_w(x^{(i)})) \\
            &= \prod_{i=1}^n \left(f_w(x^{(i)})\right)^{y^{(i)}} \left(1 - f_w(x^{(i)})\right)^{1 - y^{(i)}}
            \end{align*}

            So this is the likelihood. We would need to maximize that, i.e. most probably we need to compute some gradients of that expression with respect to $w$. Uuuh, there is an ugly product in front... The rule $(fg)' = f'g + fg'$ does not look very appealing. Hence we do the following (usual) trick: We do not maximize the likelihood itself but compute its log and maximize that instead (the log is strictly increasing, so the maximizer is the same). For technical reasons we actually compute $-\log(\text{likelihood})$ and minimize that... So let us compute $-\log(\text{likelihood})$: Using $\log(ab) = \log(a) + \log(b)$ and $\log(a^b) = b\log(a)$ we obtain



            \begin{align*}
            -\log(\text{likelihood}) &= -\log \left( \prod_{i=1}^n \left(f_w(x^{(i)})\right)^{y^{(i)}} \left(1 - f_w(x^{(i)})\right)^{1 - y^{(i)}} \right) \\
            &= - \sum_{i=1}^n \left[ y^{(i)} \log(f_w(x^{(i)})) + (1-y^{(i)}) \log(1-f_w(x^{(i)})) \right]
            \end{align*}



            and if you now compare carefully to the NN model above you will see that this is actually nothing else than $\sum_{i=1}^n l(y^{(i)}, f_w(x^{(i)}))$.
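For readers who prefer to check this numerically, here is a minimal sketch in plain Python. The toy data `xs`, `ys` and the weight vector `w` are made up for illustration, and the function names are my own; the code simply evaluates the summed cross entropy loss and the negative log of the product likelihood for the one-node sigmoid model and shows that they agree up to floating point error.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_w(w, x):
    # One node, sigmoid activation, no bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def cross_entropy(y, y_hat):
    # l(y, y_hat) for a binary label y and a prediction y_hat in (0, 1).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def summed_loss(w, xs, ys):
    # The "neural network" objective: sum of per-example cross entropies.
    return sum(cross_entropy(y, f_w(w, x)) for x, y in zip(xs, ys))

def likelihood(w, xs, ys):
    # The "statistics" objective: product of Bernoulli probabilities.
    p = 1.0
    for x, y in zip(xs, ys):
        q = f_w(w, x)
        p *= q ** y * (1 - q) ** (1 - y)
    return p

# Hypothetical toy data and weights, chosen only for illustration.
xs = [(0.5, -1.0), (1.5, 2.0), (-0.3, 0.8), (2.0, -0.5)]
ys = [0, 1, 1, 0]
w = (0.4, -0.7)

loss = summed_loss(w, xs, ys)
neg_log_lik = -math.log(likelihood(w, xs, ys))
print(loss, neg_log_lik)
```

The two printed numbers should match up to rounding, for any choice of `w`, which is exactly the identity derived above.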



            So yes, in this case these two concepts (maximizing a likelihood of a probabilistic model and minimizing the loss function w.r.t. a model parameter) actually coincide. This is a more general pattern that occurs with other models as well. The connection is always



            $$-\log(\text{likelihood}) = \text{loss function}$$
            and
            $$e^{-\text{loss function}} = \text{likelihood}$$



            In that sense, statistics and machine learning are the same thing, just reformulated in a quirky way. Another example would be linear regression: There also exists a precise mathematical description of the probabilistic model behind it, see for example Likelihood in Linear Regression.
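To make the linear regression case slightly more concrete, here is a sketch of the same computation. It assumes the usual Gaussian noise model, which is not spelled out above: $Y^{(i)} = w^\top x^{(i)} + \varepsilon^{(i)}$ with i.i.d. noise $\varepsilon^{(i)} \sim N(0, \sigma^2)$ and a fixed $\sigma^2$.

\begin{align*}
-\log(\text{likelihood}) &= -\log \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - w^\top x^{(i)})^2}{2\sigma^2}\right) \\
&= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^n (y^{(i)} - w^\top x^{(i)})^2
\end{align*}

The first term does not depend on $w$ and the factor $\frac{1}{2\sigma^2}$ is positive, so minimizing this in $w$ is the same as minimizing the squared error loss. Note that under this assumption the identity between loss and negative log-likelihood holds only up to an additive constant and a positive scaling.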



            Notice that it may be pretty hard to figure out a natural explanation for the probabilistic version of a model. For example: in case of SVMs, the probabilistic description seems to be Gaussian Processes: see here.



            The case above however was simple and what I have basically shown you is logistic regression (because a NN with one node and sigmoid output function is exactly logistic regression!). It may be a lot harder to interpret complicated architectures (with tweaks like CNNs etc) as a probabilistic model.















            $endgroup$













            answered Apr 12 at 6:39









            Fabian Werner












            • $begingroup$
              Great! Thank you. That's what I was looking for. Yes, I agree that probably my question is not well formulated.
              $endgroup$
              – aca06
              Apr 12 at 12:03































            3












            $begingroup$

            In abstract terms, neural networks are models, or if you prefer, functions with unknown parameters, where we try to learn the parameters by minimizing a loss function (not just cross entropy, there are many other possibilities). In most cases, minimizing a loss is equivalent to maximizing some likelihood function, but as discussed in this thread, it's not that simple.



            You cannot say that they are equivalent, because minimizing a loss, or maximizing a likelihood, is a method of finding the parameters, while a neural network is the function defined in terms of those parameters.















            $endgroup$








            • 1




              $begingroup$
              I'm trying to parse the distinction that you draw in the second paragraph. If I understand correctly, you would approve of a statement such as "My neural network model maximizes a certain log-likelihood" but not the statement "Neural networks and maximum likelihood estimators are the same concept." Is this a fair assessment?
              $endgroup$
              – Sycorax
              Apr 11 at 20:02







            • 1




              $begingroup$
              @Sycorax yes, that is correct. If it is unclear and you have an idea for a better rephrasing, feel free to suggest an edit.
              $endgroup$
              – Tim
              Apr 11 at 20:03






            • 1




              $begingroup$
              What if, instead, we compare gradient descent and MLE? It seems to me that they are just two methods for finding the best parameters.
              $endgroup$
              – aca06
              Apr 11 at 20:13






            • 2




              $begingroup$
              @aca06 gradient descent is an optimization algorithm; MLE is a method of estimating parameters. You can use gradient descent to find the minimum of the negative likelihood function (or gradient ascent to maximize the likelihood).
              $endgroup$
              – Tim
              Apr 11 at 20:17
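To illustrate the distinction drawn in the comment above, here is a minimal sketch in plain Python where the estimation criterion (the negative log-likelihood of a one-node sigmoid model) and the optimization algorithm (plain gradient descent) are separate functions. The toy data, the learning rate `lr`, the step count and all function names are assumptions made for this example only. For numerical convenience it minimizes the negative log-likelihood, which has the same minimizer as the negative likelihood.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def neg_log_likelihood(w, xs, ys):
    # The estimation criterion: this is what "MLE" refers to.
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(w, xs, ys):
    # d/dw_j of the negative log-likelihood: sum_i (p_i - y_i) * x_ij.
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = predict(w, x)
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad

def gradient_descent(xs, ys, lr=0.1, steps=500):
    # The optimization algorithm: it only needs a gradient and knows
    # nothing about likelihoods as such.
    w = [0.0] * len(xs[0])
    history = [neg_log_likelihood(w, xs, ys)]
    for _ in range(steps):
        g = gradient(w, xs, ys)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
        history.append(neg_log_likelihood(w, xs, ys))
    return w, history

# Hypothetical, deliberately non-separable toy data.
xs = [(1.0, 0.2), (0.8, -0.5), (-1.0, 0.3), (-0.6, -0.9),
      (0.3, 1.0), (-0.2, -1.1), (0.9, 0.1)]
ys = [1, 1, 0, 0, 0, 1, 0]

w_hat, history = gradient_descent(xs, ys)
print(w_hat, history[0], history[-1])
```

Swapping `gradient_descent` for another optimizer would leave the criterion unchanged, and swapping `neg_log_likelihood` for another loss would leave the optimizer unchanged.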

























            answered Apr 11 at 19:47









            Tim












            1












            $begingroup$

            These are fairly orthogonal topics.



            Neural networks are a type of model which typically has a very large number of parameters. Maximum Likelihood Estimation is a very common method for estimating the parameters of a given model from data. Typically, a model will allow you to compute a likelihood function from the model, the data and the parameter values. Since we don't know what the actual parameter values are, one way of estimating them is to use the values that maximize the given likelihood. Neural networks are our model; maximum likelihood estimation is one method for estimating the parameters of our model.



            One slightly technical note is that often, Maximum Likelihood Estimation is not exactly used in Neural Networks. That is, there are a lot of regularization methods used that imply we're not actually maximizing a likelihood function. These include:



            (1) Penalized maximum likelihood. This one is a bit of a cop-out, as it doesn't take much effort to think of a penalized likelihood as just a different likelihood (i.e., one with priors) that one is maximizing.



            (2) Dropout. Especially in many of the newer architectures, randomly chosen units (or, in some variants, individual weights) are set to 0 during training. This procedure is more definitely outside the realm of maximum likelihood estimation.



            (3) Early stopping. It's not the most popular method, but one way to prevent overfitting is just to stop the optimization algorithm before it converges. Again, this is technically not maximum likelihood estimation; it's really just an ad-hoc solution to overfitting.



            (4) Bayesian methods, probably the most common alternative to Maximum Likelihood Estimation in the statistics world, are also used for estimating the parameter values of a neural network. However, this is often too computationally intensive for large networks.
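As a small illustration of point (1), the following sketch checks numerically that an L2-penalized negative log-likelihood differs from the negative log of "likelihood times Gaussian prior" only by a constant that does not depend on the weights. The one-node sigmoid model, the toy data, the penalty strength `lam` and the choice of an independent $N(0, 1/\lambda)$ prior on each weight are all assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(w, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def penalized_loss(w, xs, ys, lam):
    # Negative log-likelihood plus an L2 (weight decay) penalty.
    return neg_log_likelihood(w, xs, ys) + 0.5 * lam * sum(wj * wj for wj in w)

def neg_log_gaussian_prior(w, lam):
    # Independent N(0, 1/lam) prior on each weight (assumed for illustration).
    var = 1.0 / lam
    return sum(0.5 * wj * wj / var + 0.5 * math.log(2 * math.pi * var) for wj in w)

def neg_log_unnormalized_posterior(w, xs, ys, lam):
    # -log(likelihood * prior); the evidence term is dropped.
    return neg_log_likelihood(w, xs, ys) + neg_log_gaussian_prior(w, lam)

# Hypothetical toy data.
xs = [(1.0, 0.2), (0.8, -0.5), (-1.0, 0.3), (0.3, 1.0)]
ys = [1, 1, 0, 0]
lam = 2.0

w_a = (0.4, -0.7)
w_b = (-1.3, 0.9)
diff_a = penalized_loss(w_a, xs, ys, lam) - neg_log_unnormalized_posterior(w_a, xs, ys, lam)
diff_b = penalized_loss(w_b, xs, ys, lam) - neg_log_unnormalized_posterior(w_b, xs, ys, lam)
print(diff_a, diff_b)
```

Since the difference is the same for every weight vector, both objectives have the same minimizer, which is the sense in which the penalized fit can be read as maximizing a likelihood with priors.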

















            $endgroup$

























                edited Apr 11 at 22:24

























                answered Apr 11 at 22:08









                Cliff AB
