<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_notes" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Notes – Chapter 8: Neural Networks</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_notes_top">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_notes_top" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
You can sequence through the Neural Networks lecture video and note segments (go to Next page). </p><p>
You can also (or alternatively) download the <a href="/assets/courseware/v1/9c36c444e5df10eef7ce4d052e4a2ed1/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Neural_Networks.pdf" target="_blank">Chapter 8: Neural Networks</a> notes as a PDF file. </p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05a_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - basic element</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05a">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05a" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - basic element</h3>
<p>[Lecture video available on YouTube: jUcdIVQyXow]</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_top_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Introduction to neural networks</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_top">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_top" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Unless you live under a rock with no internet access, you've been hearing a lot about “neural networks.” Now that we have several useful machine-learning concepts (hypothesis classes, classification, regression, gradient descent, regularization, etc.), we are well equipped to understand neural networks in detail. </p><p>
This is, in some sense, the “third wave” of neural nets. The basic idea is founded on the 1943 model of neurons of McCulloch and Pitts and the learning ideas of Hebb. There was a great deal of excitement, but not a lot of practical success: there were good training methods (e.g., perceptron) for linear functions, and interesting examples of non-linear functions, but no good way to train non-linear functions from data. Interest died out for a while, but was re-kindled in the 1980s <span options="" class="marginote"><span class="marginote_desc" style="display:none">As with many good ideas in science, the basic idea for how to train non-linear neural networks with gradient descent was independently developed by more than one researcher.</span><span>when several people </span></span> came up with a way to train neural networks with “back-propagation,” a particular style of implementing gradient descent that we will study here. By the mid-90s, the enthusiasm waned again because, although we could train non-linear networks, the training tended to be slow and was plagued by the problem of getting stuck in local optima. Support vector machines (<i class="sc">svm</i>s) (regularization of high-dimensional hypotheses by seeking to maximize the margin) and kernel methods (an efficient and beautiful way of using feature transformations to non-linearly map data into a higher-dimensional space) provided reliable learning methods with guaranteed convergence and no local optima. </p><p>
However, during the <i class="sc">svm</i> enthusiasm, several groups kept working on neural networks, and their work, in combination with an increase in available data and computation, has made them rise again. They have become much more reliable and capable, and are now the method of choice in many applications. There are many, <span options="" class="marginote"><span class="marginote_desc" style="display:none">The number increases daily, as may be seen on <tt class="tt">arxiv.org</tt>.</span><span>many </span></span> variations of neural networks, which we can't even begin to survey. We will study the core “feed-forward" networks with “back-propagation" training, and then, in later chapters, address some of the major advances beyond this core. </p><p>
We can view neural networks from several different perspectives: </p><p><img src="/assets/courseware/v1/f9bd4061ff4cc13882d7a335c7def1c6/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_top_description_1-crop.png" width="895"/></p><p>
We will mostly take view 1, with the understanding that the techniques we develop will enable the applications in view 3. View 2 was a major motivation for the early development of neural networks, but the techniques we will <span options="" class="marginote"><span class="marginote_desc" style="display:none">Some prominent researchers are, in fact, working hard to find analogues of these methods in the brain</span><span>study do not </span></span> seem to actually account for the biological learning processes in brains. </p><p>
</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_basic_element_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Basic element</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_basic_element">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_basic_element" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
The basic element of a neural network is a “neuron," pictured schematically below. We will also sometimes refer to a neuron as a “unit" or “node." </p><center><p><img src="/assets/courseware/v1/a6adfc760c2ced88a188cf75f4299214/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_basic_element_tikzpicture_1-crop.png" width="453"/></p></center><p>
It is a non-linear function mapping an input vector [mathjaxinline]x \in \mathbb {R}^ m[/mathjaxinline] <span options="" class="marginote"><span class="marginote_desc" style="display:none">Sorry for changing our notation here. We were using [mathjaxinline]d[/mathjaxinline] as the dimension of the input, but we are trying to be consistent here with many other accounts of neural networks. It is impossible to be consistent with all of them though—there are many different ways of telling this story.</span><span>note</span></span> to a single output value [mathjaxinline]a \in \mathbb {R}[/mathjaxinline]. It is parameterized by a vector of <em>weights</em> [mathjaxinline](w_1, \ldots , w_ m) \in \mathbb {R}^ m[/mathjaxinline] and an <em>offset</em> or <em>threshold</em> [mathjaxinline]w_0 \in \mathbb {R}[/mathjaxinline]. <span options="" class="marginote"><span class="marginote_desc" style="display:none">This should remind you of our [mathjaxinline]\theta[/mathjaxinline] and [mathjaxinline]\theta _0[/mathjaxinline] for linear models.</span><span>note</span></span> In order for the neuron to be non-linear, we also specify an <em>activation function</em> [mathjaxinline]f : \mathbb {R}\rightarrow \mathbb {R}[/mathjaxinline], which can be the identity ([mathjaxinline]f(x) = x[/mathjaxinline]), but can also be any other function, though we will only be able to work with it if it is differentiable. </p><p>
The function represented by the neuron is expressed as: </p><table id="a0000000002" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]a = f(z) = f\left(\sum _{j=1}^ m x_ jw_ j + w_0\right) = f(w^ Tx + w_0)\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
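To make this concrete, here is a minimal numpy sketch of a single neuron; the function name and the default identity activation are illustrative choices, not part of the course code. </p><pre><code>import numpy as np

# One neuron: weighted sum of the inputs plus an offset, passed through f.
def neuron(x, w, w0, f=lambda z: z):
    """x, w: length-m vectors; w0: scalar offset; f: activation function."""
    z = np.dot(w, x) + w0   # pre-activation z = w^T x + w_0
    return f(z)             # activation a = f(z)
</code></pre><p>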
Before thinking about a whole network, we can consider how to train a single unit. Given a loss function [mathjaxinline]L(\text {\it guess}, \text {\it actual})[/mathjaxinline] and a dataset [mathjaxinline]\{ (x^{(1)}, y^{(1)}), \ldots , (x^{(n)},y^{(n)})\}[/mathjaxinline], we can do (stochastic) gradient descent, adjusting the weights [mathjaxinline]w, w_0[/mathjaxinline] to minimize </p><table id="a0000000003" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]J(w, w_0) = \sum _{i} L\left(NN(x^{(i)}; w, w_0), y^{(i)}\right)\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]NN[/mathjaxinline] is the output of our neural net for a given input. </p><p>
We have already studied two special cases of the neuron: linear classifiers with hinge loss and regressors with quadratic loss! Both of these use the activation function [mathjaxinline]f(x) = x[/mathjaxinline]. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> Just for a single neuron, imagine, for some reason, that we decide to use activation function [mathjaxinline]f(z) = e^ z[/mathjaxinline] and loss function [mathjaxinline]L(g, a) = (g - a)^2[/mathjaxinline]. Derive a gradient descent update for [mathjaxinline]w[/mathjaxinline] and [mathjaxinline]w_0[/mathjaxinline]. </span> <br/> <br/></p>
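<p>A hedged sketch of this procedure for a single unit, with a generic differentiable activation [mathjaxinline]f[/mathjaxinline] (its derivative <tt class="tt">df</tt> and the loss derivative <tt class="tt">dL</tt> are passed in; all names are illustrative, not course code). You can specialize it to the study question's choices once you have derived the update by hand.</p><pre><code>import numpy as np

def sgd_single_unit(X, Y, f, df, dL, steps=1000, lr=0.01):
    """X: n x m data; Y: length-n targets; dL(guess, actual) = dLoss/dguess."""
    n, m = X.shape
    w, w0 = np.zeros(m), 0.0
    for _ in range(steps):
        i = np.random.randint(n)   # pick one training example at random
        x, y = X[i], Y[i]
        z = w @ x + w0             # pre-activation
        a = f(z)                   # the unit's output (the "guess")
        dz = dL(a, y) * df(z)      # dLoss/dz by the chain rule
        w -= lr * dz * x           # since dz/dw = x
        w0 -= lr * dz              # since dz/dw0 = 1
    return w, w0
</code></pre>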
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05b_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - layer definition</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05b">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05b" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - layer definition</h3>
<p>[Lecture video available on YouTube: 30b8FPTPVik]</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05c_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - many layers</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05c">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05c" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - many layers</h3>
<p>[Lecture video available on YouTube: z-rkib02AuI]</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_networks_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Networks</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_networks">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_networks" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Now, we'll put multiple neurons together into a <em>network</em>. A neural network in general takes in an input [mathjaxinline]x \in \mathbb {R}^ m[/mathjaxinline] and generates an output [mathjaxinline]a \in \mathbb {R}^ n[/mathjaxinline]. It is constructed out of multiple neurons; the inputs of each neuron might be elements of [mathjaxinline]x[/mathjaxinline] and/or outputs of other neurons. The outputs are generated by [mathjaxinline]n[/mathjaxinline] <em>output units</em>. </p><p>
In this chapter, we will only consider <i class="it">feed-forward</i> networks. In a feed-forward network, you can think of the network as defining a function-call graph that is <em>acyclic</em>: that is, the input to a neuron can never depend on that neuron's output. Data flows, one way, from the inputs to the outputs, and the function computed by the network is just a composition of the functions computed by the individual neurons. </p><p>
Although the graph structure of a neural network can really be anything (as long as it satisfies the feed-forward constraint), for simplicity in software and analysis, we usually organize them into <em>layers</em>. A layer is a group of neurons that are essentially “in parallel": their inputs are outputs of neurons in the previous layer, and their outputs are the input to the neurons in the next layer. We'll start by describing a single layer, and then go on to the case of multiple layers. </p><p><h3>Single layer</h3> A <em>layer</em> is a set of units that, as we have just described, are not connected to each other. The layer is called <em>fully connected</em> if, as in the diagram below, the inputs to each unit in the layer are the same (i.e. [mathjaxinline]x_1, x_2, \ldots x_ m[/mathjaxinline] in this case). A layer has input [mathjaxinline]x \in \mathbb {R}^ m[/mathjaxinline] and output (also known as <em>activation</em>) [mathjaxinline]a \in \mathbb {R}^ n[/mathjaxinline]. </p><center><p><img src="/assets/courseware/v1/b9cdd86c9c8c5c06b57532367391bd64/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_networks_tikzpicture_1-crop.png" width="353"/></p></center><p>
Since each unit has a vector of weights and a single offset, we can think of the weights of the whole layer as a matrix, [mathjaxinline]W[/mathjaxinline], and the collection of all the offsets as a vector [mathjaxinline]W_0[/mathjaxinline]. If we have [mathjaxinline]m[/mathjaxinline] inputs, [mathjaxinline]n[/mathjaxinline] units, and [mathjaxinline]n[/mathjaxinline] outputs, then </p><ul class="itemize"><li><p>
[mathjaxinline]W[/mathjaxinline] is an [mathjaxinline]m\times n[/mathjaxinline] matrix, </p></li><li><p>
[mathjaxinline]W_0[/mathjaxinline] is an [mathjaxinline]n \times 1[/mathjaxinline] column vector, </p></li><li><p>
[mathjaxinline]X[/mathjaxinline], the input, is an [mathjaxinline]m \times 1[/mathjaxinline] column vector, </p></li><li><p>
[mathjaxinline]Z = W^ T X + W_0[/mathjaxinline], the <em>pre-activation</em>, is an [mathjaxinline]n \times 1[/mathjaxinline] column vector, </p></li><li><p>
[mathjaxinline]A[/mathjaxinline], the <em>activation</em>, is an [mathjaxinline]n \times 1[/mathjaxinline] column vector, </p></li></ul><p>
and the output vector is </p><table id="a0000000004" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]A = f(Z) = f(W^ TX + W_0)\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
The activation function [mathjaxinline]f[/mathjaxinline] is applied element-wise to the pre-activation values [mathjaxinline]Z[/mathjaxinline]. </p><p>
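For concreteness, here is a minimal numpy sketch of one fully connected layer, following the shapes listed above (the function name is illustrative, not from the course code): </p><pre><code>import numpy as np

def layer_forward(X, W, W0, f):
    """X: m x 1 input; W: m x n weights; W0: n x 1 offsets; f: activation."""
    Z = W.T @ X + W0   # pre-activation, n x 1
    return f(Z)        # activation A = f(Z), applied element-wise

# e.g., A = layer_forward(X, W, W0, np.tanh)
</code></pre><p>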
What can we do with a single layer? We have already seen single-layer networks, in the form of linear separators and linear regressors. All we can do with a single layer is make a linear hypothesis (with some possible non-linear transformation applied to the output). The whole reason for moving to neural networks is to move in the direction of <em>non-linear</em> hypotheses. To do this, we will have to consider multiple layers. </p><p><h3>Many layers</h3> A single neural network generally combines multiple layers, most typically by feeding the outputs of one layer into the inputs of another layer. </p><p>
We have to start by establishing some nomenclature. We will use [mathjaxinline]l[/mathjaxinline] to name a layer, and let [mathjaxinline]m^ l[/mathjaxinline] be the number of inputs to the layer and [mathjaxinline]n^ l[/mathjaxinline] be the number of outputs from the layer. Then, [mathjaxinline]W^ l[/mathjaxinline] and [mathjaxinline]W^ l_0[/mathjaxinline] are of shape [mathjaxinline]m^ l \times n^ l[/mathjaxinline] and [mathjaxinline]n^ l \times 1[/mathjaxinline], respectively. Let [mathjaxinline]f^ l[/mathjaxinline] be the activation function of layer [mathjaxinline]l[/mathjaxinline]. <span options="" class="marginote"><span class="marginote_desc" style="display:none">It is technically possible to have different activation functions within the same layer, but, again, for convenience in specification and implementation, we generally have the same activation function within a layer.</span><span>note</span></span> Then, the pre-activation outputs are the [mathjaxinline]n^ l \times 1[/mathjaxinline] vector </p><table id="a0000000005" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]Z^ l = {W^ l}^ TA^{l-1} + W^ l_0[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
and the activation outputs are simply the [mathjaxinline]n^ l \times 1[/mathjaxinline] vector </p><table id="a0000000006" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]A^ l = f^ l(Z^ l)\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
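A feed-forward pass is then just a loop over these two equations. The sketch below assumes <tt class="tt">weights</tt> is a list of [mathjaxinline](W^ l, W^ l_0)[/mathjaxinline] pairs and <tt class="tt">activations</tt> a matching list of functions [mathjaxinline]f^ l[/mathjaxinline]; these names are our own, not the notes'. </p><pre><code>def network_forward(X, weights, activations):
    A = X                            # A^0 is the input
    for (W, W0), f in zip(weights, activations):
        Z = W.T @ A + W0             # Z^l = (W^l)^T A^{l-1} + W^l_0
        A = f(Z)                     # A^l = f^l(Z^l)
    return A                         # A^L, the network's output
</code></pre><p>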
Here's a diagram of a many-layered network, with two blocks for each layer, one representing the linear part of the operation and one representing the non-linear activation function. We will use this structural decomposition to organize our algorithmic thinking and implementation. </p><center><p><img src="/assets/courseware/v1/a17668f7d54c60e5c12bae1aed3aef46/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_networks_tikzpicture_2-crop.png" width="866"/></p></center><p>
</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05d_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - activation functions</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05d">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05d" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - activation functions</h3>
<p>[Lecture video available on YouTube: SrV8aS20698]</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_choices_of_activation_function_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Choices of activation function</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_choices_of_activation_function">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_choices_of_activation_function" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
There are many possible choices for the activation function. We will start by thinking about whether it's really necessary to have an [mathjaxinline]f[/mathjaxinline] at all. </p><p>
What happens if we let [mathjaxinline]f[/mathjaxinline] be the identity? Then, in a network with [mathjaxinline]L[/mathjaxinline] layers (we'll leave out [mathjaxinline]W_0[/mathjaxinline] for simplicity, but keeping it wouldn't change the form of this argument), </p><table id="a0000000007" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]A^ L = {W^ L}^ T A^{L-1} = {W^ L}^ T {W^{L-1}}^ T \cdots {W^1}^ T X\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
So, multiplying out the weight matrices, we find that </p><table id="a0000000008" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]A^ L = W^\text {total}X\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
which is a <em>linear</em> function of [mathjaxinline]X[/mathjaxinline]! Having all those layers did not change the representational capacity of the network: the non-linearity of the activation function is crucial. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Convince yourself that any function representable by any number of linear layers (where [mathjaxinline]f[/mathjaxinline] is the identity function) can be represented by a single layer.</span> <br/></p><p>
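Here is a quick numerical check of this collapse (a hedged sketch; the shapes and random seed are arbitrary, and numpy is assumed): </p><pre><code>import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((5, 4)), rng.standard_normal((4, 3))
X = rng.standard_normal((5, 1))
two_layers = W2.T @ (W1.T @ X)             # A^2 with identity activations
one_layer = (W1 @ W2).T @ X                # a single layer with W_total = W^1 W^2
print(np.allclose(two_layers, one_layer))  # True
</code></pre><p>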
Now that we are convinced we need a non-linear activation, let's examine a few common choices. </p><p><img src="/assets/courseware/v1/44fef44c23ecc5ddbed294e70a963b81/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_choices_of_activation_function_description_1-crop.png" width="895"/></p><center><p><img src="/assets/courseware/v1/365f6085d7d987a6632151fe98514829/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_choices_of_activation_function_tikzpicture_1-crop.png" width="958"/></p></center><p>
The original idea for neural networks involved using the <b class="bf">step</b> function as an activation, but because its derivative is zero everywhere except at the discontinuity (where it is undefined), gradient-descent methods cannot use it to tune the weights in a network with step functions, so we won't consider them further. They have been replaced, in a sense, by the sigmoid, relu, and tanh activation functions. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> Consider sigmoid, relu, and tanh activations. Which one is most like a step function? Is there an additional parameter you could add to a sigmoid that would make it more like a step function? </span> <br/> <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> What is the derivative of the relu function? Are there some values of the input for which the derivative vanishes? </span> <br/></p><p>
ReLUs are especially common in internal (“hidden”) layers; for the output, sigmoid activations are common for binary classification and softmax for multi-class classification (softmax is described later in the course). </p>
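<p>For reference, here is a small numpy sketch of these activation functions and their derivatives (an illustration, not course code; try the study questions above before reading the relu derivative below).</p><pre><code>import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)   # derivative is 0 wherever it is defined

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return np.where(z > 0, 1.0, 0.0)   # vanishes for all negative inputs

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2
</code></pre>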
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05e_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - training and back-propagation</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05e">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05e" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - training and back-propagation</h3>
<p>[Lecture video available on YouTube: 2N7N6PJWqM0]</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_error_back-propagation_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Error back-propagation</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_error_back-propagation">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_error_back-propagation" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
We will train neural networks using gradient descent methods. It's possible to use <em>batch</em> gradient descent, in which we sum up the gradient over all the points, or stochastic gradient descent (<i class="sc">sgd</i>), in which we take a small step with respect to the gradient for a single point at a time, both as described in the earlier chapter on gradient descent. </p><p>
Our notation is going to get pretty hairy pretty quickly. To keep it as simple as we can, we'll focus on computing the contribution of one data point [mathjaxinline]x^{(i)}[/mathjaxinline] to the gradient of the loss with respect to the weights, for <i class="sc">sgd</i>; you can simply sum up these gradients over all the data points if you wish to do batch descent. </p><p>
So, to do <i class="sc">sgd</i> for a training example [mathjaxinline](x, y)[/mathjaxinline], we need to compute [mathjaxinline]\nabla _ W \text {Loss}(NN(x;W),y)[/mathjaxinline], where [mathjaxinline]W[/mathjaxinline] represents all weights [mathjaxinline]W^ l, W_0^ l[/mathjaxinline] in all the layers [mathjaxinline]l = (1, \ldots , L)[/mathjaxinline]. This seems terrifying, but is actually quite easy to do using <span options="" class="marginote"><span class="marginote_desc" style="display:none">Remember the chain rule! If [mathjaxinline]a = f(b)[/mathjaxinline] and [mathjaxinline]b = g(c)[/mathjaxinline] (so that<br/>[mathjaxinline]a = f(g(c))[/mathjaxinline]), then <br/>[mathjaxinline]\frac{d a}{d c} = \frac{d a}{d b} \cdot \frac{d b}{d c} = f'(b) g'(c) = f'(g(c)) g'(c)[/mathjaxinline].</span><span>the chain rule. </span></span> </p><p>
Remember that we are always computing the gradient of the loss function <em>with respect to the weights</em> for a particular value of [mathjaxinline](x, y)[/mathjaxinline]. That tells us how much we want to change the weights, in order to reduce the loss incurred on this particular training example. </p><p>
First, let's see how the loss depends on the weights in the final layer, [mathjaxinline]W^ L[/mathjaxinline]. Remembering that our output is [mathjaxinline]A^ L[/mathjaxinline], and using the shorthand [mathjaxinline]\text {loss}[/mathjaxinline] to stand for [mathjaxinline]\text {Loss}(NN(x;W),y)[/mathjaxinline], which is equal to [mathjaxinline]\text {Loss}(A^ L, y)[/mathjaxinline], and finally that [mathjaxinline]A^ L = f^ L(Z^ L)[/mathjaxinline] and [mathjaxinline]Z^ L = {W^ L}^ T A^{L-1}[/mathjaxinline], we can use the chain rule: </p><table id="a0000000009" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \text {loss}}{\partial W^ L} = \underbrace{ \frac{\partial \text {loss}}{\partial A^ L}}_{\text {depends on loss function}} \cdot \underbrace{\frac{\partial A^ L}{\partial Z^ L}}_{f^{L'}} \cdot \underbrace{\frac{\partial Z^ L}{\partial W^ L}}_{\text {$A^{L-1}$}} \; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p><span options="" class="marginote"><span class="marginote_desc" style="display:none">It might reasonably bother you that [mathjaxinline]\partial {Z^ L}/\partial {W^ L} = A^{L-1}[/mathjaxinline]. We're somehow thinking about the derivative of a vector with respect to a matrix, which seems like it might need to be a three-dimensional thing. But note that [mathjaxinline]\partial {Z^ L}/\partial {W^ L}[/mathjaxinline] is really [mathjaxinline]\partial {{W^ L}^ T A^{L-1}}/\partial {W^ L}[/mathjaxinline] and it seems okay in at least an informal sense that it's [mathjaxinline]A^{L-1}[/mathjaxinline].</span><span>note</span></span></p><p>
To actually get the dimensions to match, we need to write this a bit more carefully, and note that it is true for any [mathjaxinline]l[/mathjaxinline], including [mathjaxinline]l = L[/mathjaxinline]: </p><table id="a0000000010" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\underbrace{\frac{\partial \text {loss}}{\partial W^ l}}_{m^ l \times n^ l} = \underbrace{A^{l-1}}_{m^ l \times 1} \; \underbrace{\left(\frac{\partial \text {loss}}{\partial Z^ l}\right)^ T}_{1 \times n^ l}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none;text-align:right">(1.1)</td></tr></table><p>
Yay! So, in order to find the gradient of the loss with respect to the weights in the other layers of the network, we just need to be able to find [mathjaxinline]\partial \text {loss}/\partial {Z^ l}[/mathjaxinline]. </p><p>
If we repeatedly apply the chain rule, we get this expression for the gradient of the loss with respect to the pre-activation in the first layer: </p><table id="a0000000011" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \text {loss}}{\partial Z^1} = \underbrace{\underbrace{ \frac{\partial \text {loss}}{\partial A^ L} \cdot \frac{\partial A^ L}{\partial Z^ L} \cdot \frac{\partial Z^ L}{\partial A^{L-1}} \cdot \frac{\partial A^{L-1}}{\partial Z^{L-1}} \cdot \cdots \cdot \frac{\partial A^2}{\partial Z^2}}_{\partial \text {loss} / \partial Z^2} \cdot \frac{\partial Z^2}{\partial A^1}} _{\partial \text {loss} / \partial A^1} \cdot \frac{\partial A^1}{\partial Z^1} \; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none;text-align:right">(1.2)</td></tr></table><p>
This derivation was informal, to show you the general structure of the computation. In fact, to get the dimensions to all work out, we just have to write it backwards! Let's first understand more about these quantities: </p><ul class="itemize"><li><p>
[mathjaxinline]\partial \text {loss}/\partial A^ L[/mathjaxinline] is [mathjaxinline]n^ L \times 1[/mathjaxinline] and depends on the particular loss function you are using. </p></li><li><p>
[mathjaxinline]\partial Z^ l / \partial A^{l-1}[/mathjaxinline] is [mathjaxinline]m^ l \times n^ l[/mathjaxinline] and is just [mathjaxinline]W^ l[/mathjaxinline] (you can verify this by computing a single entry [mathjaxinline]\partial Z^ l_ i / \partial A^{l-1}_ j[/mathjaxinline]). </p></li><li><p>
[mathjaxinline]\partial A^ l/\partial Z^ l[/mathjaxinline] is [mathjaxinline]n^ l \times n^ l[/mathjaxinline]. It's a little tricky to think about. Each element [mathjaxinline]a_ i^ l = f^ l(z_ i^ l)[/mathjaxinline]. This means that [mathjaxinline]\partial a_ i^ l / \partial z_ j^ l = 0[/mathjaxinline] whenever [mathjaxinline]i \not= j[/mathjaxinline]. So, the off-diagonal elements of [mathjaxinline]\partial A^ l/\partial Z^ l[/mathjaxinline] are all 0, and the diagonal elements are [mathjaxinline]\partial a_ i^ l / \partial z_ i^ l = {f^ l}'(z_ i^ l)[/mathjaxinline]. </p></li></ul><p>
Now, we can rewrite equation 1.2 so that the quantities match up as </p><table id="a0000000012" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \text {loss}}{\partial Z^ l} = \frac{\partial A^ l}{\partial Z^ l} \cdot W^{l+1} \cdot \frac{\partial A^{l+1}}{\partial Z^{l+1}} \cdots W^{L-1} \cdot \frac{\partial A^{L-1}}{\partial Z^{L-1}} \cdot W^{L} \cdot \frac{\partial A^{L}}{\partial Z^{L}} \cdot \frac{\partial \text {loss}}{\partial A^ L}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none;text-align:right">(1.3)</td></tr></table><p>
Using equation 1.3 to compute [mathjaxinline]\partial \text {loss}/\partial {Z^ l}[/mathjaxinline], combined with equation 1.1, lets us find the gradient of the loss with respect to any of the weight matrices. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Apply the same reasoning to find the gradients of [mathjaxinline]\text {loss}[/mathjaxinline] with respect to [mathjaxinline]W_0^ l[/mathjaxinline].</span> <br/></p>
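<p>As a concrete (and hedged) illustration of equations 1.1 and 1.3, here is a numpy sketch of the backward pass using squared loss as an example; the list-based bookkeeping and names are our own assumptions, and [mathjaxinline]W_0[/mathjaxinline] is left out, as in the derivation above.</p><pre><code># As and Zs hold A^1..A^L and Z^1..Z^L from a forward pass (Python index l is
# layer l+1 in the notes' numbering); x plays the role of A^0.
# dfs[l](Z) returns the element-wise derivatives f^l'(Z).
def backward(x, y, As, Zs, Ws, dfs):
    grads = []
    dloss_dA = 2 * (As[-1] - y)              # dloss/dA^L for squared loss
    for l in reversed(range(len(Ws))):
        dloss_dZ = dfs[l](Zs[l]) * dloss_dA  # dA^l/dZ^l is diagonal: element-wise product
        A_prev = As[l - 1] if l > 0 else x
        grads.append(A_prev @ dloss_dZ.T)    # equation 1.1: dloss/dW^l = A^{l-1} (dloss/dZ^l)^T
        dloss_dA = Ws[l] @ dloss_dZ          # step of equation 1.3: dloss/dA^{l-1} = W^l dloss/dZ^l
    return list(reversed(grads))             # dloss/dW^1, ..., dloss/dW^L
</code></pre><p>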
This general process is called <em>error back-propagation</em>. The idea is that we first do a <em>forward pass</em> to compute all the [mathjaxinline]a[/mathjaxinline] and [mathjaxinline]z[/mathjaxinline] values at all the layers, and finally the actual loss on this example. Then, we can work backward and compute the gradient of the loss with respect to the weights in each layer, starting at layer [mathjaxinline]L[/mathjaxinline] and going back to layer 1. <span options="" class="marginote"><span class="marginote_desc" style="display:none">I like to think of this as “blame propagation". You can think of [mathjaxinline]\text {loss}[/mathjaxinline] as how mad we are about the prediction that the network just made. Then [mathjaxinline]\partial \text {loss}/ \partial A^ L[/mathjaxinline] is how much we blame [mathjaxinline]A^ L[/mathjaxinline] for the loss. The last module has to take in [mathjaxinline]\partial \text {loss}/ \partial A^ L[/mathjaxinline] and compute [mathjaxinline]\partial \text {loss}/ \partial Z^ L[/mathjaxinline], which is how much we blame [mathjaxinline]Z^ L[/mathjaxinline] for the loss. The next module (working backwards) takes in [mathjaxinline]\partial \text {loss}/ \partial Z^ L[/mathjaxinline] and computes [mathjaxinline]\partial \text {loss}/ \partial A^{L-1}[/mathjaxinline]. So every module is accepting its blame for the loss, computing how much of it to allocate to each of its inputs, and passing the blame back to them.</span><span>note</span></span> </p><center><p><img src="/assets/courseware/v1/b1b796d42ff1b50da8275c2f19070533/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_error_back-propagation_tikzpicture_1-crop.png" width="893"/></p></center><p>
If we view our neural network as a sequential composition of modules (in our work so far, it has been an alternation between a linear transformation with a weight matrix, and a component-wise application of a non-linear activation function), then we can define a simple API for a module that will let us compute the forward and backward passes, as well as do the necessary weight updates for gradient descent. Each module has to provide the following “methods” (a small sketch in code follows this list). We are already using letters [mathjaxinline]a, x, y, z[/mathjaxinline] with particular meanings, so here we will use [mathjaxinline]u[/mathjaxinline] as the vector input to the module and [mathjaxinline]v[/mathjaxinline] as the vector output: </p><ul class="itemize"><li><p>
forward: [mathjaxinline]u \rightarrow v[/mathjaxinline] </p></li><li><p>
backward: [mathjaxinline]u, v, \partial L / \partial v \rightarrow \partial L / \partial u[/mathjaxinline] </p></li><li><p>
weight grad: [mathjaxinline]u, \partial L / \partial v \rightarrow \partial L / \partial W[/mathjaxinline] only needed for modules that have weights [mathjaxinline]W[/mathjaxinline] </p></li></ul><p>
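Here is a minimal numpy sketch of this API for the two module types we have used so far, a linear module and an element-wise tanh module; the class and method names are illustrative (the homework's interface may differ), and each module caches what it needs from its forward pass instead of taking [mathjaxinline]u[/mathjaxinline] and [mathjaxinline]v[/mathjaxinline] as arguments to backward. </p><pre><code>import numpy as np

class Linear:
    def __init__(self, m, n):
        self.W = np.random.randn(m, n) * 0.1   # m x n weight matrix
        self.W0 = np.zeros((n, 1))             # n x 1 offset

    def forward(self, u):                      # u -> v
        self.u = u                             # cache the input for the backward pass
        return self.W.T @ u + self.W0

    def backward(self, dL_dv):                 # dL/dv -> dL/du
        self.dL_dW = self.u @ dL_dv.T          # weight grad, equation 1.1
        self.dL_dW0 = dL_dv
        return self.W @ dL_dv                  # dL/du = W dL/dZ

    def sgd_step(self, lr):                    # gradient-descent weight update
        self.W -= lr * self.dL_dW
        self.W0 -= lr * self.dL_dW0

class Tanh:
    def forward(self, u):                      # u -> v
        self.v = np.tanh(u)
        return self.v

    def backward(self, dL_dv):                 # dA/dZ is diagonal, so multiply element-wise
        return (1.0 - self.v ** 2) * dL_dv
</code></pre><p>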
In homework we will ask you to implement these modules for neural network components, and then use them to construct a network and train it as described in the next section. </p><p>
</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05f_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - backprop with the chain rule</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05f">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05f" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - backprop with the chain rule</h3>
<div
id="video_MIT6036L05f"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:B9BHcTxUMMc", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05f/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05f/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05f/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05f/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L05f"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05g_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - weight initialization</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05g">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05g" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - weight initialization</h3>
<div
id="video_MIT6036L05g"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:xOy85gT5wOo", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05g/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05g/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05g/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05g/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L05g"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_training_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Training</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_training">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_training" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Here we go! Here's how to do stochastic gradient descent training on a feed-forward neural network. After this pseudo-code, we motivate the choice of initialization in lines 2 and 3. The actual computation of the gradient values (e.g. [mathjaxinline]\partial \text {loss}/ \partial A^ L[/mathjaxinline]) is not directly defined in this code, because we want to make the structure of the computation clear. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">What is [mathjaxinline]\partial Z^ l / \partial W^ l[/mathjaxinline]? </span> <br/> <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Which terms in the code below depend on [mathjaxinline]f^ L[/mathjaxinline]?</span> <br/></p><p><img src="/assets/courseware/v1/2275d12644e58f99416e8653f1fb2498/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_training_codebox_1-crop.png" width="809"/></p><p>
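The pseudo-code above appears only as an image, so here is a hedged sketch of the same overall structure in Python/numpy, reusing the illustrative module objects sketched earlier; the names train_sgd, modules, and loss are ours, and loss is assumed to be an object whose backward method returns the gradient of the loss with respect to the network's output. </p><pre>
import numpy as np

def train_sgd(modules, loss, X, Y, lrate=0.01, iters=10000):
    """X is d x n (one column per example); Y holds the corresponding targets.

    modules is a list alternating linear modules and activation modules,
    already initialized as described below.
    """
    n = X.shape[1]
    for t in range(iters):
        i = np.random.randint(n)              # pick a training example at random
        Xi, Yi = X[:, i:i + 1], Y[:, i:i + 1]

        # forward pass: compute all the A and Z values, layer by layer
        A = Xi
        for m in modules:
            A = m.forward(A)

        # backward pass: start with dloss/dA^L and propagate the "blame" back
        grad = loss.backward(A, Yi)
        for m in reversed(modules):
            grad = m.backward(grad)

        # gradient-descent step in every module that has weights
        for m in modules:
            m.sgd_step(lrate)
</pre><p>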
Initializing [mathjaxinline]W[/mathjaxinline] is important; if you do it badly, there is a good chance the neural network won't train well. First, it is important to initialize the weights to random values. We want different parts of the network to tend to “address" different aspects of the problem; if they all start at the same weights, the symmetry will often keep the values from moving in useful directions. Second, many of our activation functions have (near) zero slope when the pre-activation [mathjaxinline]z[/mathjaxinline] values have large magnitude, so we generally want to keep the initial weights small; that way the gradients are non-zero and gradient descent has some useful signal about which way to go. </p><p>
One good general-purpose strategy is to choose each weight at random from a Gaussian (normal) distribution with mean 0 and standard deviation [mathjaxinline](1/m)[/mathjaxinline] where [mathjaxinline]m[/mathjaxinline] is the number of inputs to the unit. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">If the input [mathjaxinline]x[/mathjaxinline] to this unit is a vector of 1's, what would the expected pre-activation [mathjaxinline]z[/mathjaxinline] value be with these initial weights?</span> <br/>We write this choice (where [mathjaxinline]\sim[/mathjaxinline] means “is drawn randomly from the distribution") </p><table id="a0000000013" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]W^ l_{ij} \sim \text {Gaussian}\left(0, \frac{1}{m^ l}\right)\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
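As a concrete illustration (the function name init_layer is ours), drawing the weights for one layer this way in Python/numpy might look like the following, reading the [mathjaxinline]1/m^ l[/mathjaxinline] above as the standard deviation, as in the text: </p><pre>
import numpy as np

def init_layer(m, n):
    """Initialize an m-input, n-output layer: each W[i, j] ~ Gaussian(0, 1/m)."""
    W = np.random.normal(loc=0.0, scale=1.0 / m, size=(m, n))  # std = 1/m
    W0 = np.zeros((n, 1))        # offsets are often simply started at zero
    return W, W0
</pre><p>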
It will often turn out (especially for fancier activations and loss functions) that computing </p><table id="a0000000014" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \text {loss}}{\partial Z^ L}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
is easier than computing </p><table id="a0000000015" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \text {loss}}{\partial A^ L}\; \; \text { and }\; \; \frac{\partial A^ L}{\partial Z^ L}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
So, we may instead ask for an implementation of a loss function to provide a backward method that computes [mathjaxinline]\partial \text {loss}/\partial Z^ L[/mathjaxinline] directly. </p><p>
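For example, anticipating the NLL loss defined below: when [mathjaxinline]f^ L[/mathjaxinline] is a sigmoid and the loss is negative log likelihood, the product [mathjaxinline]\partial \text {loss}/\partial A^ L \cdot \partial A^ L/\partial Z^ L[/mathjaxinline] simplifies to [mathjaxinline]A^ L - y[/mathjaxinline]. A minimal sketch of such a fused final-activation-plus-loss module (the class name SigmoidNLL is our own): </p><pre>
import numpy as np

class SigmoidNLL:
    """Final sigmoid activation fused with the NLL (log) loss for one output unit."""
    def forward(self, Z):
        self.A = 1.0 / (1.0 + np.exp(-Z))   # A^L = sigmoid(Z^L)
        return self.A

    def loss(self, A, Y):
        return float(np.sum(-(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A))))

    def backward(self, A, Y):
        return A - Y                         # dloss/dZ^L, computed directly
</pre>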
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L05h_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - output layer activation functions</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05h">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05h" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - output layer activation functions</h3>
<div
id="video_MIT6036L05h"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:AtYktvqEgNA", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05h/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05h/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05h/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L05h/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L05h"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_loss_functions_and_activation_functions_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Loss functions and activation functions</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_loss_functions_and_activation_functions">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ce7a329e05cb11f0a1780affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_loss_functions_and_activation_functions" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Different loss functions make different assumptions about the range of values they will receive as input and, as we have seen, different activation functions produce output values in different ranges. When you are designing a neural network, it's important to make these things fit together well. In particular, we will think about matching loss functions with the activation function in the last layer, [mathjaxinline]f^ L[/mathjaxinline]. Here is a table of loss functions and activations that make sense for them: </p><center><table cellspacing="0" class="tabular" style="table-layout:auto"><tr><td style="text-align:center; border:none">
Loss </td><td style="text-align:center; border:none">
[mathjaxinline]f^ L[/mathjaxinline] </td></tr><tr><td style="border-top-style:solid; border-top-color:black; border-top-width:1px; text-align:center; border:none">
squared </td><td style="border-top-style:solid; border-top-color:black; border-top-width:1px; text-align:center; border:none">
linear </td></tr><tr><td style="text-align:center; border:none">
hinge </td><td style="text-align:center; border:none">
linear </td></tr><tr><td style="text-align:center; border:none">
[mathjaxinline]\text {NLL}[/mathjaxinline] </td><td style="text-align:center; border:none">
sigmoid </td></tr><tr><td style="text-align:center; border:none">
[mathjaxinline]\text {NLLM}[/mathjaxinline] </td><td style="text-align:center; border:none">
softmax </td></tr></table></center><p>
But what is NLL? </p><p><h3>Two-class classification and log likelihood</h3></p><p>
For classification, the natural loss function is 0-1 loss, but we have already discussed the fact that it's very inconvenient for gradient-based learning: its derivative is zero everywhere it is defined (and undefined at the decision boundary), so it gives gradient descent no useful signal. Hinge loss gives us a way, for binary classification problems, to make a smoother objective. An alternative loss function that has a nice probabilistic interpretation, is in popular use, and extends nicely to multi-class classification, is called <em>negative log likelihood</em> (<i class="sc">nll</i>). We will discuss it first in the two-class case, and then generalize to multiple classes. </p><p>
Let's assume that the activation function on the output layer is a sigmoid and that there is a single unit in the output layer, so the output of the whole neural network is a scalar, [mathjaxinline]a^ L[/mathjaxinline]. Because [mathjaxinline]f^ L[/mathjaxinline] is a sigmoid, we know [mathjaxinline]a^ L \in [0, 1][/mathjaxinline], and we can interpret it as the probability that the input [mathjaxinline]x[/mathjaxinline] is a positive example. Let us further assume that the labels in the training data are [mathjaxinline]y \in \{ 0, 1\}[/mathjaxinline], so they can also be interpreted as probabilities. </p><p>
We might want to pick the parameters of our network to maximize the probability that the network assigns the correct labels to all the points. That would be </p><table id="a0000000016" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\prod _{i = 1}^ n \begin{cases} a^{(i)} & \text {if $y^{(i)} = 1$} \\ 1 - a^{(i)} & \text {otherwise} \end{cases}\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
under the assumption that our predictions are independent. This can be cleverly rewritten as </p><table id="a0000000017" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\prod _{i = 1}^ n {a^{(i)}}^{y^{(i)}}(1 - a^{(i)})^{1 - y^{(i)}}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Be sure you can see why these two expressions are the same.</span> <br/></p><p>
Now, because products are kind of hard to deal with, and because the log function is monotonic, the [mathjaxinline]W[/mathjaxinline] that maximizes the log of this quantity will be the same as the [mathjaxinline]W[/mathjaxinline] that maximizes the original, so we can try to maximize </p><table id="a0000000018" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\sum _{i = 1}^ n {y^{(i)}}\log {a^{(i)}} + (1 - y^{(i)})\log (1 - a^{(i)})\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
which we can write in terms of a loss function </p><table id="a0000000019" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\sum _{i = 1}^ n \mathcal{L}_\text {nll}(a^{(i)}, y^{(i)})[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]\mathcal{L}_\text {nll}[/mathjaxinline] is the <em>negative log likelihood</em> loss function: </p><table id="a0000000020" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\mathcal{L}_\text {nll}(\text {guess},\text {actual}) = -\left(\text {actual}\cdot \log (\text {guess}) + (1 - \text {actual})\cdot \log (1 - \text {guess})\right) \; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
This loss function is also sometimes referred to as the <em>log loss</em> or <span options="" class="marginote"><span class="marginote_desc" style="display:none">You can use any base for the logarithm and it won't make any real difference. If we ask you for numbers, use log base [mathjaxinline]e[/mathjaxinline].</span><span><em>cross entropy</em>. </span></span> </p><p><h3>Multi-class classification and log likelihood</h3> We can extend this idea directly to multi-class classification with [mathjaxinline]K[/mathjaxinline] classes, where the training label is represented with the one-hot vector [mathjaxinline]y=\begin{bmatrix} y_1, \ldots , y_ K \end{bmatrix}^ T[/mathjaxinline], where [mathjaxinline]y_ k=1[/mathjaxinline] if the example is of class [mathjaxinline]k[/mathjaxinline]. Assume that our network uses <em>softmax</em> as the activation function in the last layer, so that the output is [mathjaxinline]a=\begin{bmatrix} a_1, \ldots , a_ K \end{bmatrix}^ T[/mathjaxinline], which represents a probability distribution over the [mathjaxinline]K[/mathjaxinline] possible classes. Then, the probability that our network predicts the correct class for this example is [mathjaxinline]\prod _{k=1}^ K a_ k^{y_ k}[/mathjaxinline] and the log of the probability that it is correct is [mathjaxinline]\sum _{k=1}^ K y_ k \log a_ k[/mathjaxinline], so </p><table id="a0000000021" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\mathcal{L}_\text {nllm}(\text {guess},\text {actual}) = - \sum _{k=1}^ K \text {actual}_ k \cdot \log (\text {guess}_ k) \; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
We'll call this <i class="sc">nllm</i> for <em>negative log likelihood multiclass.</em> <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Show that [mathjaxinline]\mathcal{L}_\text {nllm}[/mathjaxinline] for [mathjaxinline]K = 2[/mathjaxinline] is the same as [mathjaxinline]\mathcal{L}_\text {nll}[/mathjaxinline]. </span> <br/></p><p>
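As an illustration (the function names softmax and nllm are ours), here is how the multi-class loss above could be computed in Python/numpy for a single example with a softmax output and a one-hot label: </p><pre>
import numpy as np

def softmax(z):
    """Turn a K-vector of pre-activations into a probability distribution."""
    e = np.exp(z - np.max(z))     # subtracting the max improves numerical stability
    return e / np.sum(e)

def nllm(guess, actual):
    """L_nllm(guess, actual) = -sum_k actual_k * log(guess_k)."""
    return float(-np.sum(actual * np.log(guess)))

# Example with K = 3 classes; the example belongs to class 2 (one-hot label).
z = np.array([1.0, 2.0, 0.5])
a = softmax(z)
y = np.array([0.0, 1.0, 0.0])
print(nllm(a, y))                 # equals -log(a[1])
</pre>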
</div>
</div>
</div>
</div>
© All Rights Reserved