<h2 class="hd hd-2 unit-title">Notes – Chapter 12: Recurrent Neural Networks</h2>
<p>
You can sequence through the Recurrent Neural Networks lecture video and note segments (go to Next page). </p><p>
You can also (or alternatively) download the <a href="/assets/courseware/v1/0de27572f5d771b35ad094df49a8e200/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Recurrent_Neural_Networks.pdf" target="_blank">Chapter 12: Recurrent Neural Networks</a> notes as a PDF file. </p>
<h2 class="hd hd-2 unit-title">Lecture: Recurrent neural network model</h2>
<p>(Lecture video: <a href="https://www.youtube.com/watch?v=BIZRv1goajo" target="_blank">Recurrent neural network model</a>.)</p>
<h2 class="hd hd-2 unit-title">Introduction to RNNs</h2>
<p>
In chapter 8 we studied neural networks and how we can train the weights of a network, based on data, so that it will adapt into a function that approximates the relationship between the [mathjaxinline](x, y)[/mathjaxinline] pairs in a supervised-learning training set. In section 1 of chapter 10, we studied state-machine models and defined <em>recurrent neural networks</em> (<i class="sc">rnn</i>s) as a particular type of state machine, with a multidimensional vector of real values as the state. In this chapter, we'll see how to use gradient-descent methods to train the weights of an <i class="sc">rnn</i> so that it performs a <em>transduction</em> that matches as closely as possible a training set of input-output <em>sequences</em>. </p><p>
</p>
<h2 class="hd hd-2 unit-title">RNN model</h2>
<p>
Recall that the basic operation of the state machine is to start with some state [mathjaxinline]s_0[/mathjaxinline], then iteratively compute: </p><table id="a0000000002" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000003"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle s_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f(s_{t - 1}, x_ t)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000004"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle y_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = g(s_ t)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
as illustrated in the diagram below (remembering that there needs to be a delay on the feedback loop): </p><center><p>
<img src="/assets/courseware/v1/72aedabc93c3ec4e11fef1ebbc1f946e/asset-v1:MITx+6.036+1T2019+type@asset+block/images_rnn_rnn_model_tikzpicture_1-crop.png" width="520"/></p></center><p>
So, given a sequence of inputs [mathjaxinline]x_1, x_2, \dots[/mathjaxinline], the machine generates a sequence of outputs </p><table id="a0000000005" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\underbrace{g(f(s_0, x_1))}_{y_1}, \underbrace{g(f(f(s_0, x_1), x_2))}_{y_2}, \dots \; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
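For concreteness, here is a minimal Python sketch of this transduction loop; the function name <code>transduce</code> and its signature are illustrative assumptions, not notation from the notes. </p><pre><code>
def transduce(f, g, s0, xs):
    """Run the state machine s_t = f(s_{t-1}, x_t), y_t = g(s_t)
    over the input sequence xs and return the output sequence."""
    s, ys = s0, []
    for x in xs:
        s = f(s, x)        # state update
        ys.append(g(s))    # output for this step
    return ys

# e.g., a running-sum machine:
# transduce(lambda s, x: s + x, lambda s: s, 0, [1, 2, 3]) == [1, 3, 6]
</code></pre><p>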
A <i class="it">recurrent neural network</i> is a state machine with neural networks constituting functions [mathjaxinline]f[/mathjaxinline] and [mathjaxinline]g[/mathjaxinline]: </p><table id="a0000000006" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000007"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle f(s, x)[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f_1(W^{sx}x + W^{ss}s + W^{ss}_0)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000008"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle g(s)[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f_2(W^ Os + W^ O_0) \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>(Note: we are very sorry! This course material has evolved from different sources, which used [mathjaxinline]W^ Tx[/mathjaxinline] in the forward pass for regular feedforward NNs and [mathjaxinline]Wx[/mathjaxinline] for the forward pass in <i class="sc">rnn</i>s. This inconsistency doesn't make any technical difference, but is a potential source of confusion.) The inputs, outputs, and states are all vector-valued: </p><table id="a0000000009" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000010"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle x_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : \ell \times 1[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000011"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle s_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : m \times 1[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000012"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle y_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : v \times 1 \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
The weights in the network, then, are </p><table id="a0000000013" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000014"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W^{sx}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : m \times \ell[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000015"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W^{ss}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : m \times m[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000016"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W^{ss}_0[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : m \times 1[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000017"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W^{O}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : v \times m[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000018"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W^{O}_0[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle : v \times 1[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
with activation functions [mathjaxinline]f_1[/mathjaxinline] and [mathjaxinline]f_2[/mathjaxinline]. Finally, the operation of the <i class="sc">rnn</i> is described by </p><table id="a0000000019" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000020"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle s_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f_1\left(W^{sx}x_ t + W^{ss}s_{t - 1} + W_0\right)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000021"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle y_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f_2\left(W^ Os_ t + W_0^ O\right) \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Check dimensions here to be sure it all works out. Remember that we apply [mathjaxinline]f_1[/mathjaxinline] and [mathjaxinline]f_2[/mathjaxinline] elementwise.</span> <br/></p><p>
</p>
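<p>
To make the pieces above concrete, here is a minimal numpy sketch of one step of this <i class="sc">rnn</i>. It is only an illustration: the names (<code>rnn_step</code>, <code>Wsx</code>, <code>Wss</code>, <code>W0</code>, <code>WO</code>, <code>WO0</code>) and the choices of tanh for [mathjaxinline]f_1[/mathjaxinline] and the identity for [mathjaxinline]f_2[/mathjaxinline] are our own assumptions, not prescribed by the notes. </p>
<pre><code>
import numpy as np

def rnn_step(x_t, s_prev, Wsx, Wss, W0, WO, WO0,
             f1=np.tanh, f2=lambda z: z):
    """One RNN step: s_t = f1(Wsx x_t + Wss s_{t-1} + W0), y_t = f2(WO s_t + WO0)."""
    s_t = f1(Wsx @ x_t + Wss @ s_prev + W0)   # state, shape (m, 1)
    y_t = f2(WO @ s_t + WO0)                  # output, shape (v, 1)
    return s_t, y_t

# Dimension check with l = 2 inputs, m = 3 state units, v = 1 output:
l, m, v = 2, 3, 1
rng = np.random.default_rng(0)
Wsx, Wss, W0 = rng.normal(size=(m, l)), rng.normal(size=(m, m)), rng.normal(size=(m, 1))
WO, WO0 = rng.normal(size=(v, m)), rng.normal(size=(v, 1))
s0, x1 = np.zeros((m, 1)), rng.normal(size=(l, 1))
s1, y1 = rnn_step(x1, s0, Wsx, Wss, W0, WO, WO0)
print(s1.shape, y1.shape)   # (3, 1) (1, 1)
</code></pre>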
<h2 class="hd hd-2 unit-title">Lecture: Sequence-to-sequence RNN</h2>
<p>(Lecture video: <a href="https://www.youtube.com/watch?v=KfEVnCsCrN4" target="_blank">Sequence-to-sequence RNN</a>.)</p>
<h2 class="hd hd-2 unit-title">Sequence-to-sequence RNN</h2>
<p>
Now, how can we train an <i class="sc">rnn</i> to model a transduction on sequences? This problem is sometimes called <em>sequence-to-sequence</em> mapping. You can think of it as a kind of regression problem: given an input sequence, learn to generate the corresponding output sequence. (Note: one way to think of training a sequence <b class="bf">classifier</b> is to reduce it to a transduction problem, where [mathjaxinline]y_ t = 1[/mathjaxinline] if the sequence [mathjaxinline]x_1, \ldots , x_ t[/mathjaxinline] is a <em>positive</em> example of the class of sequences and [mathjaxinline]-1[/mathjaxinline] otherwise.) </p><p>
A training set has the form [mathjaxinline]\left[\left(x^{(1)}, y^{(1)}\right), \dots , \left(x^{(q)}, y^{(q)}\right)\right][/mathjaxinline], where </p><ul class="itemize"><li><p>
[mathjaxinline]x^{(i)}[/mathjaxinline] and [mathjaxinline]y^{(i)}[/mathjaxinline] are length [mathjaxinline]n^{(i)}[/mathjaxinline] sequences; </p></li><li><p>
sequences in the <i class="it">same pair</i> are the same length; and sequences in different pairs may have different lengths. </p></li></ul><p>
Next, we need a loss function. We start by defining a loss function on sequences. There are many possible choices, but usually it makes sense just to sum up a per-element loss function on each of the output values, where [mathjaxinline]p[/mathjaxinline] is the predicted sequence and [mathjaxinline]y[/mathjaxinline] is the actual one: </p><table id="a0000000022" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\text {Loss}_{\text {seq}}\left(p^{(i)}, y^{(i)}\right) = \sum _{t = 1}^{n^{(i)}}\text {Loss}_\text {elt}\left(p_ t^{(i)}, y_ t^{(i)}\right) \; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
The per-element loss function [mathjaxinline]\text {Loss}_\text {elt}[/mathjaxinline] will depend on the type of [mathjaxinline]y_ t[/mathjaxinline] and what information it is encoding, in the same way as for a supervised network (it could be <i class="sc">nll</i>, hinge loss, squared loss, etc.). Then, letting [mathjaxinline]\theta =\left(W^{sx}, W^{ss}, W^ O, W_0, W_0^ O\right)[/mathjaxinline], our overall objective is to minimize </p><table id="a0000000023" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]J(\theta ) = \sum _{i = 1}^ q\text {Loss}_{\text {seq}}\left( \text {RNN}(x^{(i)};\theta ), y^{(i)}\right) \; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]\text {RNN}(x; \theta )[/mathjaxinline] is the output sequence generated, given input sequence [mathjaxinline]x[/mathjaxinline]. </p><p>
It is typical to choose [mathjaxinline]f_1[/mathjaxinline] to be <i class="it">tanh</i> (remember that it looks like a sigmoid but ranges from -1 to +1), but any non-linear activation function is usable. We choose [mathjaxinline]f_2[/mathjaxinline] to align with the types of our outputs and the loss function, just as we would do in regular supervised learning. </p><p>
</p>
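<p>
As a concrete illustration of this objective, here is a small numpy sketch that runs the <i class="sc">rnn</i> over one input sequence, sums a per-element loss, and then sums that over a data set. The helper names (<code>rnn_seq_loss</code>, <code>J</code>) and the use of squared loss with [mathjaxinline]f_1 = \tanh[/mathjaxinline] and [mathjaxinline]f_2[/mathjaxinline] = identity are illustrative assumptions, not part of the notes. </p>
<pre><code>
import numpy as np

def rnn_seq_loss(xs, ys, s0, Wsx, Wss, W0, WO, WO0):
    """Loss_seq for one (x, y) pair: run the RNN over xs and sum
    Loss_elt(p_t, y_t) over t; squared loss is used per element here."""
    s, total = s0, 0.0
    for x_t, y_t in zip(xs, ys):
        s = np.tanh(Wsx @ x_t + Wss @ s + W0)      # s_t = f1(z_t^1)
        p_t = WO @ s + WO0                         # p_t = f2(z_t^2), f2 = identity
        total += float(np.sum((p_t - y_t) ** 2))   # Loss_elt = squared loss
    return total

def J(data, s0, Wsx, Wss, W0, WO, WO0):
    """Overall objective: sum of Loss_seq over all training pairs (xs, ys)."""
    return sum(rnn_seq_loss(xs, ys, s0, Wsx, Wss, W0, WO, WO0) for xs, ys in data)
</code></pre>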
<h2 class="hd hd-2 unit-title">Lecture: Back-propagation through time - forward pass</h2>
<p>(Lecture video: <a href="https://www.youtube.com/watch?v=HaUxM6PvRQ4" target="_blank">Back-propagation through time - forward pass</a>.)</p>
<h2 class="hd hd-2 unit-title">Lecture: Back-propagation through time - backwards pass</h2>
<p>(Lecture video: <a href="https://www.youtube.com/watch?v=I3bJyH1j0vQ" target="_blank">Back-propagation through time - backwards pass</a>.)</p>
<h2 class="hd hd-2 unit-title">Lecture: Back-propagation through time - weight updates</h2>
<p>(Lecture video: <a href="https://www.youtube.com/watch?v=Yt6bPJ-WRtI" target="_blank">Back-propagation through time - weight updates</a>.)</p>
<h2 class="hd hd-2 unit-title">Back-propagation through time</h2>
<p>
Now the fun begins! We can find [mathjaxinline]\theta[/mathjaxinline] to minimize [mathjaxinline]J[/mathjaxinline] using gradient descent. We will work through the simplest method, <em>back-propagation through time</em> (<i class="sc">bptt</i>), in detail. This is generally not the best method to use, but it's relatively easy to understand. Later, we will sketch alternative methods that are in much more common use. </p><p><div style="border-radius:10px;padding:5px;border-style:solid;background-color:rgba(0,255,0,0.03);" class="examplebox"><p><b class="bf">Calculus reminder: total derivative</b> Most of us are not very careful about the difference between the <em>partial derivative</em> and the <em>total derivative</em>. We are going to use a nice example from the Wikipedia article on partial derivatives to illustrate the difference. </p><p>
The volume of a cone depends on its height and radius: </p><table id="a0000000024" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]V(r, h) = \frac{\pi r^2 h}{3}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
The partial derivatives of volume with respect to height and radius are </p><table id="a0000000025" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial V}{\partial r} = \frac{2\pi r h}{3}\; \; \; \text {and}\; \; \; \frac{\partial V}{\partial h} = \frac{\pi r^2}{3}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
They measure the change in [mathjaxinline]V[/mathjaxinline] assuming everything is held constant except the single variable we are changing. But! In a cone, the radius and height are not independent, and so we can't really change one without changing the other. In this case, we really have to think about the <em>total derivative</em>, which sums the “paths” along which [mathjaxinline]r[/mathjaxinline] might influence [mathjaxinline]V[/mathjaxinline]: </p><table id="a0000000026" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000027"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{dV}{dr}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial V}{\partial r} + \frac{\partial V}{\partial h} \frac{dh}{dr}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000028"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{2 \pi r h}{3} + \frac{\pi r^2}{3} \frac{dh}{dr}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000029"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{dV}{dh}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial V}{\partial h} + \frac{\partial V}{\partial r} \frac{dr}{dh}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000030"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\pi r^2}{3} + \frac{2 \pi r h}{3} \frac{dr}{dh}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Just to be completely concrete, let's think of a right circular cone with a fixed angle [mathjaxinline]\alpha = \tan ^{-1}(r / h)[/mathjaxinline], so that if we change [mathjaxinline]r[/mathjaxinline] or [mathjaxinline]h[/mathjaxinline] then [mathjaxinline]\alpha[/mathjaxinline] remains constant. So we have [mathjaxinline]r = h \tan \alpha[/mathjaxinline]; let constant [mathjaxinline]c = \tan \alpha[/mathjaxinline], so now [mathjaxinline]r = c h[/mathjaxinline]. Now, we know that </p><table id="a0000000031" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000032"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{dV}{dr}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{2 \pi r h}{3} + \frac{\pi r^2}{3} \frac{1}{c}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000033"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{dV}{dh}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\pi r^2}{3} + \frac{2 \pi r h}{3} c[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table></div></p><p>
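As a quick numerical sanity check of the total derivative above, the sketch below compares the formula for [mathjaxinline]dV/dh[/mathjaxinline] against a finite difference along the constraint [mathjaxinline]r = ch[/mathjaxinline]; the particular values of [mathjaxinline]c[/mathjaxinline], [mathjaxinline]h[/mathjaxinline], and the step size are arbitrary choices. </p><pre><code>
import numpy as np

c, h = 0.5, 2.0                                 # fixed shape: r = c * h
V = lambda h: np.pi * (c * h) ** 2 * h / 3      # volume along the constraint

r = c * h
analytic = np.pi * r ** 2 / 3 + (2 * np.pi * r * h / 3) * c   # dV/dh from the box above

eps = 1e-6
numeric = (V(h + eps) - V(h - eps)) / (2 * eps)  # central finite difference
print(analytic, numeric)                         # the two values agree closely
</code></pre><p>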
The <i class="sc">bptt</i> process goes like this: </p><ol class="enumerate"><li value="1"><p>
Sample a training pair of sequences [mathjaxinline](x, y)[/mathjaxinline]; let their length be [mathjaxinline]n[/mathjaxinline]. </p></li><li value="2"><p>
“Unroll" the RNN to be length [mathjaxinline]n[/mathjaxinline] (picture for [mathjaxinline]n = 3[/mathjaxinline] below), and initialize [mathjaxinline]s_0[/mathjaxinline]: </p><center><img src="/assets/courseware/v1/25cc32a872be8b6b54f11915628ba610/asset-v1:MITx+6.036+1T2019+type@asset+block/images_rnn_unrolled.png" width="400" style="scale : 0.45"/></center><p>
Now, we can see our problem as one of performing what is almost an ordinary back-propagation training procedure in a feed-forward neural network, but with the difference that the weight matrices are shared among the layers. In many ways, this is similar to what ends up happening in a convolutional network, except in the conv-net, the weights are re-used spatially, and here, they are re-used temporally. </p></li><li value="3"><p>
Do the <i class="it">forward pass</i>, to compute the predicted output sequence [mathjaxinline]p[/mathjaxinline]: </p><table id="a0000000034" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000035"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle z_ t^1[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W^{sx}x_ t + W^{ss}s_{t - 1} + W_0[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000036"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle s_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f_1(z_ t^1)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000037"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle z_ t^2[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W^ Os_ t + W_0^ O[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000038"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle p^ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = f_2(z_ t^2)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table></li><li value="4"><p>
Do the <em>backward pass</em> to compute the gradients. For both [mathjaxinline]W^{ss}[/mathjaxinline] and [mathjaxinline]W^{sx}[/mathjaxinline] we need to find </p><table id="a0000000039" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000040"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{d L_\text {seq}}{d W}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _{u = 1}^ n\frac{d L_ u}{d W} ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ \nonumber[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Letting [mathjaxinline]L_ u = L_\text {elt}(p_ u, y_ u)[/mathjaxinline] and using the <em>total derivative</em>, which is a sum over all the ways in which [mathjaxinline]W[/mathjaxinline] affects [mathjaxinline]L_ u[/mathjaxinline], we have </p><table id="a0000000041" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000042"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle ~ ~ ~ ~[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _{u = 1}^ n\sum _{t = 1}^ n\frac{\partial L_ u}{\partial s_ t}\cdot \frac{\partial s_ t}{\partial W} \nonumber[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Re-organizing, we have </p><table id="a0000000043" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000044"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle ~ ~ ~ ~[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _{t = 1}^ n\frac{\partial s_ t}{\partial W} \cdot \sum _{u = 1}^ n\frac{\partial L_ u}{\partial s_ t} \nonumber[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Because [mathjaxinline]s_ t\ \text {only affects}\ L_ t, L_{t + 1}, \dots , L_ n[/mathjaxinline], </p><table id="a0000000045" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000046"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle ~ ~ ~ ~[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _{t = 1}^ n\frac{\partial s_ t}{\partial W} \cdot \sum _{u = t}^ n\frac{\partial L_ u}{\partial s_ t} \nonumber[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000047"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _{t = 1}^ n\frac{\partial s_ t}{\partial W} \cdot \left(\frac{\partial L_ t}{\partial s_ t} + \underbrace{\sum _{u = t + 1}^ n\frac{\partial L_ u}{\partial s_ t}}_{\delta ^{s_ t}}\right)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none;text-align:right" class="eqnnum">(1.1)</td></tr></table><p>
[mathjaxinline]\delta ^{s_ t}[/mathjaxinline] is the dependence of the loss at steps after [mathjaxinline]t[/mathjaxinline] on the state at time [mathjaxinline]t[/mathjaxinline]; that is, [mathjaxinline]\delta ^{s_ t}[/mathjaxinline] is how much we can blame state [mathjaxinline]s_ t[/mathjaxinline] for all the future element losses. </p><p>
We can compute this backwards, with [mathjaxinline]t[/mathjaxinline] going from [mathjaxinline]n[/mathjaxinline] down to [mathjaxinline]1[/mathjaxinline]. The trickiest part is figuring out how early states contribute to later losses. We define <i class="it">future loss</i> </p><table id="a0000000048" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]F_ t = \sum _{u = t + 1}^{n}\text {Loss}_\text {elt}(p_ u, y_ u) \; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
so </p><table id="a0000000049" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\delta ^{s_ t} = \frac{\partial F_ t}{\partial s_ t}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
At the last stage, [mathjaxinline]F_ n = 0[/mathjaxinline] so [mathjaxinline]\delta ^{s_ n} = 0[/mathjaxinline]. </p><p>
Now, working backwards, </p><table id="a0000000050" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000051"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \delta ^{s_{t -1}}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial }{\partial s_{t - 1}}\sum _{u = t}^ n\text {Loss}_\text {elt}(p_ u, y_ u)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000052"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial s_ t}{\partial s_{t - 1}} \cdot \frac{\partial }{\partial s_ t}\sum _{u = t}^ n\text {Loss}_\text {elt}(p_ u, y_ u)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000053"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial s_ t}{\partial s_{t - 1}} \cdot \frac{\partial }{\partial s_ t}\left[\text {Loss}_\text {elt}(p_ t, y_ t) + \sum _{u = t + 1}^ n\text {Loss}_\text {elt}(p_ u, y_ u)\right][/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000054"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial s_ t}{\partial s_{t - 1}} \cdot \left[\frac{\partial \text {Loss}_\text {elt}(p_ t, y_ t)}{\partial s_ t} + \delta ^{s_ t}\right][/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Now, we can use the chain rule again to find the dependence of the element loss at time [mathjaxinline]t[/mathjaxinline] on the state at that same time, </p><table id="a0000000055" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\underbrace{\frac{\partial \text {Loss}_\text {elt}(p_ t, y_ t)}{\partial s_ t}}_{(m \times 1)} = \underbrace{\frac{\partial z_ t^2}{\partial s_ t}}_{(m \times v)} \cdot \underbrace{\frac{\partial \text {Loss}_\text {elt}(p_ t, y_ t)}{\partial z_ t^2}}_{(v \times 1)}\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
and the dependence of the state at time [mathjaxinline]t[/mathjaxinline] on the state at the previous time, noting that we are performing an <em>elementwise</em> multiplication between [mathjaxinline]{W^{ss}}^ T[/mathjaxinline] and the vector of [mathjaxinline]{f^1}'[/mathjaxinline] values, [mathjaxinline]\partial s_ t /\partial z^1_ t[/mathjaxinline] (there are two ways to think about [mathjaxinline]\partial s_ t / \partial z^1_ t[/mathjaxinline]: we can take the view that it is an [mathjaxinline]m \times 1[/mathjaxinline] vector and we multiply each column of [mathjaxinline]{W^{ss}}^ T[/mathjaxinline] by it, or, equally well, the view that it is an [mathjaxinline]m \times m[/mathjaxinline] diagonal matrix with the values along the diagonal, so that this operation is a matrix multiply; our software implementation will take the first view): </p><table id="a0000000056" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\underbrace{\frac{\partial s_ t}{\partial s_{t - 1}}}_{(m \times m)} = \underbrace{\frac{\partial z_ t^1}{\partial s_{t - 1}}}_{(m \times m)} \cdot \underbrace{\frac{\partial s_ t}{\partial z_ t^1}}_{(m \times 1)} = \underbrace{{W^{ss}}^ T * f^{1'}(z_ t^1)}_{\text {not dot!}}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
Putting this all together, we end up with </p><table id="a0000000057" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\delta ^{s_{t - 1}} = \underbrace{{W^{ss}}^ T * f^{1'}(z_ t^1)}_{\frac{\partial s_ t}{\partial s_{t - 1}}} \cdot \underbrace{\left({W^ O}^ T\frac{\partial {L_ t}}{\partial z_ t^2} + \delta ^{s_ t}\right)}_{\frac{\partial F_{t - 1}}{\partial s_ t}}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
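<p>
As a concrete, hedged illustration of this single backward step (our own variable names and a generic NumPy setup, not the course's released code; <code>f1prime</code> denotes [mathjaxinline]{f^1}'[/mathjaxinline] applied elementwise), the computation might look like: </p>
<pre>
import numpy as np

# One backward step: compute delta^{s_{t-1}} from delta^{s_t}.
# Assumed shapes: Wss (m x m), Wo (v x m), z1_t (m x 1), dLt_dz2 (v x 1), delta_st (m x 1).
def delta_prev(Wss, Wo, z1_t, dLt_dz2, delta_st, f1prime):
    bracket = Wo.T @ dLt_dz2 + delta_st        # (m x 1): dLoss_elt/ds_t + delta^{s_t}
    # (Wss^T * f1'(z1_t)) applied to the bracket; associating the elementwise
    # product with the bracket first avoids forming the full (m x m) Jacobian.
    return Wss.T @ (f1prime(z1_t) * bracket)   # (m x 1): delta^{s_{t-1}}
</pre>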
We're almost there! Now, we can describe the actual weight updates. Using the equation for [mathjaxinline]\delta ^{s_{t - 1}}[/mathjaxinline] above and recalling the definition of [mathjaxinline]\delta ^{s_ t} = \partial F_ t / \partial s_ t[/mathjaxinline], as we iterate backwards we can accumulate these terms to get the gradient for the whole loss: </p><table id="a0000000058" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000059"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{ d L_\text {seq}}{d W^{ss}}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle += \frac{\partial F_{t-1}}{\partial W^{ss}} = \frac{\partial z^1_ t}{\partial W^{ss}} \frac{\partial s_ t}{\partial z^1_ t} \frac{\partial F_{t-1}}{\partial s_ t}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000060"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{ d L_\text {seq}}{d W^{sx}}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle += \frac{\partial F_{t-1}}{\partial W^{sx}} = \frac{\partial z^1_ t}{\partial W^{sx}} \frac{\partial s_ t}{\partial z^1_ t} \frac{\partial F_{t-1}}{\partial s_ t}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
We can handle [mathjaxinline]W^ O[/mathjaxinline] separately; it's easier because it does not affect future losses in the way that the other weight matrices do: </p><table id="a0000000061" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial L_\text {seq}}{\partial W^ O} = \sum _{t = 1}^ n\frac{\partial L_ t}{\partial W^ O} = \sum _{t = 1}^ n\frac{\partial L_ t}{\partial z_ t^2} \cdot \frac{\partial z_ t^2}{\partial W^ O}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
Assuming we have [mathjaxinline]\frac{\partial L_ t}{\partial z_ t^2} = (p_ t - y_ t)[/mathjaxinline], (which ends up being true for squared loss, softmax-NLL, etc.), then on each iteration </p><table id="a0000000062" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\underbrace{\frac{\partial L_\text {seq}}{\partial W^ O}}_{v \times m} += \underbrace{(p_ t - y_ t)}_{v \times 1} \cdot \underbrace{s_ t^ T}_{1 \times m}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
Whew! </p></li></ol><p>
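<p>
To tie the pieces together, here is a sketch (again with our own variable names, assuming the state update [mathjaxinline]z^1_ t = W^{ss} s_{t-1} + W^{sx} x_ t + W_0[/mathjaxinline], [mathjaxinline]s_ t = f^1(z^1_ t)[/mathjaxinline], and a zero initial state; this is not the course's released implementation) of a backward pass over one sequence that accumulates all three weight gradients while iterating backwards: </p>
<pre>
import numpy as np

# BPTT over one sequence. xs[t] is (l x 1), ss[t] and z1s[t] are (m x 1),
# dLdz2s[t] = (p_t - y_t) is (v x 1); Wsx is (m x l), Wss (m x m), Wo (v x m).
def bptt_gradients(Wsx, Wss, Wo, xs, ss, z1s, dLdz2s, f1prime):
    n = len(xs)
    dWsx, dWss, dWo = np.zeros_like(Wsx), np.zeros_like(Wss), np.zeros_like(Wo)
    delta_st = np.zeros((Wss.shape[0], 1))        # delta^{s_n}: no losses after step n
    for t in range(n - 1, -1, -1):
        dWo += dLdz2s[t] @ ss[t].T                # dL_seq/dWo += (p_t - y_t) s_t^T
        dF_dst = Wo.T @ dLdz2s[t] + delta_st      # dF_{t-1}/ds_t, an (m x 1) vector
        dF_dz1 = f1prime(z1s[t]) * dF_dst         # push back through elementwise f^1
        s_prev = ss[t - 1] if t > 0 else np.zeros_like(ss[0])   # zero initial state
        dWss += dF_dz1 @ s_prev.T                 # dL_seq/dWss accumulation
        dWsx += dF_dz1 @ xs[t].T                  # dL_seq/dWsx accumulation
        delta_st = Wss.T @ dF_dz1                 # becomes delta^{s_{t-1}} next iteration
    return dWsx, dWss, dWo
</pre>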
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Derive the updates for the offsets [mathjaxinline]W_0[/mathjaxinline] and [mathjaxinline]W^ O_0[/mathjaxinline].</span> <br/></p><p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L10f_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: RNNs - training a language model</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10f">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10f" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: RNNs - training a language model</h3>
<div
id="video_MIT6036L10f"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:z5hF3tQVQpY", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10f/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10f/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10f/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10f/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L10f"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@rnn_training_a_language_model_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Training a language model</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@rnn_training_a_language_model">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@rnn_training_a_language_model" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
A <em>language model</em> is just trained on a set of input sequences, [mathjaxinline](c_1^{(i)}, c_2^{(i)}, \ldots , c_{n^ i}^{(i)})[/mathjaxinline], and is used to predict the next character, given a sequence <span options="" class="marginote"><span class="marginote_desc" style="display:none">A “token" is generally a character or a word.</span><span>of previous tokens: </span></span> </p><table id="a0000000063" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]c_ t = \text {RNN}(c_1, c_2, \dots , c_{t - 1})\; \;[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
We can convert this to a sequence-to-sequence training problem by constructing a data set of [mathjaxinline](x, y)[/mathjaxinline] sequence pairs, where we make up new special tokens, [mathjaxinline]\text {start}[/mathjaxinline] and [mathjaxinline]\text {end}[/mathjaxinline], to signal the beginning and end of the sequence: </p><table id="a0000000064" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000065"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle x[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = (\langle \text {start}\rangle , c_1, c_2, \dots , c_ n)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000066"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle y[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = (c_1, c_2, \dots , \langle \text {end}\rangle )[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
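<p>
As a small illustration (the helper name and plain-string tokens below are our own, standing in for the special [mathjaxinline]\langle \text {start}\rangle[/mathjaxinline] and [mathjaxinline]\langle \text {end}\rangle[/mathjaxinline] symbols), such training pairs might be constructed like this: </p>
<pre>
# Turn raw character sequences into (x, y) pairs for sequence-to-sequence
# training, shifting by one position and adding start/end tokens.
def make_lm_pairs(sequences, start="start_token", end="end_token"):
    data = []
    for chars in sequences:          # e.g. chars = "cat"
        x = [start] + list(chars)    # (start, c_1, ..., c_n)
        y = list(chars) + [end]      # (c_1, ..., c_n, end)
        data.append((x, y))
    return data

# make_lm_pairs(["cat"]) ->
#   [(['start_token', 'c', 'a', 't'], ['c', 'a', 't', 'end_token'])]
</pre>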
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L10g_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: RNNs - gating mechanisms and LSTM</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10g">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10g" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: RNNs - gating mechanisms and LSTM</h3>
<div
id="video_MIT6036L10g"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:XLepfA7iPgE", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10g/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10g/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10g/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L10g/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L10g"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@rnn_vanishing_gradients_and_gating_mechanisms_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Vanishing gradients and gating mechanisms</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@rnn_vanishing_gradients_and_gating_mechanisms">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="ee3d0bb903de11f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@rnn_vanishing_gradients_and_gating_mechanisms" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Let's take a careful look at the backward propagation of the gradient along the sequence: </p><table id="a0000000067" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\delta ^{s_{t -1}} = \frac{\partial s_ t}{\partial s_{t - 1}} \cdot \left[\frac{\partial \text {Loss}_\text {elt}(p_ t, y_ t)}{\partial s_ t} + \delta ^{s_ t}\right]\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
Consider a case where only the output at the end of the sequence is incorrect, but it depends critically, via the weights, on the input at time 1. In this case, the gradient of the loss at step [mathjaxinline]n[/mathjaxinline] will be multiplied by </p><table id="a0000000068" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial s_2}{\partial s_1} \cdot \frac{\partial s_3}{\partial s_2} \cdots \frac{\partial s_ n}{\partial s_{n-1}}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
In general, this product will either grow or shrink exponentially with the length of the sequence, which makes the network very difficult to train. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">The last time we talked about exploding and vanishing gradients, it was to justify per-weight adaptive step sizes. Why is that not a solution to the problem this time?</span> <br/></p><p>
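<p>
A tiny numerical experiment (the numbers are arbitrary, chosen only for illustration) shows this effect: repeatedly applying the same state-to-state Jacobian to a gradient either drives it toward zero or blows it up, depending on whether the Jacobian is contractive or expansive. </p>
<pre>
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 1))          # stand-in for a gradient arriving at step n
for scale in [0.5, 1.5]:             # contractive vs. expansive ds_t/ds_{t-1}
    J = scale * np.eye(4)            # toy Jacobian, applied once per time step
    v = g.copy()
    for _ in range(50):              # propagate back over a sequence of length ~50
        v = J @ v
    print(scale, np.linalg.norm(v))  # shrinks toward 0 for 0.5, explodes for 1.5
</pre>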
An important insight that really made recurrent networks work well on long sequences was the idea of <em>gating</em>. </p><p><h3>Simple gated recurrent networks</h3></p><p>
A computer only ever updates some parts of its memory on each computation cycle. We can take this idea and use it to make our networks better able to retain state values over time and to make the gradients better-behaved. We will add a new component to our network, called a <em>gating network</em>. Let [mathjaxinline]g_ t[/mathjaxinline] be an [mathjaxinline]m \times 1[/mathjaxinline] vector of values and let [mathjaxinline]W^{gx}[/mathjaxinline] and [mathjaxinline]W^{gs}[/mathjaxinline] be [mathjaxinline]m \times l[/mathjaxinline] and [mathjaxinline]m \times m[/mathjaxinline] weight matrices, respectively. We will compute [mathjaxinline]g_ t[/mathjaxinline] <span options="" class="marginote"><span class="marginote_desc" style="display:none">It can have an offset, too, but we are omitting it for simplicity.</span><span>as </span></span> </p><table id="a0000000069" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]g_ t = \text {sigmoid}(W^{gx} x_ t + W^{gs} s_{t-1})[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
and then change the computation of [mathjaxinline]s_ t[/mathjaxinline] to be </p><table id="a0000000070" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]s_ t = (1 - g_ t) * s_{t-1} + g_ t * f_1(W^{sx}x_ t + W^{ss} s_{t-1} + W_0)\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]*[/mathjaxinline] is component-wise multiplication. We can see, here, that the output of the gating network is deciding, for each dimension of the state, how much it should be updated now. This mechanism makes it much easier for the network to learn to, for example, “store" some information in some dimension of the state, and then not change it during future state updates, or change it only under certain conditions on the input or other aspects of the state. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Why is it important that the activation function for [mathjaxinline]g[/mathjaxinline] be a sigmoid?</span> <br/></p><p><h3>Long short-term memory</h3></p><p>
The idea of gating networks can be applied to make a state machine that is even more like a computer memory, resulting in a type of network called an <i class="sc">lstm</i> for “long short-term <span options="" class="marginote"><span class="marginote_desc" style="display:none">Yet another awesome name for a neural network!</span><span>memory." </span></span> We won't go into the details here, but the basic idea is that there is a memory cell (really, our state vector) and three (!) gating networks. The <em>input</em> gate selects (using a “soft" selection as in the gated network above) which dimensions of the state will be updated with new values; the <em>forget</em> gate decides which dimensions of the state will have their old values moved toward 0; and the <em>output</em> gate decides which dimensions of the state will be used to compute the output value. These networks have been used in applications like language translation with really amazing results. A diagram of the architecture is shown below: </p><center><img src="/assets/courseware/v1/8e7c0fe7c60323338b556ae490d116cf/asset-v1:MITx+6.036+1T2019+type@asset+block/images_lstm.png" width="400" style="scale : 0.7"/></center><p>
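<p>
To make the simple gated update above concrete, here is a minimal NumPy sketch of one gated state update (our own names and shapes, with tanh standing in for [mathjaxinline]f_1[/mathjaxinline]); the LSTM elaborates this idea with its separate input, forget, and output gates, which we do not sketch here: </p>
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One step of the simple gated recurrence.
# Shapes: x_t (l x 1), s_prev (m x 1), Wgx and Wsx (m x l), Wgs and Wss (m x m), W0 (m x 1).
def gated_step(x_t, s_prev, Wgx, Wgs, Wsx, Wss, W0, f1=np.tanh):
    g_t = sigmoid(Wgx @ x_t + Wgs @ s_prev)        # per-dimension: how much to update now
    candidate = f1(Wsx @ x_t + Wss @ s_prev + W0)  # the ordinary recurrent update
    return (1 - g_t) * s_prev + g_t * candidate    # componentwise blend: keep old vs. write new
</pre>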
</div>
</div>
</div>
</div>
© All Rights Reserved