<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_notes_2" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Notes – Chapter 8: Neural Networks</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_notes_top_2">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_notes_top_2" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
You can sequence through the Neural Networks II lecture video and note segments (go to Next page). </p><p>
This week's lecture notes are a continuation (later sections) of <a href="/assets/courseware/v1/9c36c444e5df10eef7ce4d052e4a2ed1/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Neural_Networks.pdf" target="_blank">Chapter 8: Neural Networks</a> which you can download as a PDF file. </p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06a_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - brief review of layers and backprop</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06a">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06a" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - brief review of layers and backprop</h3>
<div
id="video_MIT6036L06a"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:S4l7XuMVxwA", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06a/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06a/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06a/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06a/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06a"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06b_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - optimizing parameters - batch gradient descent training</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06b">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06b" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - optimizing parameters - batch gradient descent training</h3>
<div
id="video_MIT6036L06b"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:9ETYna-tOMQ", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06b/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06b/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06b/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06b/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06b"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06c_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - optimizing parameters - adaptive step-size</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06c">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06c" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - optimizing parameters - adaptive step-size</h3>
<div
id="video_MIT6036L06c"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:gQafJTkYfv4", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06c/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06c/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06c/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06c/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06c"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06d_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - optimizing parameters - running averages</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06d">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06d" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - optimizing parameters - running averages</h3>
<div
id="video_MIT6036L06d"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:tO_lQilOutA", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06d/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06d/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06d/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06d/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06d"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06e_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - optimizing parameters - momentum</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06e">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06e" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - optimizing parameters - momentum</h3>
<div
id="video_MIT6036L06e"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:OXdtfhiLsWk", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06e/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06e/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06e/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06e/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06e"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06f_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - optimizing parameters - adagrad and adadelta</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06f">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06f" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - optimizing parameters - adagrad and adadelta</h3>
<div
id="video_MIT6036L06f"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:B0Zixpgysns", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06f/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06f/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06f/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06f/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06f"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06g_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - optimizing parameters - adam step-size update strategy</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06g">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06g" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - optimizing parameters - adam step-size update strategy</h3>
<div
id="video_MIT6036L06g"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:y6eDig5zgZM", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06g/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06g/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06g/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06g/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06g"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_2_optimizing_neural_network_parameters_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Optimizing neural network parameters</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_2_optimizing_neural_network_parameters">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_2_optimizing_neural_network_parameters" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Because neural networks are just parametric functions, we can optimize loss with respect to the parameters using standard gradient-descent software, but we can take advantage of the structure of the loss function and the hypothesis class to improve optimization. As we have seen, the modular function-composition structure of a neural network hypothesis makes it easy to organize the computation of the gradient. As we have also seen earlier, the structure of the loss function as a sum over terms, one per training data point, allows us to consider stochastic gradient methods. In this section we'll consider some alternative strategies for organizing training, and also for making it easier to handle the step-size parameter. </p><p><h3>Batches</h3> Assume that we have an objective of the form </p><table id="a0000000002" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]J(W) = \sum _{i = 1}^ n \mathcal{L}(h(x^{(i)}; W), y^{(i)})\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]h[/mathjaxinline] is the function computed by a neural network, and [mathjaxinline]W[/mathjaxinline] stands for all the weight matrices and vectors in the network. </p><p>
When we perform <em>batch</em> gradient descent, we use the update rule </p><table id="a0000000003" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]W := W - \eta \nabla _ W J(W)\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
which is equivalent to </p><table id="a0000000004" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]W := W - \eta \sum _{i=1}^ n \nabla _ W \mathcal{L}(h(x^{(i)}; W), y^{(i)})\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
So, we sum up the gradient of loss at each training point, with respect to [mathjaxinline]W[/mathjaxinline], and then take a step in the negative direction of the gradient. </p><p>
In <em>stochastic</em> gradient descent, we repeatedly pick a point [mathjaxinline](x^{(i)}, y^{(i)})[/mathjaxinline] at random from the data set, and execute a weight update on that point alone: </p><table id="a0000000005" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]W := W - \eta \nabla _ W \mathcal{L}(h(x^{(i)}; W), y^{(i)})\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
As long as we pick points uniformly at random from the data set, and decrease [mathjaxinline]\eta[/mathjaxinline] at an appropriate rate, we are guaranteed, with high probability, to converge to at least a local optimum. </p><p>
These two methods have offsetting virtues. The batch method takes steps in the exact gradient direction but requires a lot of computation before even a single step can be taken, especially if the data set is large. The stochastic method begins moving right away, and can sometimes make very good progress before looking at even a substantial fraction of the whole data set, but if there is a lot of variability in the data, it might require a very small [mathjaxinline]\eta[/mathjaxinline] to effectively average over the individual steps moving in “competing" directions. </p><p>
An effective strategy is to “average" between batch and stochastic gradient descent by using <em>mini-batches</em>. For a mini-batch of size [mathjaxinline]k[/mathjaxinline], we select [mathjaxinline]k[/mathjaxinline] distinct data points uniformly at random from the data set and do the update based just on their contributions to the gradient </p><table id="a0000000006" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]W := W - \eta \sum _{i=1}^ k \nabla _ W \mathcal{L}(h(x^{(i)}; W), y^{(i)})\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
Most neural network software packages are set up to do mini-batches. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">For what value of [mathjaxinline]k[/mathjaxinline] is mini-batch gradient descent equivalent to stochastic gradient descent? To batch gradient descent?</span> <br/></p><p>
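As a concrete illustration, here is a minimal numpy sketch of the mini-batch update above; <tt class="tt">loss_grad(W, X, y)</tt> is a hypothetical helper that returns the gradient, with respect to [mathjaxinline]W[/mathjaxinline], of the loss summed over the given points, and the default values of <tt class="tt">eta</tt>, <tt class="tt">k</tt>, and <tt class="tt">steps</tt> are illustrative only. </p>
<pre>
import numpy as np

def minibatch_gd(W, X, y, loss_grad, eta=0.01, k=32, steps=1000):
    # X: (n, d) data matrix; y: (n,) labels; W: current weights.
    n = X.shape[0]
    for _ in range(steps):
        # Choose k distinct points uniformly at random (without replacement).
        idx = np.random.choice(n, size=k, replace=False)
        # Sum the per-point gradients over the mini-batch and take a step.
        W = W - eta * loss_grad(W, X[idx], y[idx])
    return W
</pre>
<p>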
Picking [mathjaxinline]k[/mathjaxinline] unique data points at random from a large data set is potentially computationally difficult. An alternative strategy, if you have an efficient procedure for randomly shuffling the data set (or randomly shuffling a list of indices into the data set), is to operate in a loop, roughly as follows: </p><p><img src="/assets/courseware/v1/6bcd49c40768c372a16b54595aac0b9c/asset-v1:MITx+6.036+1T2019+type@asset+block/images_neural_networks_2_optimizing_neural_network_parameters_codebox_1-crop.png" width="650"/></p><p><h3>Adaptive step-size</h3> Picking a value for [mathjaxinline]\eta[/mathjaxinline] is difficult and time-consuming. If it's too small, then convergence is slow, and if it's too large, then we risk divergence or slow convergence due to oscillation. This problem is even more pronounced in stochastic or mini-batch mode, because we know we need to decrease the step size for the formal guarantees to hold. </p><p>
It's also true that, within a single neural network, we may well want to have different step sizes. As our networks become <em>deep</em> (with increasing numbers of layers), we can find that the magnitude of the gradient of the loss with respect to the weights in the last layer, [mathjaxinline]\partial \text {loss} / \partial W_ L[/mathjaxinline], may be substantially different from that of the gradient with respect to the weights in the first layer, [mathjaxinline]\partial \text {loss} / \partial W_1[/mathjaxinline]. If you look carefully at the back-propagation equations from the earlier sections of this chapter, you can see that the output gradient is multiplied by all the weight matrices of the network and is “fed back" through all the derivatives of all the activation functions. This can lead to a problem of <em>exploding</em> or <em>vanishing</em> gradients, in which the back-propagated gradient is much too big or small to be used in an update rule with the same step size. </p><p>
So, we'll consider having an independent step-size parameter <em>for each weight</em>, and updating it based on a local view of how the gradient updates <span options="" class="marginote"><span class="marginote_desc" style="display:none">This section is very strongly influenced by Sebastian Ruder's excellent blog posts on the topic: <tt class="tt"><small class="scriptsize">ruder.io/ optimizing-gradient-descent</small></tt></span><span>have been going.</span></span> </p><p><h4>Running averages</h4></p><p>
We'll start by looking at the notion of a <em>running average</em>. It's a computational strategy for estimating a possibly weighted average of a sequence of data. Let our data sequence be [mathjaxinline]a_1, a_2, \ldots[/mathjaxinline]; then we define a sequence of running average values, [mathjaxinline]A_0, A_1, A_2, \ldots[/mathjaxinline] using the equations </p><table id="a0000000007" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000008"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle A_0[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = 0[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000009"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle A_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \gamma _ t A_{t-1} + (1 - \gamma _ t) a_ t[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
where [mathjaxinline]\gamma _ t \in (0, 1)[/mathjaxinline]. If [mathjaxinline]\gamma _ t[/mathjaxinline] is a constant, then this is a <em>moving</em> average, in which </p><table id="a0000000010" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000011"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle A_ T[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \gamma A_{T-1} + (1 - \gamma ) a_ T[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000012"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \gamma (\gamma A_{T-2} + (1 - \gamma ) a_{T-1}) + (1 - \gamma ) a_ T[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000013"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _{t = 1}^ T \gamma ^{T-t}(1 - \gamma ) a_ t[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
So, you can see that inputs [mathjaxinline]a_ t[/mathjaxinline] closer to the end of the sequence have more effect on [mathjaxinline]A_ T[/mathjaxinline] than early inputs. </p><p>
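To make the recursion concrete, here is a tiny sketch of the constant-[mathjaxinline]\gamma[/mathjaxinline] moving average in code; the names are illustrative only. </p>
<pre>
def moving_average(data, gamma=0.9):
    # A_0 = 0;  A_t = gamma * A_{t-1} + (1 - gamma) * a_t
    A = 0.0
    for a in data:
        A = gamma * A + (1 - gamma) * a
    return A
</pre>
<p>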
If, instead, we set [mathjaxinline]\gamma _ t = (t - 1) / t[/mathjaxinline], then we get the actual average. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> Prove to yourself that the previous assertion holds. </span> <br/></p><p><h4>Momentum</h4> Now, we can use methods that are a bit like running averages to describe strategies for computing [mathjaxinline]\eta[/mathjaxinline]. The simplest method is <em>momentum</em>, in which we try to “average" recent gradient updates, so that if they have been bouncing back and forth in some direction, we take out that component of the motion. For momentum, we have </p><table id="a0000000014" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000015"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle V_0[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = 0[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000016"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle V_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \gamma V_{t-1} + \eta \nabla _ W J(W_{t-1})[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000017"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W_{t-1} - V_ t[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
This doesn't quite look like an adaptive step size. But what we can see is that, if we let [mathjaxinline]\eta = \eta '(1 - \gamma )[/mathjaxinline], then the rule looks exactly like doing an update with step size [mathjaxinline]\eta '[/mathjaxinline] on a moving average of the gradients with parameter [mathjaxinline]\gamma[/mathjaxinline]: </p><table id="a0000000018" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000019"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle M_0[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = 0[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000020"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle M_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \gamma M_{t-1} + (1 - \gamma ) \nabla _ W J(W_{t-1})[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000021"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W_{t-1} - \eta ' M_ t[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Prove to yourself that these formulations are equivalent.</span> <br/></p><p>
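A minimal sketch of the second (moving-average) formulation in code, assuming a hypothetical <tt class="tt">grad_fn(W)</tt> that returns [mathjaxinline]\nabla _ W J(W)[/mathjaxinline] and illustrative parameter values: </p>
<pre>
def momentum_step(W, M, grad_fn, eta_prime=0.01, gamma=0.9):
    # M_t = gamma * M_{t-1} + (1 - gamma) * grad J(W_{t-1})
    M = gamma * M + (1 - gamma) * grad_fn(W)
    # W_t = W_{t-1} - eta' * M_t
    W = W - eta_prime * M
    return W, M
</pre>
<p>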
We will find that [mathjaxinline]V_ t[/mathjaxinline] will be bigger in dimensions that consistently have the same sign for [mathjaxinline]\nabla _ W J[/mathjaxinline] and smaller for those that don't. Of course, we now have <em>two</em> parameters to set ([mathjaxinline]\eta[/mathjaxinline] and [mathjaxinline]\gamma[/mathjaxinline]), but the hope is that the algorithm will perform better overall, so it will be worth trying to find good values for them. Often [mathjaxinline]\gamma[/mathjaxinline] is set to be something like [mathjaxinline]0.9[/mathjaxinline]. <div style="border-radius:10px;padding:5px;border-style:solid;background-color:rgba(0,255,0,0.03);" class="examplebox"><center><img src="/assets/courseware/v1/41a63f2d81d2ce6976fd3efaf714394c/asset-v1:MITx+6.036+1T2019+type@asset+block/images_momentum.png" width="400" style="scale:0.7"/></center><p>
The red arrows show the update after one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient with respect to the mini-batch at each step. Momentum smooths the path taken towards the local minimum and leads to faster convergence. </p></div> <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">If you set [mathjaxinline]\gamma = 0.1[/mathjaxinline], would momentum have more of an effect or less of an effect than if you set it to [mathjaxinline]0.9[/mathjaxinline]? </span> <br/></p><p><h4>Adadelta</h4> Another useful idea is this: we would like to take larger steps in parts of the space where [mathjaxinline]J(W)[/mathjaxinline] is nearly flat (because there's no risk of taking too big a step due to the gradient being large) and smaller steps when it is steep. We'll apply this idea to each weight independently, and end up with a method called <em>adadelta</em>, which is a variant on <em>adagrad</em> (for adaptive gradient). Even though our weights are indexed by layer, input unit and output unit, for simplicity here, just let [mathjaxinline]W_ j[/mathjaxinline] be any weight in the network (we will do the same thing for all of them). </p><table id="a0000000022" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000023"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle g_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \nabla _ W J(W_{t-1})_ j[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000024"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle G_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \gamma G_{t - 1,j} + (1 - \gamma )g_{t,j}^2[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000025"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W_{t-1, j} - \frac{\eta }{\sqrt {G_{t,j} + \epsilon }}g_{t,j}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
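In code, an update of this form, applied elementwise to an array of weights, might be sketched as follows; <tt class="tt">grad_fn</tt> is again a hypothetical helper returning [mathjaxinline]\nabla _ W J(W)[/mathjaxinline], and the constants are illustrative. </p>
<pre>
import numpy as np

def adadelta_step(W, G, grad_fn, eta=0.01, gamma=0.9, eps=1e-8):
    g = grad_fn(W)                         # g_t
    G = gamma * G + (1 - gamma) * g**2     # moving average of squared gradient
    W = W - (eta / np.sqrt(G + eps)) * g   # per-weight scaled step
    return W, G
</pre>
<p>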
The sequence [mathjaxinline]G_{t,j}[/mathjaxinline] is a moving average of the square of the [mathjaxinline]j[/mathjaxinline]th component of the gradient. We square it in order to be insensitive to the sign—we want to know whether the magnitude is big or small. Then, we perform a gradient update to weight [mathjaxinline]j[/mathjaxinline], but divide the step size by [mathjaxinline]\sqrt {G_{t,j} + \epsilon }[/mathjaxinline], which is larger when the surface is steeper in direction [mathjaxinline]j[/mathjaxinline] at point [mathjaxinline]W_{t-1}[/mathjaxinline] in weight space; this means that the step size will be smaller when it's steep and larger when it's flat. </p><p><h4>Adam</h4> Adam has become the default method of managing step <span options="" class="marginote"><span class="marginote_desc" style="display:none">Although, interestingly, it may actually violate the convergence conditions of <i class="sc">sgd</i>: <small class="scriptsize"><tt class="tt">arxiv.org/abs/1705.08292</tt></small></span><span>sizes in neural networks</span></span>. It combines the ideas of momentum and adadelta. We start by writing moving averages of the gradient and squared gradient, which reflect estimates of the mean and variance of the gradient for weight [mathjaxinline]j[/mathjaxinline]: </p><table id="a0000000026" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000027"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle g_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \nabla _ W J(W_{t-1})_ j[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000028"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle m_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = B_1m_{t - 1,j} + (1 - B_1)g_{t,j}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000029"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle v_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = B_2v_{t - 1,j} + (1 - B_2)g_{t,j}^2 \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
A problem with these estimates is that, if we initialize [mathjaxinline]m_0 = v_0 = 0[/mathjaxinline], they will always be biased (slightly too small). So we will correct for that bias by defining </p><table id="a0000000030" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000031"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \hat{m}_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{m_{t,j}}{1 - B^ t_1}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000032"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \hat{v}_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{v_{t,j}}{1 - B^ t_2}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000033"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W_{t,j}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W_{t-1,j} - \frac{\eta }{\sqrt {\hat{v}_{t,j} + \epsilon }}\hat{m}_{t,j} \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Note that [mathjaxinline]B^ t_1[/mathjaxinline] is [mathjaxinline]B_1[/mathjaxinline] raised to the power [mathjaxinline]t[/mathjaxinline], and likewise for [mathjaxinline]B^ t_2[/mathjaxinline]. To justify these corrections, note that if we were to expand [mathjaxinline]m_{t,j}[/mathjaxinline] in terms of [mathjaxinline]m_{0,j}[/mathjaxinline] and [mathjaxinline]g_{0,j}, g_{1,j}, \dots , g_{t,j}[/mathjaxinline] the coefficients would sum to [mathjaxinline]1[/mathjaxinline]. However, the coefficient behind [mathjaxinline]m_{0,j}[/mathjaxinline] is [mathjaxinline]B_1^ t[/mathjaxinline] and since [mathjaxinline]m_{0,j} = 0[/mathjaxinline], the sum of coefficients of non-zero terms is [mathjaxinline]1 - B_1^ t[/mathjaxinline], hence the correction. The same justification holds for [mathjaxinline]v_{t,j}[/mathjaxinline]. </p><p>
Now, our update for weight [mathjaxinline]j[/mathjaxinline] has a step size that takes the steepness into account, as in adadelta, but also tends to move in the same direction, as in momentum. The authors of this method propose setting [mathjaxinline]B_1 = 0.9, B_2 = 0.999, \epsilon = 10^{-8}[/mathjaxinline]. Although we now have even more parameters, Adam is not highly sensitive to their values (small changes do not have a huge effect on the result). <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Define [mathjaxinline]\hat{m_ j}[/mathjaxinline] directly as a moving average of [mathjaxinline]g_{t,j}[/mathjaxinline]. What is the decay ([mathjaxinline]\gamma[/mathjaxinline] parameter)? </span> <br/>Even though we now have a step-size for each weight, and we have to update various quantities on each iteration of gradient descent, it's relatively easy to implement by maintaining a matrix for each quantity ([mathjaxinline]m^{\ell }_ t, v^{\ell }_ t, g^{\ell }_ t, {g^{2}_ t}^{\ell }[/mathjaxinline]) in each layer of the network. </p><p>
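Putting the pieces together, a minimal numpy sketch of one Adam update is below; <tt class="tt">grad_fn</tt> is again a hypothetical helper returning [mathjaxinline]\nabla _ W J(W)[/mathjaxinline], the values of [mathjaxinline]B_1[/mathjaxinline], [mathjaxinline]B_2[/mathjaxinline], and [mathjaxinline]\epsilon[/mathjaxinline] are those suggested above, and the default step size is just an illustrative choice. </p>
<pre>
import numpy as np

def adam_step(W, m, v, t, grad_fn, eta=0.001, B1=0.9, B2=0.999, eps=1e-8):
    # t counts updates starting from 1; m and v have the same shape as W.
    g = grad_fn(W)
    m = B1 * m + (1 - B1) * g          # moving average of the gradient
    v = B2 * v + (1 - B2) * g**2       # moving average of the squared gradient
    m_hat = m / (1 - B1**t)            # bias corrections
    v_hat = v / (1 - B2**t)
    W = W - (eta / np.sqrt(v_hat + eps)) * m_hat
    return W, m, v
</pre>
<p>
Maintaining <tt class="tt">m</tt> and <tt class="tt">v</tt> as arrays with the same shape as the weights in each layer gives every weight its own effective step size, as described above. </p><p>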
<br/></p><p>
<br/></p><p><a href="/assets/courseware/v1/9c36c444e5df10eef7ce4d052e4a2ed1/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Making_NN_s_Work.pdf" target="_blank">Download this chapter as a PDF file</a></p><script src="/assets/courseware/v1/1ab2c06aefab58693cfc9c10394b7503/asset-v1:MITx+6.036+1T2019+type@asset+block/marginotes.js" type="text/javascript"/><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Friday May 24, 2019; 02:29:06 PM (revision 4f166135)</center></span></span>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06h_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - regularization by weight decay</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06h">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06h" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - regularization by weight decay</h3>
<div
id="video_MIT6036L06h"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:81nv6dx0HQE", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06h/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06h/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06h/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06h/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06h"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06j_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - regularization by early stopping and dropout</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06j">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06j" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - regularization by early stopping and dropout</h3>
<div
id="video_MIT6036L06j"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:4-kgCZ6S6rs", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06j/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06j/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06j/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06j/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06j"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L06k_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Neural networks - regularization by batch normalization</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06k">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06k" data-init="XBlockToXModuleShim" data-block-type="video" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Neural networks - regularization by batch normalization</h3>
<div
id="video_MIT6036L06k"
class="video closed"
data-metadata='{"autoAdvance": false, "prioritizeHls": false, "recordedYoutubeIsAvailable": true, "ytTestTimeout": 1500, "poster": null, "streams": "1.00:1mhciX6XfAo", "saveStateEnabled": false, "end": 0.0, "speed": null, "completionPercentage": 0.95, "start": 0.0, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06k/handler/publish_completion", "duration": 0.0, "autoplay": false, "savedVideoPosition": 0.0, "generalSpeed": 1.0, "autohideHtml5": false, "ytMetadataEndpoint": "", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06k/handler/transcript/translation/__lang__", "showCaptions": "true", "completionEnabled": false, "captionDataDir": null, "ytApiUrl": "https://www.youtube.com/iframe_api", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06k/handler/xmodule_handler/save_user_state", "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L06k/handler/transcript/available_translations", "sources": [], "transcriptLanguages": {"en": "English"}, "transcriptLanguage": "en", "lmsRootURL": "https://openlearninglibrary.mit.edu"}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L06k"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@neural_networks_2_regularization_vert" data-init="VerticalStudentView" data-block-type="vertical" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Regularization</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_2_regularization">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-request-token="a7824ebe03e211f0a0bf0affe2bbc7c1" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@neural_networks_2_regularization" data-init="XBlockToXModuleShim" data-block-type="html" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-graded="False" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
So far, we have only considered optimizing loss on the training data as our objective for neural network training. But, as we have discussed before, there is a risk of overfitting if we do this. The pragmatic fact is that, in current deep neural networks, which tend to be very large and to be trained on large amounts of data, overfitting is not a huge problem in practice. This runs counter to our current theoretical understanding, and the study of this question is a hot area of research. Nonetheless, there are several strategies for regularizing a neural network, and they can sometimes be important. </p><p><h3>Methods related to ridge regression</h3></p><p>
One group of strategies can, interestingly, be shown to have similar effects: early stopping, weight decay, and adding noise to <span options="" class="marginote"><span class="marginote_desc" style="display:none">Result is due to Bishop, described in his textbook and here <tt class="tt"><small class="scriptsize">doi.org/10.1162/neco.1995.7.1.108</small></tt>.</span><span>the training data. </span></span> </p><p>
Early stopping is the easiest to implement and is in fairly common use. The idea is to train on your training set, but at every <em>epoch</em> (pass through the whole training set, or possibly more frequently), evaluate the loss of the current [mathjaxinline]W[/mathjaxinline] on a <em>validation set</em>. It will generally be the case that the loss on the training set goes down fairly consistently with each iteration, while the loss on the validation set will initially decrease but then begin to increase again. Once you see that the validation loss is systematically increasing, you can stop training and return the weights that had the lowest validation error. </p><p>
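To make the bookkeeping concrete, here is a minimal sketch of such a training loop (an illustration only, not code from these notes); <tt class="tt">train_one_epoch</tt> and <tt class="tt">validation_loss</tt> are hypothetical helpers standing in for one SGD pass over the training set and for evaluating loss on the validation set. </p><pre>
import copy

def train_with_early_stopping(W, train_data, val_data, max_epochs=100, patience=5):
    # Track the weights that achieved the lowest validation loss so far.
    best_W, best_val = copy.deepcopy(W), float('inf')
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        W = train_one_epoch(W, train_data)   # hypothetical: one SGD pass over the training set
        val = validation_loss(W, val_data)   # hypothetical: average loss on the validation set
        if best_val > val:                   # validation loss improved
            best_W, best_val = copy.deepcopy(W), val
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:  # validation loss systematically increasing: stop
            break
    return best_W                            # the weights with the lowest validation error
</pre><p>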
Another common strategy is to simply penalize the norm of all the weights, as we did in ridge regression. This method is known as <em>weight decay</em>, because when we take the gradient of the objective </p><table id="a0000000034" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]J(W) = \sum _{i = 1}^{n}\text {Loss}(\text {NN}(x^{(i)}), y^{(i)}; W) + \lambda \| W\| ^2[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
we end up with an update of the form </p><table id="a0000000035" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000036"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle W_ t[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W_{t-1} - \eta \left(\left(\nabla _{W}\text {Loss}(\text {NN}(x^{(i)}), y^{(i)}; W_{t-1})\right) + \lambda W_{t-1}\right)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000037"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = W_{t-1}(1 - \lambda \eta ) - \eta \left(\nabla _{W}\text {Loss}(\text {NN}(x^{(i)}), y^{(i)}; W_{t-1})\right) \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
This rule has the form of first “decaying” [mathjaxinline]W_{t-1}[/mathjaxinline] by a factor of [mathjaxinline](1 - \lambda \eta )[/mathjaxinline] and then taking a gradient step. </p><p>
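As a small illustration of this equivalence (a sketch under assumptions, not code from these notes), a single stochastic gradient step with weight decay can be written either way; <tt class="tt">grad_loss</tt> is a hypothetical function returning the gradient of the per-example loss with respect to [mathjaxinline]W[/mathjaxinline]. </p><pre>
import numpy as np

def sgd_step_with_weight_decay(W, x, y, eta, lam):
    """One SGD step on a single example with an L2 (weight decay) penalty."""
    g = grad_loss(W, x, y)           # hypothetical: gradient of Loss(NN(x), y; W) w.r.t. W
    # Form 1: gradient of the penalized objective
    W_new = W - eta * (g + lam * W)
    # Form 2: decay the weights, then take an ordinary gradient step (algebraically identical)
    W_alt = W * (1 - lam * eta) - eta * g
    assert np.allclose(W_new, W_alt)
    return W_new
</pre><p>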
Finally, the same effect can be achieved by perturbing the [mathjaxinline]x^{(i)}[/mathjaxinline] values of the training data by adding a small amount of zero-mean normally distributed noise before each gradient computation. It makes intuitive sense that it would be more difficult for the network to overfit to particular training data if they are changed slightly on each training step. </p><p><h3>Dropout</h3> Dropout is a regularization method that was designed to work with deep neural networks. The idea behind it is, rather than perturbing the data every time we train, we'll perturb the network! We'll do this by randomly, on each training step, selecting a set of units in each layer and prohibiting them from participating. Thus, all of the units will have to take a kind of “collective” responsibility for getting the answer right, and will not be able to rely on any small subset of the weights to do all the necessary computation. This tends also to make the network more robust to data perturbations. </p><p>
During the training phase, for each training example, for each unit, randomly with probability [mathjaxinline]p[/mathjaxinline] temporarily set [mathjaxinline]a^{\ell }_ j := 0[/mathjaxinline]. There will be no contribution to the output and no gradient update for the associated unit. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> Be sure you understand why, when using <i class="sc">sgd</i>, setting an activation value to 0 will cause that unit's weights not to be updated on that iteration. </span> <br/></p><p>
When we are done training and want to use the network to make predictions, we multiply all weights by [mathjaxinline](1-p)[/mathjaxinline], the probability that a unit was kept during training, to achieve the same average activation levels. </p><p>
Implementing dropout is easy! In the forward pass during training, we let </p><table id="a0000000038" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]a^{\ell } = f(z^{\ell }) * d^{\ell }[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]*[/mathjaxinline] denotes component-wise product and [mathjaxinline]d^{\ell }[/mathjaxinline] is a vector of [mathjaxinline]0[/mathjaxinline]'s and [mathjaxinline]1[/mathjaxinline]'s, each entry of which is independently [mathjaxinline]0[/mathjaxinline] with probability [mathjaxinline]p[/mathjaxinline] (and [mathjaxinline]1[/mathjaxinline] otherwise). The backward pass depends on [mathjaxinline]a^{\ell }[/mathjaxinline], so we do not need to make any further changes to the algorithm. </p><p>
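For example, a minimal numpy sketch of this masked forward step (an illustration under the conventions above, with [mathjaxinline]p[/mathjaxinline] the drop probability and <tt class="tt">f</tt> the layer's activation function) might look like this: </p><pre>
import numpy as np

def dropout_forward(z, f, p, training=True):
    """Apply activation f to pre-activations z, dropping each unit with probability p."""
    a = f(z)
    if training:
        d = (np.random.rand(*z.shape) > p).astype(a.dtype)  # entries are 1 with probability 1-p
        return a * d          # dropped units contribute nothing (and receive no gradient update)
    # At prediction time, scaling the activations by 1-p has the same effect as
    # multiplying the next layer's incoming weights by 1-p, as described in the text.
    return a * (1 - p)
</pre><p>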
It is common to set [mathjaxinline]p[/mathjaxinline] to [mathjaxinline]0.5[/mathjaxinline], but this is a hyperparameter one might experiment with to get good results on a particular problem and data set. </p><p><h3>Batch Normalization</h3></p><p>
A more modern alternative to dropout, which tends to achieve better performance, is <span options="" class="marginote"><span class="marginote_desc" style="display:none">For more details see <tt class="tt"><small class="scriptsize">arxiv.org/abs/1502.03167</small></tt>.</span><span><em>batch normalization</em>. </span></span> It was originally developed to address a problem of <em>covariate shift</em>: that is, if you consider the second layer of a two-layer neural network, the distribution of its input values is changing over time as the first layer's weights change. Learning when the input distribution is changing is extra difficult: you have to change your weights to improve your predictions, but also just to compensate for a change in your inputs (imagine, for instance, that the magnitude of the inputs to your layer is increasing over time—then your weights will have to decrease, just to keep your predictions the same). </p><p>
So, when training with mini-batches, the idea is to <em>standardize</em> the input values for each mini-batch, just in the way that we did in chapter 4, subtracting off the mean and dividing by the standard deviation of each input dimension. This means that the scale of the inputs to each layer remains the same, no matter how the weights in previous layers change. However, this somewhat complicates matters, because the computation of the weight updates will need to take into account that we are performing this transformation. In the modular view, batch normalization can be seen as a module that is applied to [mathjaxinline]z^ l[/mathjaxinline], interposed after the product with [mathjaxinline]W^ l[/mathjaxinline] and before input to [mathjaxinline]f^ l[/mathjaxinline]. </p><p>
Batch normalization ends up having a regularizing effect for similar reasons that adding noise and dropout do: each mini-batch of data ends up being mildly perturbed, which prevents the network from exploiting very particular values of the data points. </p><p>
Let's think of the batch-norm layer as taking [mathjaxinline]Z^ l[/mathjaxinline] as input and producing [mathjaxinline]\widehat{Z}^ l[/mathjaxinline] as output. But now, instead of thinking of [mathjaxinline]Z^ l[/mathjaxinline] as an [mathjaxinline]n^ l \times 1[/mathjaxinline] vector, we have to explicitly think about handling a mini-batch of data of size [mathjaxinline]K[/mathjaxinline] all at once, so [mathjaxinline]Z^ l[/mathjaxinline] will be [mathjaxinline]n^ l \times K[/mathjaxinline], and so will the output [mathjaxinline]\widehat{Z}^ l[/mathjaxinline]. </p><p>
Our first step will be to compute the <em>batchwise</em> mean and standard deviation. Let [mathjaxinline]\mu ^ l[/mathjaxinline] be the [mathjaxinline]n^ l \times 1[/mathjaxinline] vector where </p><table id="a0000000039" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\mu ^ l_ i = \frac{1}{K} \sum _{j = 1}^ K Z^ l_{ij}\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
and let [mathjaxinline]\sigma ^ l[/mathjaxinline] be the [mathjaxinline]n^ l \times 1[/mathjaxinline] vector where </p><table id="a0000000040" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\sigma ^ l_ i = \sqrt {\frac{1}{K} \sum _{j = 1}^ K (Z^ l_{ij} - \mu ^ l_ i)^2}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
The basic normalized version of our data would be a matrix, element [mathjaxinline](i, j)[/mathjaxinline] of which is </p><table id="a0000000041" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\overline{Z}^ l_{ij} = \frac{Z^ l_{ij} - \mu ^ l_ i}{\sigma ^ l_ i + \epsilon }\; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]\epsilon[/mathjaxinline] is a very small constant to guard against division by zero. However, if we let these be our [mathjaxinline]\widehat{Z}[/mathjaxinline] values, we really are forcing something too strong on our data—our goal was to normalize across the data batch, but not necessarily force the output values to have exactly mean 0 and standard deviation 1. So, we will give the layer the “opportunity” to shift and scale the outputs by adding new weights to the layer. These weights are [mathjaxinline]G^ l[/mathjaxinline] and [mathjaxinline]B^ l[/mathjaxinline], each of which is an [mathjaxinline]n^ l \times 1[/mathjaxinline] vector. Using the weights, we define the final output to be </p><table id="a0000000042" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\widehat{Z}^ l_{ij} = G^ l_ i \overline{Z}^ l_{ij} + B^ l_ i\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
That's the forward pass. Whew! </p><p>
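As a concrete numpy sketch of this forward computation (an illustration under the notation above, not code from these notes), with [mathjaxinline]Z^ l[/mathjaxinline] stored as an [mathjaxinline]n^ l \times K[/mathjaxinline] array and [mathjaxinline]G^ l, B^ l[/mathjaxinline] as [mathjaxinline]n^ l \times 1[/mathjaxinline] arrays: </p><pre>
import numpy as np

def batchnorm_forward(Z, G, B, eps=1e-8):
    """Z is n^l x K (one column per example in the mini-batch); G and B are n^l x 1."""
    mu = Z.mean(axis=1, keepdims=True)                            # batchwise mean, n^l x 1
    sigma = np.sqrt(((Z - mu) ** 2).mean(axis=1, keepdims=True))  # batchwise std, n^l x 1
    Zbar = (Z - mu) / (sigma + eps)                               # normalized values
    Zhat = G * Zbar + B                                           # scaled and shifted output
    return Zhat, (Zbar, sigma, G)      # cache what the backward pass will need
</pre><p>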
Now, for the backward pass, we have to do two things: given [mathjaxinline]\partial L / \partial \widehat{Z}^ l[/mathjaxinline], </p><ul class="itemize"><li><p>
Compute [mathjaxinline]\partial L / \partial Z^ l[/mathjaxinline] for back-propagation, and </p></li><li><p>
Compute [mathjaxinline]\partial L / \partial G^ l[/mathjaxinline] and [mathjaxinline]\partial L / \partial B^ l[/mathjaxinline] for gradient updates of the weights in this layer. </p></li></ul><p>
Schematically </p><table id="a0000000043" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial L}{\partial B} = \frac{\partial L}{\partial \widehat{Z}}\frac{\partial \widehat{Z}}{\partial B}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
It's hard to think about these derivatives in matrix terms, so we'll see how it works for the components. [mathjaxinline]B_ i[/mathjaxinline] contributes to [mathjaxinline]\widehat{Z}_{ij}[/mathjaxinline] for all data points [mathjaxinline]j[/mathjaxinline] in the batch. So </p><table id="a0000000044" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000045"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{\partial L}{\partial B_ i}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _ j \frac{\partial L}{\partial \widehat{Z}_{ij}} \frac{\partial \widehat{Z}_{ij}}{\partial B_ i}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000046"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _ j \frac{\partial L}{\partial \widehat{Z}_{ij}}\; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Similarly, [mathjaxinline]G_ i[/mathjaxinline] contributes to [mathjaxinline]\widehat{Z}_{ij}[/mathjaxinline] for all data points [mathjaxinline]j[/mathjaxinline] in the batch. So </p><table id="a0000000047" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000048"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{\partial L}{\partial G_ i}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _ j \frac{\partial L}{\partial \widehat{Z}_{ij}} \frac{\partial \widehat{Z}_{ij}}{\partial G_ i}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000049"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \sum _ j \frac{\partial L}{\partial \widehat{Z}_{ij}} \overline{Z}_{ij}\; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Now, let's figure out how to do backprop. We can start schematically: </p><table id="a0000000050" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial \widehat{Z}} \frac{\partial \widehat{Z}}{\partial Z}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
And because dependencies only exist across the batch, but not across the unit outputs, </p><table id="a0000000051" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial L}{\partial Z_{ij}} = \sum _{k=1}^ K\frac{\partial L}{\partial \widehat{Z}_{ik}} \frac{\partial \widehat{Z}_{ik}}{\partial Z_{ij}}\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
The next step is to note that </p><table id="a0000000052" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000053"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{\partial \widehat{Z}_{ik}}{\partial Z_{ij}}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{\partial \widehat{Z}_{ik}}{\partial \overline{Z}_{ik}} \frac{\partial \overline{Z}_{ik}}{\partial Z_{ij}}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000054"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = G_ i \frac{\partial \overline{Z}_{ik}}{\partial Z_{ij}}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
And we note that </p><table id="a0000000055" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \overline{Z}_{ik}}{\partial Z_{ij}} = \left(\delta _{jk} - \frac{\partial \mu _ i}{\partial Z_{ij}}\right) \frac{1}{\sigma _ i} - \frac{Z_{ik} - \mu _ i}{\sigma _ i^2} \frac{\partial \sigma _ i}{\partial Z_{ij}} \; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
where [mathjaxinline]\delta _{jk} = 1[/mathjaxinline] if [mathjaxinline]j = k[/mathjaxinline] and [mathjaxinline]\delta _{jk} = 0[/mathjaxinline] otherwise. Getting close! We need two more small parts: </p><table id="a0000000056" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000057"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{\partial \mu _ i}{\partial Z_{ij}}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{1}{K}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000058"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{\partial \sigma _ i}{\partial Z_{ij}}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{Z_{ij} - \mu _ i}{K \sigma _ i}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Putting the whole crazy thing together, we get </p><table id="a0000000059" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial L}{\partial Z_{ij}} = \sum _{k=1}^ K\frac{\partial L}{\partial \widehat{Z}_{ik}} G_ i\frac{1}{K \sigma _ i}\left(\delta _{jk}K-1 - \frac{(Z_{ik} - \mu _ i)(Z_{ij} - \mu _ i)}{\sigma _ i^2} \right)[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
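As with the forward pass, this can be written as a compact vectorized numpy sketch (again an illustration only, using the cache from the forward-pass sketch above and ignoring the small [mathjaxinline]\epsilon[/mathjaxinline]): </p><pre>
import numpy as np

def batchnorm_backward(dLdZhat, cache):
    """dLdZhat is n^l x K; returns dL/dZ and the gradients for the weights G and B."""
    Zbar, sigma, G = cache
    K = dLdZhat.shape[1]
    dLdB = dLdZhat.sum(axis=1, keepdims=True)           # sum over the mini-batch
    dLdG = (dLdZhat * Zbar).sum(axis=1, keepdims=True)
    # dL/dZ_ij = sum_k dL/dZhat_ik * (G_i / (K sigma_i)) * (K delta_jk - 1 - Zbar_ik Zbar_ij)
    dLdZ = (G / (K * sigma)) * (K * dLdZhat - dLdB - Zbar * dLdG)
    return dLdZ, dLdG, dLdB
</pre>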
<p><a href="/assets/courseware/v1/9c36c444e5df10eef7ce4d052e4a2ed1/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Making_NN_s_Work.pdf" target="_blank">Download this chapter as a PDF file</a></p><script src="/assets/courseware/v1/1ab2c06aefab58693cfc9c10394b7503/asset-v1:MITx+6.036+1T2019+type@asset+block/marginotes.js" type="text/javascript"/><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Friday May 24, 2019; 02:29:06 PM (revision 4f166135)</center></span></span>
</div>
</div>
</div>
</div>