<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@gradient_descent_notes" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Notes – Chapter 6: Gradient Descent</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_notes_top">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_notes_top" data-block-type="html" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
You can sequence through the Gradient Descent lecture video and note segments (go to the Next page). Currently, the lecture videos are segments from a previous semester that discuss gradient descent in the context of margin maximization, so they do not match up perfectly with the Fall 2019 content. The revised Fall 2019 lecture video will be posted here one or two days after the live lecture. </p><p>
You can download the revised (Fall 2019) <a href="/assets/courseware/v1/d81d9ec0bd142738b069ce601382fdb7/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Gradient_Descent.pdf" target="_blank">Chapter 6: Gradient Descent</a> notes as a PDF file. </p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L03g_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Gradient descent optimization - algorithm in one dimension</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03g">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03g" data-block-type="video" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Gradient descent optimization - algorithm in one dimension</h3>
<div
id="video_MIT6036L03g"
class="video closed"
data-metadata='{"end": 0.0, "speed": null, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03g/handler/transcript/available_translations", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03g/handler/xmodule_handler/save_user_state", "poster": null, "saveStateEnabled": false, "completionPercentage": 0.95, "captionDataDir": null, "lmsRootURL": "https://openlearninglibrary.mit.edu", "sources": [], "ytApiUrl": "https://www.youtube.com/iframe_api", "ytMetadataEndpoint": "", "transcriptLanguage": "en", "transcriptLanguages": {"en": "English"}, "prioritizeHls": false, "autohideHtml5": false, "recordedYoutubeIsAvailable": true, "generalSpeed": 1.0, "autoAdvance": false, "ytTestTimeout": 1500, "streams": "1.00:8g2cFGnVyS0", "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03g/handler/publish_completion", "showCaptions": "true", "completionEnabled": false, "savedVideoPosition": 0.0, "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03g/handler/transcript/translation/__lang__", "start": 0.0, "duration": 0.0}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L03g"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@gradient_descent_top_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Gradient descent - introduction</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_top">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_top" data-block-type="html" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
In the previous chapter, we showed how to describe an interesting objective function for machine learning, but we need a way to find the optimal [mathjaxinline]\Theta ^* = {\rm arg}\min _{\Theta } J(\Theta )[/mathjaxinline]. There is an enormous, fascinating literature on the mathematical and algorithmic <span options="" class="marginote"><span class="marginote_desc" style="display:none">Which you should consider studying some day!</span><span>foundations of optimization</span></span>, but for this class, we will consider one of the simplest methods, called <em>gradient descent.</em> </p><p>
Intuitively, in one or two dimensions, we can easily think of [mathjaxinline]J(\Theta )[/mathjaxinline] as defining a surface over [mathjaxinline]\Theta[/mathjaxinline]; that same idea extends to higher dimensions. Now, our objective is to find the [mathjaxinline]\Theta[/mathjaxinline] value at the lowest point on that surface. One way to think about gradient descent is that you start at some arbitrary point on the surface, look to see in which direction the “hill” goes down most steeply, take a small step in that direction, determine the direction of steepest descent from where you are, take another small step, <span options="" class="marginote"><span class="marginote_desc" style="display:none">Here's a very old-school humorous description of gradient descent and other optimization algorithms using analogies involving kangaroos:<br/><small class="scriptsize"><tt class="tt">ftp://ftp.sas.com/pub/neural/kangaroos.txt</tt></small></span><span>etc. </span></span> </p><p>
<br/></p><p>
<br/></p><p><a href="/assets/courseware/v1/d81d9ec0bd142738b069ce601382fdb7/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Gradient_Descent.pdf" target="_blank">Download this chapter as a PDF file</a></p><script src="/assets/courseware/v1/1ab2c06aefab58693cfc9c10394b7503/asset-v1:MITx+6.036+1T2019+type@asset+block/marginotes.js" type="text/javascript"/><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Saturday November 16, 2019; 07:31:08 PM (revision f808f068e)</center></span></span>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L03h_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Gradient descent optimization - local optima</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03h">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03h" data-block-type="video" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Gradient descent optimization - local optima</h3>
<div
id="video_MIT6036L03h"
class="video closed"
data-metadata='{"end": 0.0, "speed": null, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03h/handler/transcript/available_translations", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03h/handler/xmodule_handler/save_user_state", "poster": null, "saveStateEnabled": false, "completionPercentage": 0.95, "captionDataDir": null, "lmsRootURL": "https://openlearninglibrary.mit.edu", "sources": [], "ytApiUrl": "https://www.youtube.com/iframe_api", "ytMetadataEndpoint": "", "transcriptLanguage": "en", "transcriptLanguages": {"en": "English"}, "prioritizeHls": false, "autohideHtml5": false, "recordedYoutubeIsAvailable": true, "generalSpeed": 1.0, "autoAdvance": false, "ytTestTimeout": 1500, "streams": "1.00:gaIOqD3Btmo", "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03h/handler/publish_completion", "showCaptions": "true", "completionEnabled": false, "savedVideoPosition": 0.0, "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03h/handler/transcript/translation/__lang__", "start": 0.0, "duration": 0.0}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L03h"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@gradient_descent_one_dimension_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">One dimension</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_one_dimension">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_one_dimension" data-block-type="html" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
We will start by considering gradient descent in one dimension. Assume [mathjaxinline]\Theta \in \mathbb {R}[/mathjaxinline], and that we know both [mathjaxinline]J(\Theta )[/mathjaxinline] and its first derivative with respect to [mathjaxinline]\Theta[/mathjaxinline], [mathjaxinline]J'(\Theta )[/mathjaxinline]. Here is pseudo-code for gradient descent on an arbitrary function [mathjaxinline]f[/mathjaxinline]. Along with [mathjaxinline]f[/mathjaxinline] and [mathjaxinline]f'[/mathjaxinline], we have to specify the initial value for parameter [mathjaxinline]\Theta[/mathjaxinline], a <em>step-size</em> parameter [mathjaxinline]\eta[/mathjaxinline], and an <em>accuracy</em> parameter [mathjaxinline]\epsilon[/mathjaxinline]: </p><p><img src="/assets/courseware/v1/d4745a744cbd1e290e98f06f71ff0e06/asset-v1:MITx+6.036+1T2019+type@asset+block/images_gradient_descent_one_dimension_codebox_1-crop.png" width="405"/></p><p>
Note that there are many other reasonable ways to decide to terminate, including when [mathjaxinline]\left| \Theta ^{(t)} - \Theta ^{(t-1)} \right| <\epsilon[/mathjaxinline] or when [mathjaxinline]\left|f(\Theta ^{(t)}) - f(\Theta ^{(t-1)}) \right| <\epsilon[/mathjaxinline]. </p><p><b class="bf">Theorem:</b><em> If [mathjaxinline]J[/mathjaxinline] is <span options="" class="marginote"><span class="marginote_desc" style="display:none">A function is convex if the line segment between any two points on the graph of the function lies above or on the graph.</span><span>convex,</span></span> for any desired accuracy [mathjaxinline]\epsilon[/mathjaxinline], there is some step size [mathjaxinline]\eta[/mathjaxinline] such that gradient descent will converge to within [mathjaxinline]\epsilon[/mathjaxinline] of the optimal [mathjaxinline]\Theta[/mathjaxinline]. <span class="rm"/></em> However, we must be careful when choosing the step size to prevent slow convergence, oscillation around the minimum, or divergence. </p><p>
The following plot illustrates a convex function [mathjaxinline]f(x) = (x - 2)^2[/mathjaxinline], starting gradient descent at [mathjaxinline]\theta _{init} = 4.0[/mathjaxinline] with a step-size of [mathjaxinline]1/2[/mathjaxinline]. It is very well-behaved! </p><center><p><img src="/assets/courseware/v1/b96fd14cf73ed2de2861ccc7c649f4cf/asset-v1:MITx+6.036+1T2019+type@asset+block/images_gradient_descent_one_dimension_tikzpicture_1-crop.png" width="465"/></p></center><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">What happens in this example with very small [mathjaxinline]\eta[/mathjaxinline]? With very big [mathjaxinline]\eta[/mathjaxinline]?</span> <br/></p><p>
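A good way to build intuition for these questions is to run the procedure yourself. Below is a minimal Python sketch of the one-dimensional algorithm (not the course's official implementation); it terminates when [mathjaxinline]\left|f(\Theta ^{(t)}) - f(\Theta ^{(t-1)}) \right| <\epsilon[/mathjaxinline], one of the reasonable criteria mentioned above, adds an iteration cap purely as a safety measure, and is applied to the example [mathjaxinline]f(x) = (x - 2)^2[/mathjaxinline] with [mathjaxinline]\theta _{init} = 4.0[/mathjaxinline] and [mathjaxinline]\eta = 1/2[/mathjaxinline]. </p><pre>
def gd_1d(f, df, theta_init, eta, epsilon, max_iter=10000):
    """One-dimensional gradient descent sketch.

    f: objective; df: its derivative; theta_init: starting point;
    eta: step size; epsilon: accuracy for the stopping test.
    max_iter is a safety cap added for this sketch.
    """
    theta = theta_init
    for t in range(1, max_iter + 1):
        theta_new = theta - eta * df(theta)         # step downhill
        if abs(f(theta_new) - f(theta)) < epsilon:  # termination test
            return theta_new, t
        theta = theta_new
    return theta, max_iter

# The example from the plot: f(x) = (x - 2)^2, theta_init = 4.0, eta = 1/2.
f  = lambda x: (x - 2) ** 2
df = lambda x: 2 * (x - 2)
theta_hat, steps = gd_1d(f, df, theta_init=4.0, eta=0.5, epsilon=1e-8)
print(theta_hat, steps)   # reaches 2.0 almost immediately
</pre><p>
Re-running the last two lines with a much smaller or much larger value of eta is a quick way to explore the study question above. </p><p>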
If [mathjaxinline]f[/mathjaxinline] is non-convex, where gradient descent converges depends on [mathjaxinline]\theta _{\it init}[/mathjaxinline]. When it reaches a value of [mathjaxinline]\theta[/mathjaxinline] where [mathjaxinline]f'(\theta ) = 0[/mathjaxinline] and [mathjaxinline]f''(\theta ) > 0[/mathjaxinline], but which is not the global minimum of the function, that point is called a <em>local minimum</em> or <em>local optimum</em>. </p><center><p><img src="/assets/courseware/v1/0dd5061bea79536215a00e8429d970c3/asset-v1:MITx+6.036+1T2019+type@asset+block/images_gradient_descent_one_dimension_tikzpicture_2-crop.png" width="465"/></p></center><p>
<br/></p><p>
<br/></p><p><a href="/assets/courseware/v1/d81d9ec0bd142738b069ce601382fdb7/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Gradient_Descent.pdf" target="_blank">Download this chapter as a PDF file</a></p><script src="/assets/courseware/v1/1ab2c06aefab58693cfc9c10394b7503/asset-v1:MITx+6.036+1T2019+type@asset+block/marginotes.js" type="text/javascript"/><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Friday May 24, 2019; 02:28:36 PM (revision 4f166135)</center></span></span>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L03j_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Gradient descent optimization - algorithm in multiple dimensions</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03j">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03j" data-block-type="video" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Gradient descent optimization - algorithm in multiple dimensions</h3>
<div
id="video_MIT6036L03j"
class="video closed"
data-metadata='{"end": 0.0, "speed": null, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03j/handler/transcript/available_translations", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03j/handler/xmodule_handler/save_user_state", "poster": null, "saveStateEnabled": false, "completionPercentage": 0.95, "captionDataDir": null, "lmsRootURL": "https://openlearninglibrary.mit.edu", "sources": [], "ytApiUrl": "https://www.youtube.com/iframe_api", "ytMetadataEndpoint": "", "transcriptLanguage": "en", "transcriptLanguages": {"en": "English"}, "prioritizeHls": false, "autohideHtml5": false, "recordedYoutubeIsAvailable": true, "generalSpeed": 1.0, "autoAdvance": false, "ytTestTimeout": 1500, "streams": "1.00:0Y4jQJwVW1M", "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03j/handler/publish_completion", "showCaptions": "true", "completionEnabled": false, "savedVideoPosition": 0.0, "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03j/handler/transcript/translation/__lang__", "start": 0.0, "duration": 0.0}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L03j"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@gradient_descent_multiple_dimensions_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Multiple dimensions</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_multiple_dimensions">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_multiple_dimensions" data-block-type="html" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
The extension to the case of multi-dimensional [mathjaxinline]\Theta[/mathjaxinline] is straightforward. Let's assume [mathjaxinline]\Theta \in \mathbb {R}^ m[/mathjaxinline], so [mathjaxinline]J: \mathbb {R}^ m \rightarrow \mathbb {R}[/mathjaxinline]. The gradient of [mathjaxinline]J[/mathjaxinline] with respect to [mathjaxinline]\Theta[/mathjaxinline] is </p><table id="a0000000018" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\nabla _\Theta J = \begin{bmatrix} \partial J / \partial \Theta _1 \\ \vdots \\ \partial J / \partial \Theta _ m \end{bmatrix}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
The algorithm remains the same, except that the update step in line 5 becomes </p><table id="a0000000019" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\Theta ^{(t)} = \Theta ^{(t-1)} - \eta \nabla _\Theta J[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
and we have to change the termination criterion. The easiest thing is to replace the test in line 6 with [mathjaxinline]\left|f(\Theta ^{(t)}) - f(\Theta ^{(t-1)}) \right| < \epsilon[/mathjaxinline], which is sensible no matter the dimensionality of [mathjaxinline]\Theta[/mathjaxinline]. </p><p>
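As a concrete illustration of the multi-dimensional update and this termination test, here is a minimal numpy sketch (a generic implementation, not the course's code); the quadratic objective at the bottom is just an illustrative choice so the example runs end to end. </p><pre>
import numpy as np

def gd(f, grad_f, Theta_init, eta, epsilon, max_iter=10000):
    """Gradient descent for Theta in R^m, stored as an m x 1 array.

    Stops when |f(Theta_t) - f(Theta_{t-1})| < epsilon, the criterion
    suggested above; max_iter is an added safety cap.
    """
    Theta = Theta_init
    for t in range(max_iter):
        Theta_new = Theta - eta * grad_f(Theta)      # update step
        if abs(f(Theta_new) - f(Theta)) < epsilon:   # termination test
            return Theta_new
        Theta = Theta_new
    return Theta

# Illustrative objective f(Theta) = ||Theta - c||^2, minimized at Theta = c.
c = np.array([[1.0], [3.0]])
f = lambda Th: float(np.sum((Th - c) ** 2))
grad_f = lambda Th: 2 * (Th - c)                     # the m x 1 gradient
Theta_hat = gd(f, grad_f, Theta_init=np.zeros((2, 1)), eta=0.1, epsilon=1e-10)
print(Theta_hat.ravel())                             # approximately [1. 3.]
</pre><p>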
<br/></p><p>
<br/></p><p><a href="/assets/courseware/v1/d81d9ec0bd142738b069ce601382fdb7/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Gradient_Descent.pdf" target="_blank">Download this chapter as a PDF file</a></p><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Friday May 24, 2019; 02:28:36 PM (revision 4f166135)</center></span></span>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L03k_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Lecture: Gradient descent optimization - parameters and demo</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03k">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03k" data-block-type="video" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: Gradient descent optimization - parameters and demo</h3>
<div
id="video_MIT6036L03k"
class="video closed"
data-metadata='{"end": 0.0, "speed": null, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03k/handler/transcript/available_translations", "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03k/handler/xmodule_handler/save_user_state", "poster": null, "saveStateEnabled": false, "completionPercentage": 0.95, "captionDataDir": null, "lmsRootURL": "https://openlearninglibrary.mit.edu", "sources": [], "ytApiUrl": "https://www.youtube.com/iframe_api", "ytMetadataEndpoint": "", "transcriptLanguage": "en", "transcriptLanguages": {"en": "English"}, "prioritizeHls": false, "autohideHtml5": false, "recordedYoutubeIsAvailable": true, "generalSpeed": 1.0, "autoAdvance": false, "ytTestTimeout": 1500, "streams": "1.00:nD90J-FuoEw", "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03k/handler/publish_completion", "showCaptions": "true", "completionEnabled": false, "savedVideoPosition": 0.0, "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L03k/handler/transcript/translation/__lang__", "start": 0.0, "duration": 0.0}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L03k"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@gradient_descent_application_to_logistic_regression_objective_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Application to logistic regression objective</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_application_to_logistic_regression_objective">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_application_to_logistic_regression_objective" data-block-type="html" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
We can now solve the optimization problem for our linear logistic classifier as formulated in the previous chapter. We begin by stating the objective and the gradients necessary for doing gradient descent. Since we are considering linear separators, the parameters consist of the vector [mathjaxinline]\theta[/mathjaxinline] and the scalar [mathjaxinline]\theta _0[/mathjaxinline], so we will have to adjust them both and compute gradients of [mathjaxinline]J[/mathjaxinline] with respect to each of them. The objective and gradients (note that we have replaced the constant [mathjaxinline]\lambda[/mathjaxinline] with [mathjaxinline]\frac{\lambda }{2}[/mathjaxinline] for convenience) are, <span options="" class="marginote"><span class="marginote_desc" style="display:none">The following step requires passing familiarity with matrix derivatives. A foolproof way of computing them is to compute the partial derivative of [mathjaxinline]J[/mathjaxinline] with respect to each component [mathjaxinline]\theta _ i[/mathjaxinline] of [mathjaxinline]\theta[/mathjaxinline].</span><span>note</span></span> letting [mathjaxinline]g^{(i)} = \sigma (\theta ^ Tx^{(i)}+\theta _0)[/mathjaxinline], </p><table id="a0000000019" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000020"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle J_\text {lr}(\theta ,\theta _0)[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{1}{n}\sum _{i=1}^ n \mathcal{L}_\text {nll} (g^{(i)}, y^{(i)}) + \frac{\lambda }{2}\left\lVert \theta \right\rVert ^2[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000021"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \nabla _\theta J_\text {lr}(\theta , \theta _0)[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{1}{n}\sum _{i=1}^ n \left(g^{(i)} - y^{(i)}\right) x^{(i)} + \lambda \theta[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000022"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \frac{\partial J_\text {lr}(\theta , \theta _0)}{\partial \theta _0}[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = \frac{1}{n}\sum _{i=1}^ n \left(g^{(i)} - y^{(i)} \right) \; \; .[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
Note that [mathjaxinline]\nabla _\theta J[/mathjaxinline] will be of shape [mathjaxinline]d \times 1[/mathjaxinline] and [mathjaxinline]\frac{\partial J}{\partial \theta _0}[/mathjaxinline] will be a scalar since we have separated [mathjaxinline]\theta _0[/mathjaxinline] from [mathjaxinline]\theta[/mathjaxinline] here. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Convince yourself that the dimensions of all these quantities are correct, under the assumption that [mathjaxinline]\theta[/mathjaxinline] is [mathjaxinline]d \times 1[/mathjaxinline]. How does [mathjaxinline]d[/mathjaxinline] relate to [mathjaxinline]m[/mathjaxinline] as discussed for [mathjaxinline]\Theta[/mathjaxinline] in the previous section? </span> <br/></p><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Compute [mathjaxinline]\nabla _\theta \left\lVert \theta \right\rVert ^2[/mathjaxinline] by finding the vector of partial derivatives [mathjaxinline](\partial \left\lVert \theta \right\rVert ^2 / \partial \theta _1, \ldots , \partial \left\lVert \theta \right\rVert ^2 / \partial \theta _ d)[/mathjaxinline]. What is the shape of [mathjaxinline]\nabla _\theta \left\lVert \theta \right\rVert ^2[/mathjaxinline]? </span> <br/></p><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Compute [mathjaxinline]\nabla _\theta \mathcal{L}_\text {nll}( \sigma (\theta ^ T x + \theta _0), y)[/mathjaxinline] by finding the vector of partial derivatives [mathjaxinline](\partial \mathcal{L}_\text {nll}( \sigma (\theta ^ T x + \theta _0), y)/ \partial \theta _1, \ldots , \partial \mathcal{L}_\text {nll}( \sigma (\theta ^ T x + \theta _0), y) / \partial \theta _ d)[/mathjaxinline]. </span> <br/></p><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Use these last two results to verify our derivation above.</span> <br/></p><p>
Putting everything together, our gradient descent algorithm for logistic regression becomes </p><p><img src="/assets/courseware/v1/fba48158c93cbfd371e6bafb9ba64e9c/asset-v1:MITx+6.036+1T2019+type@asset+block/images_gradient_descent_application_to_logistic_regression_objective_codebox_1-crop.png" width="829"/></p><p>
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Is it okay that [mathjaxinline]\lambda[/mathjaxinline] doesn't appear in line 7?</span> <br/></p><p>
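For concreteness, here is a minimal numpy sketch of this algorithm (not the course's official implementation). The data layout is an assumption made for the sketch: [mathjaxinline]X[/mathjaxinline] is stored as a [mathjaxinline]d \times n[/mathjaxinline] array with one data point per column, and [mathjaxinline]y[/mathjaxinline] is a [mathjaxinline]1 \times n[/mathjaxinline] array of labels in [mathjaxinline]\{ 0, 1\}[/mathjaxinline], consistent with the gradient [mathjaxinline]\left(g^{(i)} - y^{(i)}\right) x^{(i)}[/mathjaxinline] above. </p><pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_objective_and_gradients(theta, theta_0, X, y, lam):
    """J_lr and its gradients, following the formulas above.

    Assumed layout for this sketch: X is d x n (one point per column),
    y is 1 x n with labels in {0, 1}, theta is d x 1, theta_0 is a scalar,
    and lam is the regularization constant lambda.
    """
    n = X.shape[1]
    g = sigmoid(theta.T @ X + theta_0)                    # 1 x n: g^(i) for each point
    nll = -(y * np.log(g) + (1 - y) * np.log(1 - g))      # 1 x n: L_nll(g^(i), y^(i))
    J = np.sum(nll) / n + (lam / 2) * np.sum(theta ** 2)
    grad_theta = (X @ (g - y).T) / n + lam * theta        # d x 1
    grad_theta_0 = np.sum(g - y) / n                      # scalar, matching the formula above
    return J, grad_theta, grad_theta_0

def lr_gradient_descent(X, y, lam, eta, epsilon, max_iter=100000):
    """Plain gradient descent on J_lr; max_iter is an added safety cap."""
    d = X.shape[0]
    theta, theta_0 = np.zeros((d, 1)), 0.0
    J_prev = np.inf
    for t in range(max_iter):
        J, g_th, g_th0 = lr_objective_and_gradients(theta, theta_0, X, y, lam)
        if abs(J_prev - J) < epsilon:                     # termination test
            break
        theta, theta_0, J_prev = theta - eta * g_th, theta_0 - eta * g_th0, J
    return theta, theta_0

# Tiny synthetic usage example (hypothetical data, d = 1, n = 4):
X = np.array([[1.0, 2.0, -1.0, -2.0]])
y = np.array([[1, 1, 0, 0]])
theta, theta_0 = lr_gradient_descent(X, y, lam=0.01, eta=0.5, epsilon=1e-9)
</pre><p>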
<br/></p><p>
<br/></p><p><a href="/assets/courseware/v1/d81d9ec0bd142738b069ce601382fdb7/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Gradient_Descent.pdf" target="_blank">Download this chapter as a PDF file</a></p><script src="/assets/courseware/v1/1ab2c06aefab58693cfc9c10394b7503/asset-v1:MITx+6.036+1T2019+type@asset+block/marginotes.js" type="text/javascript"/><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Saturday November 16, 2019; 07:31:08 PM (revision f808f068e)</center></span></span>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@gradient_descent_stochastic_gradient_descent_vert" data-block-type="vertical" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="VerticalStudentView" data-runtime-class="LmsRuntime" data-runtime-version="1">
<h2 class="hd hd-2 unit-title">Stochastic Gradient Descent</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_stochastic_gradient_descent">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@gradient_descent_stochastic_gradient_descent" data-block-type="html" data-graded="False" data-course-id="course-v1:MITx+6.036+1T2019" data-has-score="False" data-request-token="5b2bdabc852511ef8f56023e4242adeb" data-init="XBlockToXModuleShim" data-runtime-class="LmsRuntime" data-runtime-version="1">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
When the form of the gradient is a sum, rather than take one big(ish) step in the direction of the gradient, we can, instead, <span options="" class="marginote"><span class="marginote_desc" style="display:none">The word “stochastic” means probabilistic, or random; so does “aleatoric,” which is a very cool word. Look up aleatoric music, sometime.</span><span>randomly </span></span> select one term of the sum, and take a very small step in that direction. This seems sort of crazy, but remember that all the little steps would average out to the same direction as the big step if you were to stay in one place. Of course, you're not staying in that place, so you move, in expectation, in the direction of the gradient. </p><p>
Most objective functions in machine learning can end up being written as a sum over data points, in which case, stochastic gradient descent (<i class="sc">sgd</i>) is implemented by picking a data point randomly out of the data set, computing the gradient as if there were only that one point in the data set, and taking a small step in the direction of the negative gradient. </p><p>
Let's assume our objective has the form </p><table id="a0000000026" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]f(\Theta ) = \sum _{i = 1}^ n f_ i(\Theta )\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
Here is pseudocode for applying <i class="sc">sgd</i> to an objective [mathjaxinline]f[/mathjaxinline]; it assumes we know the form of [mathjaxinline]\nabla _\Theta f_ i[/mathjaxinline] for all [mathjaxinline]i[/mathjaxinline] in [mathjaxinline]1\ldots n[/mathjaxinline]: </p><p><img src="/assets/courseware/v1/b53534584e1f90438fca9eb88f628f29/asset-v1:MITx+6.036+1T2019+type@asset+block/images_gradient_descent_stochastic_gradient_descent_codebox_1-crop.png" width="648"/></p><p>
Note that now, instead of a fixed value of [mathjaxinline]\eta[/mathjaxinline], the step size is indexed by the iteration of the algorithm, [mathjaxinline]t[/mathjaxinline]. For <i class="sc">sgd</i> to converge to a local optimum as [mathjaxinline]t[/mathjaxinline] increases, the step size has to decrease as a function of time. </p><p><b class="bf">Theorem:</b><em> If [mathjaxinline]J[/mathjaxinline] is convex, and [mathjaxinline]\eta (t)[/mathjaxinline] is a sequence satisfying <table id="a0000000027" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\sum _{t = 1}^{\infty }\eta (t) = \infty \; \; \text {and}\; \; \sum _{t = 1}^{\infty }\eta (t)^2 < \infty \; \; ,[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table> then <i class="sc">sgd</i> converges <span options="" class="marginote"><span class="marginote_desc" style="display:none">We have left out some gnarly conditions in this theorem, and note that <em>almost surely</em> is a technical term from probability theory, not a lack of courage on the part of the author.</span><span><em>almost surely</em></span></span> to the optimal [mathjaxinline]\Theta[/mathjaxinline]. <span class="rm"/></em></p><p>
One “legal” way of setting the step size is to make [mathjaxinline]\eta (t) = 1/t[/mathjaxinline], but people often use rules that decrease more slowly, and so don't strictly satisfy the criteria for convergence. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> If you start a long way from the optimum, would making [mathjaxinline]\eta (t)[/mathjaxinline] decrease more slowly tend to make you move more quickly or more slowly to the optimum? </span> <br/></p><p>
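Here is a minimal Python sketch of this procedure (the interface and the one-dimensional objective at the bottom are illustrative choices, not the course's code), using the [mathjaxinline]\eta (t) = 1/t[/mathjaxinline] schedule just mentioned. </p><pre>
import numpy as np

def sgd(grad_fi, n, Theta_init, eta, T):
    """Sketch of SGD for f(Theta) = sum_i f_i(Theta).

    grad_fi(i, Theta) returns the gradient of f_i at Theta (an interface
    assumed for this sketch); n is the number of terms; eta(t) gives the
    step size at iteration t; T is the number of iterations to run.
    """
    Theta = Theta_init
    for t in range(1, T + 1):
        i = np.random.randint(n)                  # pick one term uniformly at random
        Theta = Theta - eta(t) * grad_fi(i, Theta)
    return Theta

# Illustrative 1-D objective: f_i(Theta) = (Theta - c_i)^2, so the sum is
# minimized at the mean of the c_i.
c = np.array([1.0, 2.0, 3.0, 6.0])
grad_fi = lambda i, Th: 2 * (Th - c[i])
Theta_hat = sgd(grad_fi, n=len(c), Theta_init=0.0, eta=lambda t: 1.0 / t, T=10000)
print(Theta_hat)   # hovers near the minimizer, the mean 3.0
</pre><p>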
There are two intuitions for why <i class="sc">sgd</i> might be a better choice algorithmically than regular <i class="sc">gd</i> (which is sometimes called <em>batch</em> <i class="sc">gd</i> (<i class="sc">bgd</i>)): </p><ul class="itemize"><li><p>
If your [mathjaxinline]f[/mathjaxinline] is actually non-convex, but has many shallow local optima that might trap <i class="sc">bgd</i>, then taking <em>samples</em> from the gradient at some point [mathjaxinline]\Theta[/mathjaxinline] might “bounce” you around the landscape and out of the local optima. </p></li><li><p>
Sometimes, optimizing [mathjaxinline]f[/mathjaxinline] really well is not what we want to do, because it might overfit the training set; so, in fact, although <i class="sc">sgd</i> might not get lower training error than <i class="sc">bgd</i>, it might result in lower test error. </p></li></ul><p>
<br/></p><p>
<br/></p><p><a href="/assets/courseware/v1/d81d9ec0bd142738b069ce601382fdb7/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Gradient_Descent.pdf" target="_blank">Download this chapter as a PDF file</a></p><script src="/assets/courseware/v1/1ab2c06aefab58693cfc9c10394b7503/asset-v1:MITx+6.036+1T2019+type@asset+block/marginotes.js" type="text/javascript"/><span><br/><span style="color:gray;font-size:10pt"><center>This page was last updated on Friday May 24, 2019; 02:28:36 PM (revision 4f166135)</center></span></span>
</div>
</div>
</div>
</div>
© All Rights Reserved