<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@convolutional_neural_networks_notes" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Notes – Chapter 9: Convolutional Neural Networks</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_notes_top">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-block-type="html" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_notes_top" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
You can sequence through the Convolutional Neural Networks lecture video and note segments (go to Next page). </p><p><a href="/assets/courseware/v1/cda92ed2c6672271916e8cb8974af568/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_conv_nets_slides.pdf" target="_blank">F19 Lecture Slides</a> are also available. </p><p>
You can also (or alternatively) download the <a href="/assets/courseware/v1/41c7c4a6141b76b324055d56387570c0/asset-v1:MITx+6.036+1T2019+type@asset+block/notes_chapter_Convolutional_Neural_Networks.pdf" target="_blank">Chapter 9: Convolutional Neural Networks</a> notes as a PDF file. </p>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07a_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - convolutional neural networks - intro</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07a">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07a" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - convolutional neural networks - intro</h3>
<div
id="video_MIT6036L07a"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07a/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07a/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:hm7lfUg3obs", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07a/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07a/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07a"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@convolutional_neural_networks_top_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Introduction to CNNs</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_top">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-block-type="html" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_top" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
So far, we have studied what are called <em>fully connected</em> neural networks, in which all of the units at one layer are connected to all of the units in the next layer. This is a good arrangement when we don't know anything about what kind of mapping from inputs to outputs we will be asking the network to learn to approximate. But if we <em>do</em> know something about our problem, it is better to build it into the structure of our neural network. Doing so can save computation time and significantly diminish the amount of training data required to arrive at a solution that generalizes robustly. </p><p>
One very important application domain of neural networks, where the methods have achieved an enormous amount of success in recent years, is signal processing. Signals might be spatial (in two-dimensional camera images or three-dimensional depth or CAT scans) or temporal (speech or music). If we know that we are addressing a signal-processing problem, we can take advantage of <em>invariant</em> properties of that problem. In this chapter, we will focus on two-dimensional spatial problems (images) but use one-dimensional ones as a simple example. Later, we will address temporal problems. </p><p>
Imagine that you are given the problem of designing and training a neural network that takes an image as input, and outputs a classification, which is positive if the image contains a cat and negative if it does not. An image is described as a two-dimensional array of <em>pixels</em> (a “pixel" is a “picture element"), each of which may be represented by three integer values, encoding intensity levels in red, green, and blue color channels. </p><p>
There are two important pieces of prior structural knowledge we can bring to bear on this problem: </p><ul class="itemize"><li><p><b class="bf">Spatial locality:</b> The set of pixels we will have to take into consideration to find a cat will be near one another in the image. (So, for example, we won't have to consider some combination of pixels in the four corners of the image in order to see if they encode cat-ness.) </p></li><li><p><b class="bf">Translation invariance:</b> The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs. (Cats don't look different if they're on the left or the right side of the image.) </p></li></ul><p>
We will design neural network structures that take advantage of these properties. </p>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07b_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - one-dimensional filters</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07b">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07b" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - one-dimensional filters</h3>
<div
id="video_MIT6036L07b"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07b/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07b/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:oqMwT1P7u3Y", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07b/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07b/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07b"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07c_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - two-dimensional filters</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07c">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07c" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - two-dimensional filters</h3>
<div
id="video_MIT6036L07c"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07c/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07c/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:I0RZ7jF9_H4", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07c/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07c/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07c"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07d_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - a specific illustrative example filter</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07d">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07d" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - a specific illustrative example filter</h3>
<div
id="video_MIT6036L07d"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07d/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07d/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:RRRCff0w-84", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07d/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07d/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07d"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07e_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - convolutional neural network layers</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07e">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07e" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - convolutional neural network layers</h3>
<div
id="video_MIT6036L07e"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07e/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07e/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:b73F9gJ4JjE", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07e/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07e/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07e"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@convolutional_neural_networks_filters_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Filters</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_filters">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-block-type="html" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_filters" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
We begin by discussing <em>image filters</em>. An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data. Unfortunately, in AI/ML/CS/math the word “filter" gets used in many ways: in addition to the one we describe here, it can describe a temporal process (our moving averages are, in fact, a kind of filter) and even a somewhat esoteric algebraic structure. </p><p>
Let's consider a very simple case to start, in which we have a 1-dimensional binary “image" and a filter [mathjaxinline]F[/mathjaxinline] of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image. </p><p>
Let [mathjaxinline]X[/mathjaxinline] be the original image, of size [mathjaxinline]d[/mathjaxinline]; then pixel [mathjaxinline]i[/mathjaxinline] of the output image is specified by </p><table id="a0000000002" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]Y_ i = F \cdot (X_{i-1}, X_ i)\; \; .[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><p>
To ensure that the output image is also of dimension [mathjaxinline]d[/mathjaxinline], we will generally “pad" the input image with 0 values if we need to access pixels that are beyond the bounds of the input image. This process of applying the filter to the image to create a new image is called “convolution." (Filters are also sometimes called <em>convolutional kernels</em>.) </p><p>
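Concretely, here is a minimal NumPy sketch of this sliding dot product (our own illustration, not the course's code); it pads a single 0 on the left so that the output also has [mathjaxinline]d[/mathjaxinline] pixels: </p><pre><code>import numpy as np

def filter_size2(X, F):
    """Compute Y[i] = F . (X[i-1], X[i]) for every pixel i, zero-padding
    on the left so Y has the same length d as the input X."""
    Xp = np.concatenate(([0.0], X))                        # pad with one 0 value
    return np.array([F @ Xp[i:i + 2] for i in range(len(X))])

X = np.array([0., 0., 1., 1., 1., 0., 1., 0., 0., 0.])    # a 1D binary image
F = np.array([-1., 1.])                                    # a filter of size two
print(filter_size2(X, F))    # [ 0.  0.  1.  0.  0. -1.  1. -1.  0.  0.]
</code></pre><p>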
If you are already familiar with what a convolution is, you might notice that this definition corresponds to what is often called a correlation and not to a convolution. Indeed, correlation and convolution refer to different operations in signal processing. However, in the neural networks literature, most libraries implement the correlation (as described in this chapter) but call it convolution. The distinction is not significant; in principle, if convolution is required to solve the problem, the network could learn the necessary weights. For a discussion of the difference between convolution and correlation and the conventions used in the literature you can read section 9.1 in this excellent book: <tt class="tt">https://www.deeplearningbook.org</tt>. </p><p>
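As a quick check of this terminology point, NumPy exposes both operations, and they differ exactly by a flip of the filter (a small sketch of ours): </p><pre><code>import numpy as np

X = np.array([0., 0., 1., 1., 1., 0., 1., 0.])
F = np.array([-1., 1.])

corr = np.correlate(X, F, mode="same")       # slides F as-is: what CNN libraries compute
conv = np.convolve(X, F, mode="same")        # flips F before sliding
flip = np.convolve(X, F[::-1], mode="same")  # flipping twice recovers the correlation

print(np.array_equal(corr, flip))            # True
</code></pre><p>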
Here is a concrete example. Let the filter [mathjaxinline]F_1 = (-1, +1)[/mathjaxinline]. Then given the first image below, we can convolve it with filter [mathjaxinline]F_1[/mathjaxinline] to obtain the second image. You can think of this filter as a detector for “left edges" in the original image—to see this, look at the places where there is a [mathjaxinline]1[/mathjaxinline] in the output image, and see what pattern exists at that position in the input image. Another interesting filter is [mathjaxinline]F_2 = (-1, +1, -1)[/mathjaxinline]. The third image below shows the result of convolving the first image with [mathjaxinline]F_2[/mathjaxinline]. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Convince yourself that filter [mathjaxinline]F_2[/mathjaxinline] can be understood as a detector for isolated positive pixels in the binary image.</span> <br/></p><p><img src="/assets/courseware/v1/4cf78c92e9865f39c6770c4db2052977/asset-v1:MITx+6.036+1T2019+type@asset+block/images_convolutional_neural_networks_filters_tikzpicture_1-crop.png" width="730"/></p><p>
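Since the images above appear only in the figure, here is a small sketch (with a made-up binary image of our own) that applies [mathjaxinline]F_2[/mathjaxinline] with <tt class="tt">np.correlate</tt>; for an odd-length filter, mode "same" centers the window on each pixel and zero-pads the ends: </p><pre><code>import numpy as np

X  = np.array([0., 1., 0., 0., 1., 1., 1., 0., 1., 0.])   # made-up binary image
F2 = np.array([-1., 1., -1.])

# With this centered filter, Y[i] = -X[i-1] + X[i] - X[i+1].
Y = np.correlate(X, F2, mode="same")
print(Y)                       # [-1.  1. -1. -1.  0. -1.  0. -2.  1. -1.]
print(np.flatnonzero(Y == 1))  # [1 8] -- exactly the isolated positive pixels
</code></pre><p>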
Two-dimensional versions of filters like these are thought to be found in the visual cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural images. Computer vision people used to spend a lot of time hand-designing <em>filter banks</em>. A filter bank is a set of sets of filters, arranged as shown in the diagram below. </p><p><img src="/assets/courseware/v1/85b9e784b8b8c2c326dcf6a8dffceb09/asset-v1:MITx+6.036+1T2019+type@asset+block/images_convolutional_neural_networks_filters_tikzpicture_2-crop.png" width="903"/></p><p>
All of the filters in the first group are applied to the original image; if there are [mathjaxinline]k[/mathjaxinline] such filters, then the result is [mathjaxinline]k[/mathjaxinline] new images, which are called <em>channels</em>. Now imagine stacking all these new images up so that we have a cube of data, indexed by the original row and column indices of the image, as well as by the channel. The next set of filters in the filter bank will generally be <em>three-dimensional</em>: each one will be applied to a sub-range of the row and column indices of the image and to all of the channels. </p><p>
These 3D chunks of data are called <em>tensors</em>. (We will use a popular piece of neural-network software called <em>Tensorflow</em> because it makes operations on tensors easy.) The algebra of tensors is fun, and a lot like matrix algebra, but we won't go into it in any detail. </p><p>
Here is a more complex example of two-dimensional filtering. We have two [mathjaxinline]3 \times 3[/mathjaxinline] filters in the first layer, [mathjaxinline]f_1[/mathjaxinline] and [mathjaxinline]f_2[/mathjaxinline]. You can think of each one as “looking" for three pixels in a row, [mathjaxinline]f_1[/mathjaxinline] vertically and [mathjaxinline]f_2[/mathjaxinline] horizontally. Assuming our input image is [mathjaxinline]n \times n[/mathjaxinline], the result of filtering with these two filters is an [mathjaxinline]n \times n \times 2[/mathjaxinline] tensor. Now we apply a tensor filter (hard to draw!) that “looks for" a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final [mathjaxinline]n \times n[/mathjaxinline] image. (When we have a color image as input, we treat it as having 3 channels, and hence as an [mathjaxinline]n \times n \times 3[/mathjaxinline] tensor.) </p><p><img src="/assets/courseware/v1/2f5fb3487c0b8f0ddae9e992bd830bfa/asset-v1:MITx+6.036+1T2019+type@asset+block/images_convolutional_neural_networks_filters_tikzpicture_3-crop.png" width="889"/></p><p>
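Here is a NumPy sketch of this two-layer computation (ours; the image and filters are invented for illustration). The helper implements the stride-1, zero-padded correlation of an [mathjaxinline]n \times n \times c[/mathjaxinline] tensor with a [mathjaxinline]k \times k \times c[/mathjaxinline] filter: </p><pre><code>import numpy as np

def correlate(T, filt):
    """Stride-1, zero-padded correlation of an (n, n, c) tensor with a
    (k, k, c) filter, producing an (n, n) image."""
    n, _, c = T.shape
    k = filt.shape[0]
    p = k // 2
    padded = np.zeros((n + 2 * p, n + 2 * p, c))
    padded[p:p + n, p:p + n, :] = T
    return np.array([[np.sum(padded[i:i + k, j:j + k, :] * filt)
                      for j in range(n)] for i in range(n)])

n = 8
X = np.zeros((n, n, 1))                      # single-channel input image
X[2:5, 3, 0] = 1.0                           # a vertical bar of three pixels
X[6, 1:4, 0] = 1.0                           # a horizontal bar of three pixels

f1 = np.zeros((3, 3, 1)); f1[:, 1, 0] = 1.0  # responds to vertical bars
f2 = np.zeros((3, 3, 1)); f2[1, :, 0] = 1.0  # responds to horizontal bars

layer1 = np.stack([correlate(X, f1), correlate(X, f2)], axis=-1)
print(layer1.shape)                          # (8, 8, 2): an n x n x 2 tensor

g = np.ones((3, 3, 2))                       # a tensor filter spanning both channels
layer2 = correlate(layer1, g)
print(layer2.shape)                          # (8, 8): a single final image
</code></pre><p>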
We are going to design neural networks that have this structure. Each “bank" of the filter bank will correspond to a neural-network layer. The numbers in the individual filters will be the “weights" of the network, which we will train using gradient descent. What makes this interesting and powerful (and somewhat confusing at first) is that the same weights are used many many times in the computation of each layer. This <em>weight sharing</em> means that we can express a transformation on a large image with relatively few parameters; it also means we'll have to take care in figuring out exactly how to train it! </p><p>
We will define a filter layer [mathjaxinline]l[/mathjaxinline] formally in terms of the following. (For simplicity, we are assuming that all images and filters are square, having the same number of rows and columns; that is in no way necessary, but is usually fine and definitely simplifies our notation.) </p><ul class="itemize"><li><p><em>number</em> of filters [mathjaxinline]m^ l[/mathjaxinline]; </p></li><li><p><em>size</em> of filters [mathjaxinline]k^ l \times k^ l \times m^{l-1}[/mathjaxinline]; </p></li><li><p><em>stride</em> [mathjaxinline]s^ l[/mathjaxinline] is the spacing at which we apply the filter to the image; in all of our examples so far, we have used a stride of 1, but if we were to “skip" and apply the filter only at odd-numbered indices of the image, then it would have a stride of two (and produce a resulting image of half the size); </p></li><li><p><em>input tensor size</em> [mathjaxinline]n^{l-1} \times n^{l-1} \times m^{l-1}[/mathjaxinline]. </p></li></ul><p>
This layer will produce an output tensor of size [mathjaxinline]n^ l \times n^ l \times m^ l[/mathjaxinline], where [mathjaxinline]n^ l = \lfloor n^{l-1} / s^ l \rfloor[/mathjaxinline]. The weights are the values defining the filter: there will be [mathjaxinline]m^ l[/mathjaxinline] different [mathjaxinline]k^ l \times k^ l \times m^{l-1}[/mathjaxinline] tensors of weight values. </p><p>
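This bookkeeping is easy to get wrong, so here is a small helper (ours) that applies the sizing rules above to a concrete, made-up layer: </p><pre><code>import math

def filter_layer(n_prev, m_prev, m, k, s):
    """Output size and weight count of a filter layer, following the
    conventions above: the output is n x n x m with n = floor(n_prev / s),
    and there are m different k x k x m_prev weight tensors."""
    n = math.floor(n_prev / s)
    return (n, n, m), m * k * k * m_prev

# e.g. a 64 x 64 x 3 input tensor, 10 filters of size 5 x 5 x 3, stride 2:
print(filter_layer(n_prev=64, m_prev=3, m=10, k=5, s=2))   # ((32, 32, 10), 750)
</code></pre><p>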
This may seem complicated, but we get a rich class of mappings that exploit image structure and have many fewer weights than a fully connected layer would. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> How many weights are in a convolutional layer specified as above?</span> <br/> <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF"> If we used a fully-connected layer with the same size inputs and outputs, how many weights would it have?</span> <br/></p>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07f_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - max pooling</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07f">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07f" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - max pooling</h3>
<div
id="video_MIT6036L07f"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07f/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07f/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:9QMhPxvgX7c", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07f/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07f/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07f"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@convolutional_neural_networks_max_pooling_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Max Pooling</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_max_pooling">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-block-type="html" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_max_pooling" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
It is typical (both in engineering and in nature) to structure filter banks into a <em>pyramid</em>, in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges in the early layers, and then look for patterns in those patterns, etc. This means that, effectively, we are looking for patterns in larger pieces of the image as we apply successive filters. Having a stride greater than one makes the images smaller, but does not necessarily aggregate information over that spatial range. </p><p>
Another common layer type, which accomplishes this aggregation, is <em>max pooling</em>. A max pooling layer operates like a filter, but has no weights. <em>You can think of it as a pure functional layer, like a ReLU layer in a fully connected network.</em> It has a filter size, as in a filter layer, but simply returns the maximum value in its field. (We sometimes use the term <em>receptive field</em>, or just <em>field</em>, to mean the area of an input image that a filter is being applied to.) Usually, we apply max pooling with the following traits: </p><ul class="itemize"><li><p>
[mathjaxinline]\text {stride} > 1[/mathjaxinline], so that the resulting image is smaller than the input image; and </p></li><li><p>
[mathjaxinline]k \geq \text {stride}[/mathjaxinline], so that the whole image is covered. </p></li></ul><p>
As a result of applying a max pooling layer, we don't keep track of the precise location of a pattern. This helps our filters to learn to recognize patterns independent of their location. </p><p>
Consider a max pooling layer of [mathjaxinline]\text {stride} = k = 2[/mathjaxinline]. This would map a [mathjaxinline]64 \times 64 \times 3[/mathjaxinline] image to a [mathjaxinline]32 \times 32 \times 3[/mathjaxinline] image. <br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">Maximilian Poole thinks it would be a good idea to add two max pooling layers of size [mathjaxinline]k[/mathjaxinline], one right after the other, to their network. What single layer would be equivalent?</span> <br/></p>
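<p>Here is a minimal NumPy sketch (ours) of such a layer, with stride equal to the field size [mathjaxinline]k[/mathjaxinline] and pooling applied independently to each channel: </p><pre><code>import numpy as np

def max_pool(T, k):
    """Max pooling with field size k and stride k on an (n, n, m) tensor,
    assuming k divides n."""
    n, _, m = T.shape
    # Split each spatial axis into n//k blocks of size k, then take the
    # maximum over each k x k block, independently for each channel.
    blocks = T.reshape(n // k, k, n // k, k, m)
    return blocks.max(axis=(1, 3))

T = np.random.rand(64, 64, 3)
print(max_pool(T, 2).shape)   # (32, 32, 3)
</code></pre>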
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07g_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - typical architecture</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07g">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07g" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - typical architecture</h3>
<div
id="video_MIT6036L07g"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07g/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07g/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:9l7EiBobcTs", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07g/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07g/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07g"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@MIT6036L07h_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Lecture: CNNs - backprop and gradient descent</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07h">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-block-type="video" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07h" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Lecture: CNNs - backprop and gradient descent</h3>
<div
id="video_MIT6036L07h"
class="video closed"
data-metadata='{"saveStateEnabled": false, "autoplay": false, "publishCompletionUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07h/handler/publish_completion", "start": 0.0, "prioritizeHls": false, "saveStateUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07h/handler/xmodule_handler/save_user_state", "recordedYoutubeIsAvailable": true, "streams": "1.00:wjsRjggZcu0", "autoAdvance": false, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07h/handler/transcript/available_translations", "captionDataDir": null, "ytMetadataEndpoint": "", "showCaptions": "true", "lmsRootURL": "https://openlearninglibrary.mit.edu", "transcriptTranslationUrl": "/courses/course-v1:MITx+6.036+1T2019/xblock/block-v1:MITx+6.036+1T2019+type@video+block@MIT6036L07h/handler/transcript/translation/__lang__", "ytApiUrl": "https://www.youtube.com/iframe_api", "transcriptLanguages": {"en": "English"}, "speed": null, "autohideHtml5": false, "generalSpeed": 1.0, "transcriptLanguage": "en", "savedVideoPosition": 0.0, "poster": null, "sources": [], "duration": 0.0, "end": 0.0, "completionEnabled": false, "completionPercentage": 0.95, "ytTestTimeout": 1500}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="MIT6036L07h"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-block-type="vertical" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="VerticalStudentView" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@vertical+block@convolutional_neural_networks_typical_architecture_vert" data-graded="False" data-runtime-class="LmsRuntime">
<h2 class="hd hd-2 unit-title">Typical architecture</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_typical_architecture">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-block-type="html" data-has-score="False" data-runtime-version="1" data-course-id="course-v1:MITx+6.036+1T2019" data-init="XBlockToXModuleShim" data-request-token="35256e8e03ea11f099b702fa081815af" data-usage-id="block-v1:MITx+6.036+1T2019+type@html+block@convolutional_neural_networks_typical_architecture" data-graded="False" data-runtime-class="LmsRuntime">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>
Here is the form of a typical convolutional network: </p><div id="a0000000003" class="figure"><p><img src="/assets/courseware/v1/5dbb96f16b91ac0639a42e2dfbd0d901/asset-v1:MITx+6.036+1T2019+type@asset+block/images_cnn.jpg" width="400"/></p><div class="caption"><b>Figure 1</b>: A typical convolutional network. Source: <tt class="tt">https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html</tt></div></div><p>
After each filter layer there is generally a ReLU layer; there may be multiple filter/ReLU layers, then a max pooling layer, then some more filter/ReLU layers, then max pooling. Once the output is down to a relatively small size, there is typically a last fully-connected layer, leading into an activation function such as softmax that produces the final output. The exact design of these structures is an art—there is not currently any clear theoretical (or even systematic empirical) understanding of how these various design choices affect overall performance of the network. </p><p>
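To make the shape bookkeeping concrete, here is a sketch (ours, with an entirely made-up stack of layers) that traces tensor sizes through such an architecture, using the sizing rules from the Filters and Max Pooling sections: </p><pre><code>import math

def conv(shape, m, k, s=1):
    """A filter/ReLU layer: with zero padding, the filter size k does not
    change the image size, and ReLU preserves the shape."""
    n = math.floor(shape[0] / s)
    return (n, n, m)

def pool(shape, k):
    """Max pooling with stride = field size = k."""
    n, _, m = shape
    return (n // k, n // k, m)

shape = (64, 64, 3)                       # a 64 x 64 color image
for layer in (lambda t: conv(t, 16, 5),   # filter/ReLU
              lambda t: conv(t, 16, 5),   # filter/ReLU
              lambda t: pool(t, 2),       # max pooling
              lambda t: conv(t, 32, 3),   # filter/ReLU
              lambda t: pool(t, 2)):      # max pooling
    shape = layer(shape)
    print(shape)
# Ends at (16, 16, 32): flatten to 8192 units for the final
# fully-connected layer, then apply softmax.
</code></pre><p>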
The critical point for us is that this is all just a big neural network, which takes an input and computes an output. The mapping is a differentiable function of the weights (well, the derivative is not continuous, both because of the ReLU and the max pooling operations, but we ignore that fact), which means we can adjust the weights to decrease the loss by performing gradient descent, and we can compute the relevant gradients using back-propagation! </p><p>
Let's work through a <em>very</em> simple example of how back-propagation can work on a convolutional network. The architecture is shown below. Assume we have a one-dimensional single-channel image, of size [mathjaxinline]n \times 1 \times 1[/mathjaxinline] and a single [mathjaxinline]k \times 1 \times 1[/mathjaxinline] filter in the first convolutional layer. Then we pass it through a ReLU layer and a fully-connected layer with no additional activation function on the output. </p><center><p><img src="/assets/courseware/v1/4cbbd5d71573f05f2cd89685068aa4b8/asset-v1:MITx+6.036+1T2019+type@asset+block/images_convolutional_neural_networks_typical_architecture_tikzpicture_1-crop.png" width="749"/></p></center><p>
For simplicity assume [mathjaxinline]k[/mathjaxinline] is odd, let the input image [mathjaxinline]X = A^0[/mathjaxinline], and assume we are using squared loss. Then we can describe the forward pass as follows: </p><table id="a0000000004" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000005"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle Z_ i^1[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = {W^1}^ T \cdot A^0_{[i-\lfloor k/2 \rfloor : i + \lfloor k/2 \rfloor ]}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000006"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle A^1[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = ReLU(Z^1)[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000007"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle A^2[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = {W^2}^ T A^1[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr><tr id="a0000000008"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle L(A^2, y)[/mathjaxinline]
</td><td style="vertical-align:middle; text-align:left; border:none">
[mathjaxinline]\displaystyle = (A^2-y)^2[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table><p>
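Here is a direct NumPy transcription of this forward pass (a sketch of ours, not the course's code); it pads the input with [mathjaxinline]\lfloor k/2 \rfloor[/mathjaxinline] zeros at each end so that [mathjaxinline]Z^1[/mathjaxinline] has the same length [mathjaxinline]n[/mathjaxinline] as the image: </p><pre><code>import numpy as np

def forward(X, W1, W2, y):
    """Forward pass: 1D convolution, ReLU, fully-connected layer, squared loss."""
    n, k = len(X), len(W1)
    p = k // 2
    Xp = np.concatenate([np.zeros(p), X, np.zeros(p)])   # zero padding
    Z1 = np.array([W1 @ Xp[i:i + k] for i in range(n)])  # Z1[i]: filter dotted with window i
    A1 = np.maximum(Z1, 0.0)                             # A1 = ReLU(Z1)
    A2 = W2 @ A1                                         # scalar output of the FC layer
    return Z1, A1, A2, (A2 - y) ** 2

X  = np.array([0., 1., 1., 0., 1.])
W1 = np.array([-1., 1., -1.])                            # a k = 3 filter
W2 = np.ones(5) / 5.0
Z1, A1, A2, loss = forward(X, W1, W2, y=1.0)
print(A2, loss)                                          # 0.2 and (0.2 - 1)^2 = 0.64
</code></pre><p>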
<br/> <br/><span style="color:#FF0000"><b class="bf">Study Question:</b></span> <span style="color:#0000FF">For a filter of size [mathjaxinline]k[/mathjaxinline], how much padding do we need to add to the top and bottom of the image?</span> <br/></p><p>
How do we update the weights in filter [mathjaxinline]W^1[/mathjaxinline]? </p><table id="a0000000009" class="equation" width="100%" cellspacing="0" cellpadding="7" style="table-layout:auto"><tr><td class="equation" style="width:80%; border:none">[mathjax]\frac{\partial \text {loss}}{\partial W^1} = \frac{\partial Z^1}{\partial W^1} \cdot \frac{\partial A^1}{\partial Z^1} \cdot \frac{\partial \text {loss}}{\partial A^1}[/mathjax]</td><td class="eqnnum" style="width:20%; border:none"> </td></tr></table><ul class="itemize"><li><p>
[mathjaxinline]\partial Z^1/\partial W^1[/mathjaxinline] is the [mathjaxinline]k \times n[/mathjaxinline] matrix such that [mathjaxinline]\partial Z_ i^1/\partial W_ j^1 = X_{i-\lfloor k/2 \rfloor +j-1}[/mathjaxinline]. So, for example, if [mathjaxinline]i = 10[/mathjaxinline] and [mathjaxinline]k = 5[/mathjaxinline], then column 10 of this matrix, which captures the dependence of pixel 10 of the output image on the weights, will contain the elements [mathjaxinline]X_8, X_9, X_{10}, X_{11}, X_{12}[/mathjaxinline]. </p></li><li><p>
[mathjaxinline]\partial A^1/\partial Z^1[/mathjaxinline] is the [mathjaxinline]n \times n[/mathjaxinline] diagonal matrix such that </p><table id="a0000000010" cellpadding="7" width="100%" cellspacing="0" class="eqnarray" style="table-layout:auto"><tr id="a0000000011"><td style="width:40%; border:none"> </td><td style="vertical-align:middle; text-align:right; border:none">
[mathjaxinline]\displaystyle \partial A_ i^1/\partial Z_ i^1= \begin{cases} 1 & \text {if $Z_ i^1 > 0$} \\ 0 & \text {otherwise} \end{cases}[/mathjaxinline]
</td><td style="width:40%; border:none"> </td><td style="width:20%; border:none" class="eqnnum"> </td></tr></table></li><li><p>
[mathjaxinline]\partial \text {loss}/\partial A^1 = \partial \text {loss}/\partial A^2 \cdot \partial A^2/\partial A^1 = 2(A^2 - y)W^2[/mathjaxinline], an [mathjaxinline]n \times 1[/mathjaxinline] vector </p></li></ul><p>
Multiplying these components yields the desired gradient, of shape [mathjaxinline]k \times 1[/mathjaxinline]. </p>
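<p>Assembling the three factors in code (a sketch of ours, reusing the forward pass above) confirms the shape: </p><pre><code>import numpy as np

def grad_W1(X, W1, W2, y):
    """d loss / d W1, assembled from the three factors derived above."""
    n, k = len(X), len(W1)
    p = k // 2
    Xp = np.concatenate([np.zeros(p), X, np.zeros(p)])
    Z1 = np.array([W1 @ Xp[i:i + k] for i in range(n)])
    A2 = W2 @ np.maximum(Z1, 0.0)

    dZ1_dW1   = np.array([Xp[i:i + k] for i in range(n)]).T  # k x n: column i is window i
    dA1_dZ1   = np.diag((Z1 > 0).astype(float))              # n x n diagonal ReLU mask
    dloss_dA1 = 2 * (A2 - y) * W2                            # length-n vector

    return dZ1_dW1 @ dA1_dZ1 @ dloss_dA1                     # shape (k,)

X  = np.array([0., 1., 1., 0., 1.])
W1 = np.array([-1., 1., -1.])
W2 = np.ones(5) / 5.0
print(grad_W1(X, W1, W2, y=1.0))   # [ 0.   -0.32  0.  ]: a k x 1 gradient
</code></pre>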