<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@223eef19ae734640bf0aa90ce95f8e4e" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Introduction</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@ee94eed1b2a64d54830d2407a24a7c88">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@ee94eed1b2a64d54830d2407a24a7c88" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p><strong>Missing data</strong></p>
<p>Missing data affects most databases: complete information simply isn't available for every measure or time point contained within the dataset.</p>
<p><strong>Electronic medical records (EMRs) suffer from the same challenge.</strong></p>
<p>Most statistical models require complete data for both exposure and outcome variables, so we must decide how to deal with missing data. Do we remove incomplete entries and (perhaps dramatically) shrink our dataset? Do we replace missing values with estimates (imputation)? And how do we evaluate these strategies?</p>
<p>Here, we will discuss the varying sources of missingness, how they influence the way we use the dataset, and how we may evaluate different missing data strategies against one another.</p>
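<p>To make the trade-off concrete, here is a minimal Python sketch of the two basic strategies: dropping incomplete entries and imputing with the column mean. The records, the column names (<code>age</code>, <code>heart_rate</code>, <code>lactate</code>), and the helper functions are hypothetical and chosen only for illustration. Mean imputation stands in for the many imputation methods available and is not a recommendation.</p>

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 71, "heart_rate": 88, "lactate": 2.1},
    {"age": 54, "heart_rate": None, "lactate": 1.4},
    {"age": 63, "heart_rate": 102, "lactate": None},
    {"age": 80, "heart_rate": 95, "lactate": 3.0},
]


def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def mean_impute(rows, column):
    """Replace missing values in one column with the mean of its observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        dict(row, **{column: fill if row[column] is None else row[column]})
        for row in rows
    ]


deleted = complete_cases(records)
imputed = mean_impute(mean_impute(records, "heart_rate"), "lactate")

print("rows before:", len(records))
print("rows after deletion:", len(deleted))
print("rows after imputation:", len(imputed))
```

<p>In this toy example, deletion halves the dataset, while imputation keeps all four rows at the cost of filling in values that were never measured.</p>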
<p><strong>Types of Missingness</strong></p>
<p>We will look at these in more detail through the following sections, but in summary we see a few types of "missingness:"</p>
<ol>
<li>The value is missing because it was forgotten, lost, or never entered. For example, we measured some value but it was never entered into the EMR.</li>
<li>The value is missing because it is not applicable to the instance. For example, we don't have ventilator information for a patient who was taken off the ventilator, and so these records are left empty in the database.</li>
<li>The value is missing because it is of no interest to the instance. For example, we didn't measure something about a patient because it is not relevant to their current condition.</li>
</ol>
<p>A key concern is whether we can identify why data is missing at all. Imputation means making an educated guess about what a value would have been at a given time point. If the data is missing for a specific, systematic reason, imputing values may introduce bias into our analysis; such data is said to be "non-recoverable." On the other hand, if the data is missing for random or unintended reasons (such as a technical error that prevented measurements from being recorded to the server), we may be able to impute what the data should have been. Such data is classified as "recoverable."</p>
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@b2a53c95d7b2416aa982d3fe42658d54">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@b2a53c95d7b2416aa982d3fe42658d54" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Learning Objectives</h3>
<p>In this unit, we will be focusing on how to handle missing data within datasets. During the course of this lesson, you should learn:</p>
<ul>
<li>The different types of missing data and the sources of missingness.</li>
<li>The options available for dealing with missing data.</li>
<li>The methods that help choose the most appropriate technique for a specific dataset.</li>
</ul>
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@b236ab5ce5ce4281a550635ef9a6e190">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@b236ab5ce5ce4281a550635ef9a6e190" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Credits</h3>
<p>Book chapter: Cátia Salgado, Carlos Azevedo, Hugo Proença and Susana Vieira.</p>
<p>EdX content: Marta Fernandes and Dan Ebner.</p>
<p>Video: The videos in this unit are presented by Jesse Raffa.</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@051ccb6d34f64bacbcfc3dc90eb9d6bb" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Missing Data Definition</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="video" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Missing data</h3>
<div
id="video_8667bb2927ae441e802940e7189b3866"
class="video closed"
data-metadata='{"end": 0.0, "speed": null, "ytTestTimeout": 1500, "publishCompletionUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866/handler/publish_completion", "ytApiUrl": "https://www.youtube.com/iframe_api", "showCaptions": "true", "streams": "1.00:SS54Qq1CZwk", "poster": null, "saveStateEnabled": false, "start": 0.0, "completionPercentage": 0.95, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866/handler/transcript/available_translations", "autoplay": false, "transcriptLanguages": {"en": "English"}, "autohideHtml5": false, "transcriptTranslationUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866/handler/transcript/translation/__lang__", "ytMetadataEndpoint": "", "generalSpeed": 1.0, "lmsRootURL": "https://openlearninglibrary.mit.edu", "autoAdvance": false, "completionEnabled": false, "savedVideoPosition": 0.0, "saveStateUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866/handler/xmodule_handler/save_user_state", "captionDataDir": null, "sources": [], "transcriptLanguage": "en", "recordedYoutubeIsAvailable": true, "prioritizeHls": false, "duration": 0.0}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="8667bb2927ae441e802940e7189b3866"></div>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
<h3 class="hd hd-4 downloads-heading sr" id="video-download-transcripts_8667bb2927ae441e802940e7189b3866">Downloads and transcripts</h3>
<div class="wrapper-downloads" role="region" aria-labelledby="video-download-transcripts_8667bb2927ae441e802940e7189b3866">
<div class="wrapper-download-transcripts">
<h4 class="hd hd-5">Transcripts</h4>
<ul class="list-download-transcripts">
<li class="transcript-option">
<a class="btn btn-link" href="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866/handler/transcript/download" data-value="srt">Download SubRip (.srt) file</a>
</li>
<li class="transcript-option">
<a class="btn btn-link" href="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@8667bb2927ae441e802940e7189b3866/handler/transcript/download" data-value="txt">Download Text (.txt) file</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@0ebd832052bb4a04a5baed84ca18e3d3">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@0ebd832052bb4a04a5baed84ca18e3d3" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_0ebd832052bb4a04a5baed84ca18e3d3" class="problems-wrapper" role="group"
aria-labelledby="0ebd832052bb4a04a5baed84ca18e3d3-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@0ebd832052bb4a04a5baed84ca18e3d3" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@0ebd832052bb4a04a5baed84ca18e3d3/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="0ebd832052bb4a04a5baed84ca18e3d3-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@0ebd832052bb4a04a5baed84ca18e3d3-problem-progress" tabindex="-1">
Question 1
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@0ebd832052bb4a04a5baed84ca18e3d3-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><div class="choicegroup capa_inputtype" id="inputtype_0ebd832052bb4a04a5baed84ca18e3d3_2_1">
<fieldset aria-describedby="status_0ebd832052bb4a04a5baed84ca18e3d3_2_1">
<legend id="0ebd832052bb4a04a5baed84ca18e3d3_2_1-legend" class="response-fieldset-legend field-group-hd">What is imputation?</legend>
<div class="field">
<input type="radio" name="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1" id="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="0ebd832052bb4a04a5baed84ca18e3d3_2_1-choice_0-label" for="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_0ebd832052bb4a04a5baed84ca18e3d3_2_1"> Manually inputting "best guess" values for missing data.
</label>
</div>
<div class="field">
<input type="radio" name="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1" id="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="0ebd832052bb4a04a5baed84ca18e3d3_2_1-choice_1-label" for="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_0ebd832052bb4a04a5baed84ca18e3d3_2_1"> Statistical process for substituting in values for missing data.
</label>
</div>
<div class="field">
<input type="radio" name="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1" id="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="0ebd832052bb4a04a5baed84ca18e3d3_2_1-choice_2-label" for="input_0ebd832052bb4a04a5baed84ca18e3d3_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_0ebd832052bb4a04a5baed84ca18e3d3_2_1"> Process of removing data with missing values from the dataset for further analysis.
</label>
</div>
<span id="answer_0ebd832052bb4a04a5baed84ca18e3d3_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_0ebd832052bb4a04a5baed84ca18e3d3_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Question 1" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_0ebd832052bb4a04a5baed84ca18e3d3" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_0ebd832052bb4a04a5baed84ca18e3d3">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
</div>
"
data-graded="False">
</div>
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@97939a7e93334a2f8def8ecb9e4b22ba">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@97939a7e93334a2f8def8ecb9e4b22ba" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Exercises</h3>
<p>For hands-on practice with the MIMIC clinical demo dataset, we illustrate some techniques for handling and imputing missing data in <a href="https://github.com/criticaldata/hst953-edx/blob/master/2.05%20Missing%20Data/Missing%20Data.Rmd" target="_blank">this GitHub repository</a>.</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@1e001154eb5c4d94b6d8be69b86f59be" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Types of Missingness</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@e716645911ab414c992db6179b539e1a">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@e716645911ab414c992db6179b539e1a" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>As we described in the introduction, the reason data is missing largely determines how we should deal with it. There are formal probabilistic models for this (which you can find in the Critical Data textbook and in the videos accompanying this lecture), but here we will describe the categories in general language.</p>
<p></p>
<p><strong>Missing Completely at Random (MCAR):</strong></p>
<p>This describes a scenario in which nothing links the missing data points together: they are truly random. In technical terms, the probability that an observation is missing depends neither on the data you have nor on the data that is missing. You might imagine a computer glitch causing every other patient's gender to go unrecorded: there is no hidden relationship underlying the missingness of the data.</p>
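<p>A small simulation can illustrate why MCAR is the most benign case. The sketch below assumes a hypothetical cohort of blood pressure values and a 30% loss rate that applies equally to every value; both numbers are invented for illustration. Because the loss is unrelated to the data, the mean of the remaining values should stay close to the mean of the full data.</p>

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical cohort: systolic blood pressure values (mmHg).
true_bp = [rng.gauss(120, 15) for _ in range(20000)]

# MCAR: each value has the same 30% chance of being lost,
# regardless of its own size or of any other variable.
observed_bp = [bp for bp in true_bp if rng.random() >= 0.30]

fraction_observed = len(observed_bp) / len(true_bp)
bias = mean(observed_bp) - mean(true_bp)

print(f"fraction observed: {fraction_observed:.2f}")
print(f"bias of the complete-case mean: {bias:.2f} mmHg")
```

<p>Under these assumptions the dataset shrinks, but the complete-case estimate is not systematically shifted; what we lose is precision, not accuracy.</p>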
<p></p>
<p><strong>Missing at Random (MAR):</strong></p>
<p>In this case, the probability that the data is missing depends on the observed data (but not on the missing values themselves), so the missing values can be estimated from the observed data because the two are statistically related. Essentially, some observed variable (or combination of variables) allows us to make an inference about the missing data. For example, suppose elderly people are more likely to forget that they previously had pneumonia, and so fail to tell you. You would have many missing values for your variable "has previously had pneumonia," but this missingness would be tied to the "age" variable, and so the missing values may be inferable.</p>
<p>The name is a bit of a misnomer: the data is not missing at random, but is missing because of a known or identifiable factor that is within your observed data.</p>
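<p>The pneumonia example can be sketched as a simulation. All rates below (the share of elderly patients, the pneumonia rates, and the missingness rates) are invented for illustration, and the helper <code>rate</code> is hypothetical. The point is only that, when missingness depends on an observed variable such as age, ignoring that variable gives a shifted estimate, while estimating within age groups can recover it.</p>

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def rate(flags):
    """Proportion of True values in a list of booleans."""
    return sum(flags) / len(flags)


patients = []
for _ in range(20000):
    elderly = rng.random() < 0.5
    had_pneumonia = rng.random() < (0.40 if elderly else 0.10)
    # MAR: whether the answer is missing depends only on age,
    # which is recorded for everyone.
    missing = rng.random() < (0.60 if elderly else 0.10)
    patients.append((elderly, had_pneumonia, missing))

true_rate = rate([had for _, had, _ in patients])

# Complete-case estimate: ignore age and use only the recorded answers.
naive_rate = rate([had for _, had, missing in patients if not missing])

# Age-adjusted estimate: estimate the rate within each age group from the
# recorded answers, then weight each group by its share of all patients.
adjusted_rate = 0.0
for group in (True, False):
    in_group = [p for p in patients if p[0] == group]
    recorded = [had for _, had, missing in in_group if not missing]
    adjusted_rate += len(in_group) / len(patients) * rate(recorded)

print(f"true {true_rate:.3f}  naive {naive_rate:.3f}  adjusted {adjusted_rate:.3f}")
```

<p>With these assumed rates, the naive estimate understates how common prior pneumonia is, because elderly patients (who have it more often) are under-represented among the recorded answers. The age-adjusted estimate corrects for this using only observed data.</p>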
<p></p>
<p><strong>Missing Not at Random (MNAR):</strong></p>
<p>As you may have guessed from the above, this is when neither MCAR nor MAR holds: the missingness depends in some way on the values that are missing, and so those values cannot be trivially derived (if at all) from the observed data that you have. For example, imagine that patients with low blood pressure have their blood pressure measured less frequently, creating missing data for "blood pressure" that depends on the low blood pressure values themselves.</p>
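<p>The blood pressure example can also be simulated. The sketch below assumes, purely for illustration, that readings below 100 mmHg are lost 80% of the time and all other readings 10% of the time; the threshold, the rates, and the function <code>is_recorded</code> are hypothetical. Because the chance of loss depends on the value itself, the recorded values no longer represent the full cohort.</p>

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical cohort: systolic blood pressure values (mmHg).
true_bp = [rng.gauss(120, 15) for _ in range(20000)]


def is_recorded(bp):
    """MNAR: the chance of losing a reading depends on the reading itself."""
    p_missing = 0.80 if bp < 100 else 0.10
    return rng.random() >= p_missing


observed_bp = [bp for bp in true_bp if is_recorded(bp)]
bias = mean(observed_bp) - mean(true_bp)

print(f"mean of all values:      {mean(true_bp):.1f} mmHg")
print(f"mean of recorded values: {mean(observed_bp):.1f} mmHg")
print(f"bias: {bias:.1f} mmHg")
```

<p>Under these assumptions the recorded mean is too high, and no other variable in the dataset explains which readings were lost, so the shift cannot be corrected from the observed data alone.</p>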
<p></p>
<p>With some practice, these can easily become second nature.</p>
<p></p>
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@167b95a588ef4970bb0d148f701f87a5">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@167b95a588ef4970bb0d148f701f87a5" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_167b95a588ef4970bb0d148f701f87a5" class="problems-wrapper" role="group"
aria-labelledby="167b95a588ef4970bb0d148f701f87a5-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@167b95a588ef4970bb0d148f701f87a5" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@167b95a588ef4970bb0d148f701f87a5/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="167b95a588ef4970bb0d148f701f87a5-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@167b95a588ef4970bb0d148f701f87a5-problem-progress" tabindex="-1">
Question 1
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@167b95a588ef4970bb0d148f701f87a5-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><p>What form of missing data is this?</p>
<div class="choicegroup capa_inputtype" id="inputtype_167b95a588ef4970bb0d148f701f87a5_2_1">
<fieldset aria-describedby="status_167b95a588ef4970bb0d148f701f87a5_2_1">
<legend id="167b95a588ef4970bb0d148f701f87a5_2_1-legend" class="response-fieldset-legend field-group-hd">In a height test, shorter people have missing observations for their height.</legend>
<div class="field">
<input type="radio" name="input_167b95a588ef4970bb0d148f701f87a5_2_1" id="input_167b95a588ef4970bb0d148f701f87a5_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="167b95a588ef4970bb0d148f701f87a5_2_1-choice_0-label" for="input_167b95a588ef4970bb0d148f701f87a5_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_167b95a588ef4970bb0d148f701f87a5_2_1"> MNAR
</label>
</div>
<div class="field">
<input type="radio" name="input_167b95a588ef4970bb0d148f701f87a5_2_1" id="input_167b95a588ef4970bb0d148f701f87a5_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="167b95a588ef4970bb0d148f701f87a5_2_1-choice_1-label" for="input_167b95a588ef4970bb0d148f701f87a5_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_167b95a588ef4970bb0d148f701f87a5_2_1"> MAR
</label>
</div>
<div class="field">
<input type="radio" name="input_167b95a588ef4970bb0d148f701f87a5_2_1" id="input_167b95a588ef4970bb0d148f701f87a5_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="167b95a588ef4970bb0d148f701f87a5_2_1-choice_2-label" for="input_167b95a588ef4970bb0d148f701f87a5_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_167b95a588ef4970bb0d148f701f87a5_2_1"> MCAR
</label>
</div>
<span id="answer_167b95a588ef4970bb0d148f701f87a5_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_167b95a588ef4970bb0d148f701f87a5_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Question 1" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_167b95a588ef4970bb0d148f701f87a5" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_167b95a588ef4970bb0d148f701f87a5">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
</div>
"
data-graded="False">
</div>
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1cd0856393904020b8d65826905799fd">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1cd0856393904020b8d65826905799fd" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_1cd0856393904020b8d65826905799fd" class="problems-wrapper" role="group"
aria-labelledby="1cd0856393904020b8d65826905799fd-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1cd0856393904020b8d65826905799fd" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@1cd0856393904020b8d65826905799fd/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="1cd0856393904020b8d65826905799fd-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@1cd0856393904020b8d65826905799fd-problem-progress" tabindex="-1">
Question 2
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1cd0856393904020b8d65826905799fd-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 2" role="group"><p>What form of missing data is this?</p>
<div class="choicegroup capa_inputtype" id="inputtype_1cd0856393904020b8d65826905799fd_2_1">
<fieldset aria-describedby="status_1cd0856393904020b8d65826905799fd_2_1">
<legend id="1cd0856393904020b8d65826905799fd_2_1-legend" class="response-fieldset-legend field-group-hd">In a height test, young children are missing their height information.</legend>
<div class="field">
<input type="radio" name="input_1cd0856393904020b8d65826905799fd_2_1" id="input_1cd0856393904020b8d65826905799fd_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="1cd0856393904020b8d65826905799fd_2_1-choice_0-label" for="input_1cd0856393904020b8d65826905799fd_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_1cd0856393904020b8d65826905799fd_2_1"> MNAR
</label>
</div>
<div class="field">
<input type="radio" name="input_1cd0856393904020b8d65826905799fd_2_1" id="input_1cd0856393904020b8d65826905799fd_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="1cd0856393904020b8d65826905799fd_2_1-choice_1-label" for="input_1cd0856393904020b8d65826905799fd_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_1cd0856393904020b8d65826905799fd_2_1"> MAR
</label>
</div>
<div class="field">
<input type="radio" name="input_1cd0856393904020b8d65826905799fd_2_1" id="input_1cd0856393904020b8d65826905799fd_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="1cd0856393904020b8d65826905799fd_2_1-choice_2-label" for="input_1cd0856393904020b8d65826905799fd_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_1cd0856393904020b8d65826905799fd_2_1"> MCAR
</label>
</div>
<span id="answer_1cd0856393904020b8d65826905799fd_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_1cd0856393904020b8d65826905799fd_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Question 2" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_1cd0856393904020b8d65826905799fd" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_1cd0856393904020b8d65826905799fd">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
<div class="vert vert-3" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@042d0c99a94d4e2487f6cf9018e49787">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@042d0c99a94d4e2487f6cf9018e49787" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_042d0c99a94d4e2487f6cf9018e49787" class="problems-wrapper" role="group"
aria-labelledby="042d0c99a94d4e2487f6cf9018e49787-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@042d0c99a94d4e2487f6cf9018e49787" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@042d0c99a94d4e2487f6cf9018e49787/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="042d0c99a94d4e2487f6cf9018e49787-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@042d0c99a94d4e2487f6cf9018e49787-problem-progress" tabindex="-1">
Question 3
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@042d0c99a94d4e2487f6cf9018e49787-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><p>What form of missing data is this?</p>
<div class="choicegroup capa_inputtype" id="inputtype_042d0c99a94d4e2487f6cf9018e49787_2_1">
<fieldset aria-describedby="status_042d0c99a94d4e2487f6cf9018e49787_2_1">
<legend id="042d0c99a94d4e2487f6cf9018e49787_2_1-legend" class="response-fieldset-legend field-group-hd">In a height test, every fifth person tested had their data shredded.</legend>
<div class="field">
<input type="radio" name="input_042d0c99a94d4e2487f6cf9018e49787_2_1" id="input_042d0c99a94d4e2487f6cf9018e49787_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="042d0c99a94d4e2487f6cf9018e49787_2_1-choice_0-label" for="input_042d0c99a94d4e2487f6cf9018e49787_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_042d0c99a94d4e2487f6cf9018e49787_2_1"> MNAR
</label>
</div>
<div class="field">
<input type="radio" name="input_042d0c99a94d4e2487f6cf9018e49787_2_1" id="input_042d0c99a94d4e2487f6cf9018e49787_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="042d0c99a94d4e2487f6cf9018e49787_2_1-choice_1-label" for="input_042d0c99a94d4e2487f6cf9018e49787_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_042d0c99a94d4e2487f6cf9018e49787_2_1"> MAR
</label>
</div>
<div class="field">
<input type="radio" name="input_042d0c99a94d4e2487f6cf9018e49787_2_1" id="input_042d0c99a94d4e2487f6cf9018e49787_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="042d0c99a94d4e2487f6cf9018e49787_2_1-choice_2-label" for="input_042d0c99a94d4e2487f6cf9018e49787_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_042d0c99a94d4e2487f6cf9018e49787_2_1"> MCAR
</label>
</div>
<span id="answer_042d0c99a94d4e2487f6cf9018e49787_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_042d0c99a94d4e2487f6cf9018e49787_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Question 3" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_042d0c99a94d4e2487f6cf9018e49787" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_042d0c99a94d4e2487f6cf9018e49787">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
<div class="notification warning notification-gentle-alert
is-hidden"
tabindex="-1">
<span class="icon fa fa-exclamation-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="042d0c99a94d4e2487f6cf9018e49787-problem-title">
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification warning notification-save
is-hidden"
tabindex="-1">
<span class="icon fa fa-save" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="042d0c99a94d4e2487f6cf9018e49787-problem-title">None
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification general notification-show-answer
is-hidden"
tabindex="-1">
<span class="icon fa fa-info-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="042d0c99a94d4e2487f6cf9018e49787-problem-title">Answers are displayed within the problem
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@55f47778af614b0fa26d9a08d0805579" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">How to Handle Missing Data</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@ce44b19e7d964bc5a88235d638a08380">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@ce44b19e7d964bc5a88235d638a08380" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>So what do we do about missing data? Should we ignore it, make up random data to fill its place, or delete the incomplete records? This question has principled statistical answers, and we are going to talk about them next. </p>
<p><b>How much data is missing?</b></p>
<p>This is an important consideration from the start: calculate the percentage of missing data for each variable you are considering before moving forward. This will start to clarify what your next steps might be. For example, if 1% of height data is missing, then it may be most effective to remove the patients with missing height values from your analysis. If 99% of a variable's values are missing, then it may be more helpful to remove that variable from the analysis altogether. If the percentage is somewhere in the middle, however, you will have to consider other options. </p>
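<p>As a minimal sketch of this first step, the snippet below computes the percentage of missing values per variable. It assumes records are stored as a non-empty list of dictionaries with <code>None</code> marking a missing value; the function name <code>missing_percentage</code> and the sample patients are hypothetical and used only for illustration.</p>

```python
def missing_percentage(records, variables):
    """Return {variable: percentage of records whose value is None}."""
    total = len(records)  # assumed to be non-zero
    return {
        var: 100.0 * sum(1 for r in records if r.get(var) is None) / total
        for var in variables
    }


# Hypothetical toy data: None marks a missing value.
patients = [
    {"height": 170, "weight": 65, "location": None},
    {"height": None, "weight": 80, "location": None},
    {"height": 182, "weight": None, "location": "Boston"},
    {"height": 165, "weight": 58, "location": None},
]

pct = missing_percentage(patients, ["height", "weight", "location"])
for variable, percent in pct.items():
    print(f"{variable}: {percent:.1f}% missing")
```

<p>A variable that is mostly empty (here, <code>location</code>) is a candidate for removal, while a variable with only a few gaps may be handled by deleting or imputing the affected records.</p>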
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1172b554b5514a0aac422eb93c01be77">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1172b554b5514a0aac422eb93c01be77" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_1172b554b5514a0aac422eb93c01be77" class="problems-wrapper" role="group"
aria-labelledby="1172b554b5514a0aac422eb93c01be77-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1172b554b5514a0aac422eb93c01be77" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@1172b554b5514a0aac422eb93c01be77/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="1172b554b5514a0aac422eb93c01be77-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@1172b554b5514a0aac422eb93c01be77-problem-progress" tabindex="-1">
Question 1
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1172b554b5514a0aac422eb93c01be77-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><div class="choicegroup capa_inputtype" id="inputtype_1172b554b5514a0aac422eb93c01be77_2_1">
<fieldset aria-describedby="status_1172b554b5514a0aac422eb93c01be77_2_1">
<legend id="1172b554b5514a0aac422eb93c01be77_2_1-legend" class="response-fieldset-legend field-group-hd">You have a database with 10 columns and 10,000 rows of data, but only 500 rows have a value in the "location" column. What might be the best way to handle this?</legend>
<div class="field">
<input type="radio" name="input_1172b554b5514a0aac422eb93c01be77_2_1" id="input_1172b554b5514a0aac422eb93c01be77_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="1172b554b5514a0aac422eb93c01be77_2_1-choice_0-label" for="input_1172b554b5514a0aac422eb93c01be77_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_1172b554b5514a0aac422eb93c01be77_2_1"> Fill in the remaining 9,500 lines using statistical algorithms and averages.
</label>
</div>
<div class="field">
<input type="radio" name="input_1172b554b5514a0aac422eb93c01be77_2_1" id="input_1172b554b5514a0aac422eb93c01be77_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="1172b554b5514a0aac422eb93c01be77_2_1-choice_1-label" for="input_1172b554b5514a0aac422eb93c01be77_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_1172b554b5514a0aac422eb93c01be77_2_1"> Delete the column and use the remaining 9 columns of data.
</label>
</div>
<div class="field">
<input type="radio" name="input_1172b554b5514a0aac422eb93c01be77_2_1" id="input_1172b554b5514a0aac422eb93c01be77_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="1172b554b5514a0aac422eb93c01be77_2_1-choice_2-label" for="input_1172b554b5514a0aac422eb93c01be77_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_1172b554b5514a0aac422eb93c01be77_2_1"> Apply TensorFlow machine learning on the existing data to generate "best guess" estimates about what is missing.
</label>
</div>
<span id="answer_1172b554b5514a0aac422eb93c01be77_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_1172b554b5514a0aac422eb93c01be77_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Question 1" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_1172b554b5514a0aac422eb93c01be77" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_1172b554b5514a0aac422eb93c01be77">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
<div class="notification warning notification-gentle-alert
is-hidden"
tabindex="-1">
<span class="icon fa fa-exclamation-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="1172b554b5514a0aac422eb93c01be77-problem-title">
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification warning notification-save
is-hidden"
tabindex="-1">
<span class="icon fa fa-save" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="1172b554b5514a0aac422eb93c01be77-problem-title">None
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification general notification-show-answer
is-hidden"
tabindex="-1">
<span class="icon fa fa-info-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="1172b554b5514a0aac422eb93c01be77-problem-title">Answers are displayed within the problem
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@e72b6c07d47a48efa81352bc5c64b934">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@e72b6c07d47a48efa81352bc5c64b934" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p><strong>Overview of Handling Missing Data</strong></p>
<p>Whether missing data is MCAR, MAR, or MNAR shapes how you approach processing the missing data. In general, we aim for the simplest approach and one that minimizes the amount of bias introduced into our analysis. </p>
<p>When data are MCAR or MAR, the missingness mechanism is considered ignorable: it does not have to be modeled explicitly, so a wider range of methods can be applied without introducing bias. This is a terrific advantage, but it is difficult to establish that data are truly MCAR or MAR. As a check, we can process the data with several methods and compare the results; large differences suggest that our assumptions about the mechanism may not hold. </p>
<p>The key methods for handling missing data are:</p>
<ol>
<li>Deletion methods</li>
<li>Single imputation methods</li>
<li>Model-based methods</li>
</ol>
</div>
</div>
<div class="vert vert-3" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@ed773547b7ee43e0a30b017f9c679163">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@ed773547b7ee43e0a30b017f9c679163" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p><strong>Deletion Methods</strong></p>
<p>The simplest way of handling missing data is to delete the cases with missing values. Generally, this leads to valid inferences only when the data are MCAR.</p>
<p><em>Complete-Case Analysis (Listwise Analysis)</em></p>
<p>All observations with missing data are discarded so that only individuals with complete data are kept. For example, if a record has 100 variables and the value of just one of them is missing, then the entire record is deleted.</p>
<p>This assumes that the data are missing completely at random, so that the remaining data are representative of the population and no subgroup bias is created. Because it relies on an MCAR mechanism, this assumption is fairly restrictive. However, the method is simple and reasonable to use when the number of observations to discard is relatively small. </p>
<p>The trade-off is that power is reduced (the resulting sample size is smaller), which leads to larger standard errors in the analysis, and if the data are not actually MCAR, bias is likely introduced. </p>
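<p>Complete-case analysis can be sketched in a few lines. The example assumes records are dictionaries with <code>None</code> marking a missing value; <code>complete_cases</code> and the sample data are hypothetical names chosen for illustration.</p>

```python
def complete_cases(records):
    """Keep only records in which every variable has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical toy data: None marks a missing value.
patients = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": None, "weight": 80},
    {"id": 3, "height": 182, "weight": 90},
    {"id": 4, "height": 165, "weight": None},
    {"id": 5, "height": 175, "weight": 72},
]

kept = complete_cases(patients)
discarded_fraction = 1 - len(kept) / len(patients)
print(f"kept {len(kept)} of {len(patients)} records")
print(f"discarded {discarded_fraction:.0%} of the sample")
```

<p>Reporting the discarded fraction alongside the result makes the loss of sample size, and therefore of power, explicit.</p>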
<p><em>Available-Case Analysis</em></p>
<p>Only the cases missing a variable needed for the specific analysis at hand are deleted, as opposed to every incomplete entry. If 10% of your patients are missing a data entry about "height" but your analysis only needs "weight," then you keep every patient with a recorded weight, including those who are missing height. If you need both height and weight for the analysis, then any cases missing either height or weight are removed. This preserves more information, but the populations of each analysis may be different and hence non-comparable. </p>
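<p>The following sketch illustrates available-case analysis under the same assumptions as before (dictionaries with <code>None</code> for missing values). The helper <code>available_cases</code> is a hypothetical name; it filters only on the variables a given analysis needs.</p>

```python
def available_cases(records, needed):
    """Keep records that have a value for every variable in `needed`."""
    return [r for r in records if all(r.get(v) is not None for v in needed)]


# Hypothetical toy data: None marks a missing value.
patients = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": None, "weight": 80},
    {"id": 3, "height": 182, "weight": None},
    {"id": 4, "height": None, "weight": 58},
]

# An analysis that needs only weight keeps patients who lack height.
weight_only = available_cases(patients, ["weight"])
# An analysis that needs both variables keeps far fewer patients.
height_and_weight = available_cases(patients, ["height", "weight"])

print([r["id"] for r in weight_only])
print([r["id"] for r in height_and_weight])
```

<p>The two analyses run on different subsets of patients, which is why their results may not be directly comparable.</p>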
<p>We will evaluate Weighting-Case Analysis next. </p>
</div>
</div>
<div class="vert vert-4" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@682edba06d3b4ec4a7a6d3d1dd28327c">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@682edba06d3b4ec4a7a6d3d1dd28327c" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_682edba06d3b4ec4a7a6d3d1dd28327c" class="problems-wrapper" role="group"
aria-labelledby="682edba06d3b4ec4a7a6d3d1dd28327c-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@682edba06d3b4ec4a7a6d3d1dd28327c" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@682edba06d3b4ec4a7a6d3d1dd28327c/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="682edba06d3b4ec4a7a6d3d1dd28327c-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@682edba06d3b4ec4a7a6d3d1dd28327c-problem-progress" tabindex="-1">
Question 2
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@682edba06d3b4ec4a7a6d3d1dd28327c-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><div class="choicegroup capa_inputtype" id="inputtype_682edba06d3b4ec4a7a6d3d1dd28327c_2_1">
<fieldset aria-describedby="status_682edba06d3b4ec4a7a6d3d1dd28327c_2_1">
<legend id="682edba06d3b4ec4a7a6d3d1dd28327c_2_1-legend" class="response-fieldset-legend field-group-hd">When might the complete-case analysis method of deletion be applied?</legend>
<div class="field">
<input type="radio" name="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1" id="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="682edba06d3b4ec4a7a6d3d1dd28327c_2_1-choice_0-label" for="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_682edba06d3b4ec4a7a6d3d1dd28327c_2_1"> When your data was corrupted, and a specific section of the table was lost.
</label>
</div>
<div class="field">
<input type="radio" name="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1" id="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="682edba06d3b4ec4a7a6d3d1dd28327c_2_1-choice_1-label" for="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_682edba06d3b4ec4a7a6d3d1dd28327c_2_1"> When you need to use specific datapoints throughout the dataset.
</label>
</div>
<div class="field">
<input type="radio" name="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1" id="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="682edba06d3b4ec4a7a6d3d1dd28327c_2_1-choice_2-label" for="input_682edba06d3b4ec4a7a6d3d1dd28327c_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_682edba06d3b4ec4a7a6d3d1dd28327c_2_1"> When your analysis hinges on having all columns of data available, and missing data is random.
</label>
</div>
<span id="answer_682edba06d3b4ec4a7a6d3d1dd28327c_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_682edba06d3b4ec4a7a6d3d1dd28327c_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Question 2" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_682edba06d3b4ec4a7a6d3d1dd28327c" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_682edba06d3b4ec4a7a6d3d1dd28327c">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
<div class="notification warning notification-gentle-alert
is-hidden"
tabindex="-1">
<span class="icon fa fa-exclamation-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="682edba06d3b4ec4a7a6d3d1dd28327c-problem-title">
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification warning notification-save
is-hidden"
tabindex="-1">
<span class="icon fa fa-save" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="682edba06d3b4ec4a7a6d3d1dd28327c-problem-title">None
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification general notification-show-answer
is-hidden"
tabindex="-1">
<span class="icon fa fa-info-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="682edba06d3b4ec4a7a6d3d1dd28327c-problem-title">Answers are displayed within the problem
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@aef7edfa86394ff29da9ca30d702eff0" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Handling Missing Data: Weighting-Case Analysis</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@42aac6475fda452994f93028962e99ed">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@42aac6475fda452994f93028962e99ed" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>Weighting models the probability of missingness to reduce the bias introduced by deleting incomplete cases. A related strategy is to calculate replacements for the missing values; this process is called imputation. Functionally speaking, you are building a model of your data and then using that model, whatever it looks like, to estimate what is missing. </p>
<p><strong>Single-Value Imputation</strong></p>
<p>Here, we generate a single replacement value for each empty cell in our table. Different methods may produce different predicted values for the same missing entry. </p>
<p><em>Mean and Median</em></p>
<p>You have probably done this before in your own data without realizing it: you simply replace the missing value with the mean or median of the observed values of that variable. The median is the better choice if your data contain many outliers, because the mean is more sensitive to them.</p>
<p>The downsides, however, are that this reduces the natural variability within your data and causes standard errors to be underestimated. It also disregards any natural relationship between variables and weakens the correlations between them. This introduces its own set of biases.</p>
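<p>A minimal sketch of mean and median imputation using only the Python standard library is shown below. The function name <code>impute_constant</code> and the weights are hypothetical; the list deliberately contains one outlier (150 kg) to show how the two fill values differ.</p>

```python
from statistics import mean, median


def impute_constant(values, strategy=mean):
    """Replace each None with one summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical weights in kg; 150 is an outlier and None is missing.
weights_kg = [60, 62, None, 64, 150]

mean_filled = impute_constant(weights_kg, mean)
median_filled = impute_constant(weights_kg, median)

print(mean_filled[2])    # mean of 60, 62, 64, 150 is 84
print(median_filled[2])  # median of 60, 62, 64, 150 is 63.0
```

<p>The outlier pulls the mean far from the typical weight, while the median stays close to it. In both cases every gap receives the same value, which is what shrinks the variability of the imputed variable.</p>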
<p><em>Linear Interpolation</em></p>
<p>This method is well suited to time series. You interpolate between the patient's previous and next measurements and then fill in the data as needed. If your weight was 60 kg yesterday and will be 62 kg tomorrow, then you can reasonably assume that today's weight is close to 61 kg. </p>
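<p>The weight example can be reproduced with a short sketch. It assumes the measurements are equally spaced in time and stored in a list with <code>None</code> for missing values; <code>interpolate_linear</code> is a hypothetical helper. Gaps before the first or after the last observation are left untouched, because there is nothing to interpolate between.</p>

```python
def interpolate_linear(series):
    """Fill interior gaps (None) on a straight line between known neighbors."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Yesterday 60 kg, today missing, tomorrow 62 kg.
print(interpolate_linear([60, None, 62]))  # [60, 61.0, 62]

# Longer gaps are filled in equal steps; edge gaps stay missing.
print(interpolate_linear([None, 10, None, None, 16, None]))
```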
<p><i>Hot Deck and Cold Deck</i></p>
<p>These methods are related to one another. With hot deck imputation, you replace missing values with values estimated from the distribution of your current dataset; this approach is common in survey research. You first partition the data into clusters, then assign each record with a missing value to a cluster, and finally use that cluster to fill in the missing value (this could involve using the mean, median, or mode within the associated cluster). </p>
<p>Cold deck imputation is similar, but the replacement values come from a different data source, such as an earlier study or an external dataset. Hot deck imputation tends to preserve the distribution of the variable but may underestimate standard errors and variability. </p>
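<p>The sketch below shows one possible hot deck variant: a missing value is replaced by a randomly drawn observed value (a "donor") from the same cluster. It assumes the clusters are already given by a column, here the hypothetical <code>ward</code> key, rather than estimated from the data; the function name, seed, and sample records are likewise illustrative only.</p>

```python
import random


def hot_deck_impute(records, target, cluster_key, seed=0):
    """Fill missing `target` values with a random donor from the same cluster."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[cluster_key], []).append(r[target])
    imputed = []
    for r in records:
        row = dict(r)  # copy so the original records are not modified
        pool = donors.get(r[cluster_key])
        if row[target] is None and pool:
            row[target] = rng.choice(pool)
        imputed.append(row)
    return imputed


# Hypothetical toy data: None marks a missing heart rate.
patients = [
    {"ward": "ICU", "heart_rate": 110},
    {"ward": "ICU", "heart_rate": 98},
    {"ward": "ICU", "heart_rate": None},
    {"ward": "general", "heart_rate": 72},
    {"ward": "general", "heart_rate": None},
]

filled = hot_deck_impute(patients, "heart_rate", "ward")
print(filled[2]["heart_rate"], filled[4]["heart_rate"])
```

<p>Replacing <code>rng.choice(pool)</code> with the mean, median, or mode of <code>pool</code> gives the cluster-summary variant described above. A cold deck version would build <code>donors</code> from a separate dataset instead of from <code>records</code>.</p>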
<p><i>Last Observation Carried Forward</i></p>
<p>Also known as "sample and hold," this method works well with longitudinal designs. You impute the missing value based on the last available observation of an individual (i.e., you do not have the patient's weight for today, but you do have it from a month ago, so you use that value). This method assumes that the individual has not changed since their last observation, which is often untrue. </p>
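<p>Last observation carried forward is straightforward to sketch. The example assumes one patient's measurements in chronological order with <code>None</code> for missing values; <code>locf</code> is a hypothetical helper name. Values missing before the first observation remain missing, since there is nothing to carry forward.</p>

```python
def locf(series):
    """Replace each None with the most recent observed value, if any."""
    filled = []
    last = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical weights in kg for one patient over six visits.
weights_kg = [None, 70, None, None, 72, None]
print(locf(weights_kg))  # [None, 70, 70, 70, 72, 72]
```

<p>The flat runs in the output make the underlying assumption visible: the patient is treated as unchanged until the next real measurement.</p>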
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12">
<div class="xblock xblock-public_view xblock-public_view-video xmodule_display xmodule_VideoBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="video" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Video"}
</script>
<h3 class="hd hd-2">Missing data imputation</h3>
<div
id="video_64b00f03005a4cd79527dd80de936d12"
class="video closed"
data-metadata='{"end": 0.0, "speed": null, "ytTestTimeout": 1500, "publishCompletionUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12/handler/publish_completion", "ytApiUrl": "https://www.youtube.com/iframe_api", "showCaptions": "true", "streams": "1.00:dODUTJWa61c", "poster": null, "saveStateEnabled": false, "start": 0.0, "completionPercentage": 0.95, "transcriptAvailableTranslationsUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12/handler/transcript/available_translations", "autoplay": false, "transcriptLanguages": {"en": "English"}, "autohideHtml5": false, "transcriptTranslationUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12/handler/transcript/translation/__lang__", "ytMetadataEndpoint": "", "generalSpeed": 1.0, "lmsRootURL": "https://openlearninglibrary.mit.edu", "autoAdvance": false, "completionEnabled": false, "savedVideoPosition": 0.0, "saveStateUrl": "/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12/handler/xmodule_handler/save_user_state", "captionDataDir": null, "sources": [], "transcriptLanguage": "en", "recordedYoutubeIsAvailable": true, "prioritizeHls": false, "duration": 0.0}'
data-bumper-metadata='null'
data-autoadvance-enabled="False"
data-poster='null'
tabindex="-1"
>
<div class="focus_grabber first"></div>
<div class="tc-wrapper">
<div class="video-wrapper">
<span tabindex="0" class="spinner" aria-hidden="false" aria-label="Loading video player"></span>
<span tabindex="-1" class="btn-play fa fa-youtube-play fa-2x is-hidden" aria-hidden="true" aria-label="Play video"></span>
<div class="video-player-pre"></div>
<div class="video-player">
<div id="64b00f03005a4cd79527dd80de936d12"></div>
<h4 class="hd hd-4 video-error is-hidden">No playable video sources found.</h4>
<h4 class="hd hd-4 video-hls-error is-hidden">
Your browser does not support this video format. Try using a different browser.
</h4>
</div>
<div class="video-player-post"></div>
<div class="closed-captions"></div>
<div class="video-controls is-hidden">
<div>
<div class="vcr"><div class="vidtime">0:00 / 0:00</div></div>
<div class="secondary-controls"></div>
</div>
</div>
</div>
</div>
<div class="focus_grabber last"></div>
<h3 class="hd hd-4 downloads-heading sr" id="video-download-transcripts_64b00f03005a4cd79527dd80de936d12">Downloads and transcripts</h3>
<div class="wrapper-downloads" role="region" aria-labelledby="video-download-transcripts_64b00f03005a4cd79527dd80de936d12">
<div class="wrapper-download-transcripts">
<h4 class="hd hd-5">Transcripts</h4>
<ul class="list-download-transcripts">
<li class="transcript-option">
<a class="btn btn-link" href="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12/handler/transcript/download" data-value="srt">Download SubRip (.srt) file</a>
</li>
<li class="transcript-option">
<a class="btn btn-link" href="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@video+block@64b00f03005a4cd79527dd80de936d12/handler/transcript/download" data-value="txt">Download Text (.txt) file</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@646f0bdd7cb24fa794939ba3b1a5dc15">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@646f0bdd7cb24fa794939ba3b1a5dc15" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>For hands-on practice with the MIMIC clinical demo dataset, we illustrate some techniques for handling and imputing missing data in <a href="https://github.com/criticaldata/hst953-edx/blob/master/2.05%20Missing%20Data/Missing%20Data.Rmd" target="_blank">this GitHub repository</a>.</p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@74a3dfc9b9a543c68707ff4023c3270a" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Handling Missing Data: Model-Based Imputation</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@7ee23439beda4436b7aabb8b79170d0d">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@7ee23439beda4436b7aabb8b79170d0d" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>At the start of this unit, we mentioned that you can estimate missing values by building a model rather than relying on simple averaging. Here, we discuss how to approach that modeling. </p>
<p>For model-based imputation, a predictive model is built to estimate the values to substitute into the data. For each variable with missing values (the target variable), we split the data into the cases where the target is observed and the cases where it is missing, fit a predictive model of the target on the other variables using the observed cases, and then use that model to estimate the missing values. The process is repeated for each variable with missing data. </p>
<p>A number of methods can be used, many of which you have probably heard of before: linear regression, logistic regression, neural networks, and other parametric or non-parametric modeling techniques. The downside is that the values estimated by the model will be "better behaved" than actual data, and the model will perform poorly if there is no relationship between the target variable and the other variables (e.g., if you use a person's shirt color to predict their shoe size, you'll find your model works poorly). </p>
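<p>As a minimal sketch of the general recipe above, the following Python example fits a linear model on the complete cases and uses it to fill in the missing values of one target variable. The function name <code>regression_impute</code>, the synthetic height/weight data, and the assumption that the predictors themselves are fully observed are all illustrative choices, not part of the course material.</p>

```python
import numpy as np

def regression_impute(X, y):
    """Fill NaNs in y with predictions from a linear model fit on complete cases.

    X: (n, p) array of predictors, assumed fully observed in this sketch.
    y: (n,) target variable with np.nan marking missing entries.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])
    # Fit the model on the complete cases only.
    coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    # Use the fitted model to estimate the missing target values.
    y_filled = y.copy()
    y_filled[missing] = A[missing] @ coef
    return y_filled

# Synthetic data (made up): weight depends roughly linearly on height.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
weight = 0.9 * height - 85 + rng.normal(0, 5, size=200)
weight_obs = weight.copy()
weight_obs[rng.choice(200, size=40, replace=False)] = np.nan

weight_imputed = regression_impute(height.reshape(-1, 1), weight_obs)
print("missing values remaining:", int(np.isnan(weight_imputed).sum()))
```

<p>Swapping the least-squares fit for another predictive model (logistic regression for a binary target, for instance) keeps the same structure: fit on the complete cases, predict for the incomplete ones.</p>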
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1e9e5276978f4f538cc10b0a07da50b0">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1e9e5276978f4f538cc10b0a07da50b0" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_1e9e5276978f4f538cc10b0a07da50b0" class="problems-wrapper" role="group"
aria-labelledby="1e9e5276978f4f538cc10b0a07da50b0-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1e9e5276978f4f538cc10b0a07da50b0" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@1e9e5276978f4f538cc10b0a07da50b0/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="1e9e5276978f4f538cc10b0a07da50b0-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@1e9e5276978f4f538cc10b0a07da50b0-problem-progress" tabindex="-1">
Multiple Choice
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@1e9e5276978f4f538cc10b0a07da50b0-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><div class="choicegroup capa_inputtype" id="inputtype_1e9e5276978f4f538cc10b0a07da50b0_2_1">
<fieldset aria-describedby="status_1e9e5276978f4f538cc10b0a07da50b0_2_1">
<legend id="1e9e5276978f4f538cc10b0a07da50b0_2_1-legend" class="response-fieldset-legend field-group-hd">Generally speaking, what is model-based imputation?</legend>
<div class="field">
<input type="radio" name="input_1e9e5276978f4f538cc10b0a07da50b0_2_1" id="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="1e9e5276978f4f538cc10b0a07da50b0_2_1-choice_0-label" for="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_1e9e5276978f4f538cc10b0a07da50b0_2_1"> Summating all of our information together to build a model, and then using it to simulate the general trends of our dataset.
</label>
</div>
<div class="field">
<input type="radio" name="input_1e9e5276978f4f538cc10b0a07da50b0_2_1" id="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="1e9e5276978f4f538cc10b0a07da50b0_2_1-choice_1-label" for="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_1e9e5276978f4f538cc10b0a07da50b0_2_1"> Using our complete data to build a model, that is then used to "fill in" the incomplete data.
</label>
</div>
<div class="field">
<input type="radio" name="input_1e9e5276978f4f538cc10b0a07da50b0_2_1" id="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="1e9e5276978f4f538cc10b0a07da50b0_2_1-choice_2-label" for="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_1e9e5276978f4f538cc10b0a07da50b0_2_1"> Using an alternative or pre-existing dataset to build a model, that is then used to clean up our own dataset.
</label>
</div>
<div class="field">
<input type="radio" name="input_1e9e5276978f4f538cc10b0a07da50b0_2_1" id="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_3" class="field-input input-radio" value="choice_3"/><label id="1e9e5276978f4f538cc10b0a07da50b0_2_1-choice_3-label" for="input_1e9e5276978f4f538cc10b0a07da50b0_2_1_choice_3" class="response-label field-label label-inline" aria-describedby="status_1e9e5276978f4f538cc10b0a07da50b0_2_1"> A new form of fashion modeling.
</label>
</div>
<span id="answer_1e9e5276978f4f538cc10b0a07da50b0_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_1e9e5276978f4f538cc10b0a07da50b0_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Multiple Choice" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_1e9e5276978f4f538cc10b0a07da50b0" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_1e9e5276978f4f538cc10b0a07da50b0">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
<div class="notification warning notification-gentle-alert
is-hidden"
tabindex="-1">
<span class="icon fa fa-exclamation-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="1e9e5276978f4f538cc10b0a07da50b0-problem-title">
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification warning notification-save
is-hidden"
tabindex="-1">
<span class="icon fa fa-save" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="1e9e5276978f4f538cc10b0a07da50b0-problem-title">None
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification general notification-show-answer
is-hidden"
tabindex="-1">
<span class="icon fa fa-info-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="1e9e5276978f4f538cc10b0a07da50b0-problem-title">Answers are displayed within the problem
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@0bcef42cb6564c85984ec09b4e736261">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@0bcef42cb6564c85984ec09b4e736261" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Model-Based Imputation</h3>
<p></p>
<p><strong>Linear Regression</strong></p>
<p>This extends the line-fitting you learned back in algebra. We use all the variables on hand to fit a model that yields estimates of the missing observations in our target variable. It captures relationships between variables (which mean/median/mode imputation does not), but because every imputed value falls exactly on the fitted line, it can overstate the "fit" of the data, making correlations appear stronger than they actually are, and it does not account for the uncertainty in the imputed values. This leads to the underestimation of variance and covariance. </p>
<p>Multivariate imputation adds another layer of complexity, as values are likely to be missing across multiple variables and patterns of missingness may differ from one variable to the next. The method used is an iterative process that re-estimates the imputations until the model converges; your favorite statistics program likely has a package to help you with this. </p>
<p><strong style="font-size: 1em;">Stochastic Regression</strong></p>
<p>Stochastic regression reduces bias by adding a step: augmenting each predicted value with a random residual term. This term is normally distributed with a mean of zero and a variance equal to the residual variance of the regression model, and it aims to preserve the variability found within the dataset. Nonetheless, standard errors still tend to be underestimated, because the analysis treats imputed values as if they had been observed and ignores the uncertainty in them. This can increase the risk of type I error in your analysis. </p>
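<p>The difference between the two regression approaches can be illustrated with a small Python sketch. It assumes a single fully observed predictor and synthetic data; the function name <code>stochastic_regression_impute</code> is hypothetical. The deterministic imputations sit on the fitted line, while the stochastic ones add a normally distributed residual with mean zero and the residual variance estimated from the complete cases.</p>

```python
import numpy as np

def stochastic_regression_impute(x, y, rng):
    """Return (deterministic, stochastic) imputations of the NaNs in y given x."""
    missing = np.isnan(y)
    A = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    # Residual standard deviation estimated from the complete cases.
    resid = y[~missing] - A[~missing] @ coef
    dof = (~missing).sum() - A.shape[1]
    sigma = np.sqrt(resid @ resid / dof)
    pred = A[missing] @ coef
    deterministic = y.copy()
    deterministic[missing] = pred
    stochastic = y.copy()
    # Augment each prediction with a random residual term.
    stochastic[missing] = pred + rng.normal(0.0, sigma, size=pred.shape)
    return deterministic, stochastic

# Synthetic data (made up for illustration).
rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=500)
y_true = 2.0 * x + rng.normal(0, 1, size=500)
y_obs = y_true.copy()
y_obs[rng.choice(500, size=200, replace=False)] = np.nan

det, sto = stochastic_regression_impute(x, y_obs, rng)
gap = np.isnan(y_obs)
print("variance of imputed values, deterministic:", det[gap].var())
print("variance of imputed values, stochastic:   ", sto[gap].var())
print("variance of the hidden true values:       ", y_true[gap].var())
```

<p>On data like these, the deterministic imputations would be expected to show less spread than the values they replace, while the stochastic imputations should come closer to the original variability.</p>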
<p></p>
</div>
</div>
<div class="vert vert-3" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@3e2d706c7fe447578acf8f432db17f8a">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@3e2d706c7fe447578acf8f432db17f8a" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_3e2d706c7fe447578acf8f432db17f8a" class="problems-wrapper" role="group"
aria-labelledby="3e2d706c7fe447578acf8f432db17f8a-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@3e2d706c7fe447578acf8f432db17f8a" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@3e2d706c7fe447578acf8f432db17f8a/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="3e2d706c7fe447578acf8f432db17f8a-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@3e2d706c7fe447578acf8f432db17f8a-problem-progress" tabindex="-1">
Multiple Choice
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@3e2d706c7fe447578acf8f432db17f8a-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><div class="choicegroup capa_inputtype" id="inputtype_3e2d706c7fe447578acf8f432db17f8a_2_1">
<fieldset aria-describedby="status_3e2d706c7fe447578acf8f432db17f8a_2_1">
<legend id="3e2d706c7fe447578acf8f432db17f8a_2_1-legend" class="response-fieldset-legend field-group-hd">What is the key difference between Linear and Stochastic regression?</legend>
<div class="field">
<input type="radio" name="input_3e2d706c7fe447578acf8f432db17f8a_2_1" id="input_3e2d706c7fe447578acf8f432db17f8a_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="3e2d706c7fe447578acf8f432db17f8a_2_1-choice_0-label" for="input_3e2d706c7fe447578acf8f432db17f8a_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_3e2d706c7fe447578acf8f432db17f8a_2_1"> Linear regression makes straight lines when graphed, while stochastic regression does not.
</label>
</div>
<div class="field">
<input type="radio" name="input_3e2d706c7fe447578acf8f432db17f8a_2_1" id="input_3e2d706c7fe447578acf8f432db17f8a_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="3e2d706c7fe447578acf8f432db17f8a_2_1-choice_1-label" for="input_3e2d706c7fe447578acf8f432db17f8a_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_3e2d706c7fe447578acf8f432db17f8a_2_1"> Linear regression uses the model's prediction directly, while stochastic regression adds a random amount of variance to the term generated.
</label>
</div>
<div class="field">
<input type="radio" name="input_3e2d706c7fe447578acf8f432db17f8a_2_1" id="input_3e2d706c7fe447578acf8f432db17f8a_2_1_choice_2" class="field-input input-radio" value="choice_2"/><label id="3e2d706c7fe447578acf8f432db17f8a_2_1-choice_2-label" for="input_3e2d706c7fe447578acf8f432db17f8a_2_1_choice_2" class="response-label field-label label-inline" aria-describedby="status_3e2d706c7fe447578acf8f432db17f8a_2_1"> Linear regression creates a 1 to 1 relationship mathematically between terms, while stochastic regression factors in multiple variables.
</label>
</div>
<span id="answer_3e2d706c7fe447578acf8f432db17f8a_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_3e2d706c7fe447578acf8f432db17f8a_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Multiple Choice" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_3e2d706c7fe447578acf8f432db17f8a" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_3e2d706c7fe447578acf8f432db17f8a">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
<div class="notification warning notification-gentle-alert
is-hidden"
tabindex="-1">
<span class="icon fa fa-exclamation-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="3e2d706c7fe447578acf8f432db17f8a-problem-title">
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification warning notification-save
is-hidden"
tabindex="-1">
<span class="icon fa fa-save" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="3e2d706c7fe447578acf8f432db17f8a-problem-title">None
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification general notification-show-answer
is-hidden"
tabindex="-1">
<span class="icon fa fa-info-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="3e2d706c7fe447578acf8f432db17f8a-problem-title">Answers are displayed within the problem
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
<div class="vert vert-4" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@5fd14cded4d84dcf80a6b3aebabc5cbd">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@5fd14cded4d84dcf80a6b3aebabc5cbd" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p style="text-rendering: optimizelegibility; margin-top: 20px; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;"><strong style="text-rendering: optimizelegibility; margin: 0px; padding: 0px; border: 0px; outline: 0px; font-style: inherit; font-variant: inherit; font-stretch: inherit; font-size: inherit; line-height: 1.4em; font-family: inherit; vertical-align: baseline;">Multiple-Value Imputation</strong></p>
<p style="text-rendering: optimizelegibility; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;">Multiple imputation is a Monte Carlo technique developed by Rubin in the 1970s specifically for analyzing datasets with missing data. There are three key steps: </p>
<p style="text-rendering: optimizelegibility; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;">1) Imputation. Fill in the missing values several times using an imputation method that includes random variation (such as stochastic regression), producing a number of completed datasets (5-10 is generally sufficient). The differences between these datasets (specifically between the imputed values, as the observed values are the same) capture the level of uncertainty surrounding the imputation.</p>
<p style="text-rendering: optimizelegibility; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;">2) Analysis. Each completed dataset is analyzed, producing a separate analysis for each dataset.</p>
<p style="text-rendering: optimizelegibility; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;">3) Pooling. The analyses are combined into a single result, for example by averaging the estimates across the completed datasets and computing a confidence interval that reflects both the variability within each dataset and the variability between them. This pooled analysis is the final result. </p>
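<p>The three steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the imputation step uses stochastic regression refit on a bootstrap sample of the complete cases (one of several ways to reflect uncertainty in the model itself), the analysis is simply the mean of the target variable, and the pooling follows Rubin's rules with a normal approximation for the confidence interval. The function names and synthetic data are hypothetical.</p>

```python
import numpy as np

def impute_once(x, y, rng):
    """One stochastic regression imputation, refit on a bootstrap of the complete cases."""
    missing = np.isnan(y)
    obs_idx = np.flatnonzero(~missing)
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    A = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(A[boot], y[boot], rcond=None)
    resid = y[boot] - A[boot] @ coef
    sigma = np.sqrt(resid @ resid / (boot.size - 2))
    filled = y.copy()
    filled[missing] = A[missing] @ coef + rng.normal(0.0, sigma, size=missing.sum())
    return filled

def pool(estimates, variances):
    """Combine per-dataset estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = np.mean(estimates)            # pooled point estimate
    within = np.mean(variances)           # average within-imputation variance
    between = np.var(estimates, ddof=1)   # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Synthetic data (made up): the true mean of y is 1.0.
rng = np.random.default_rng(2)
n = 300
x = rng.normal(0, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)
y[rng.choice(n, size=90, replace=False)] = np.nan

# Step 1: imputation (10 completed datasets).
completed = [impute_once(x, y, rng) for _ in range(10)]
# Step 2: analysis (here, the mean of y and its sampling variance).
estimates = [d.mean() for d in completed]
variances = [d.var(ddof=1) / n for d in completed]
# Step 3: pooling.
mean_y, total_var = pool(estimates, variances)
half_width = 1.96 * np.sqrt(total_var)  # normal approximation, for brevity
print("pooled mean:", mean_y)
print("approximate 95% interval:", mean_y - half_width, mean_y + half_width)
```

<p>The between-imputation term is what separates this from a single stochastic imputation: it widens the interval to account for not knowing the missing values.</p>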
<p style="text-rendering: optimizelegibility; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;"><strong style="text-rendering: optimizelegibility; margin: 0px; padding: 0px; border: 0px; outline: 0px; font-style: inherit; font-variant: inherit; font-stretch: inherit; font-size: 1em; line-height: 1.4em; font-family: inherit; vertical-align: baseline;">K-Nearest Neighbors</strong></p>
<p style="text-rendering: optimizelegibility; margin-right: 0px; margin-left: 0px; padding: 0px; border: 0px; outline: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; vertical-align: baseline; color: #313131;">In this method, the "k" observations nearest to the observation with missing data are identified, and the mean of their values is used to fill in the missing value. Evaluating the similarity of any two observations can be the tricky part, but after normalizing the dataset, a Euclidean, Manhattan, Mahalanobis, Pearson, or other distance function can be applied to identify the complete observations most similar to the observation with missing data. With enough data, this can produce accurate estimates of the missing values, works for both qualitative and quantitative variables, and maintains the underlying correlation structure in the data. The choice of k is critical to the method's success and utility: high values of k pull in observations that are increasingly different from the one being imputed, while low values of k rely on too few neighbors and may leave out significant or even critical attributes that ideally would be captured. Finding a good middle ground can be a key challenge. </p>
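<p>A minimal Python sketch of this idea is shown below. It assumes quantitative variables only, z-score normalization, Euclidean distance computed over the columns observed in the row being imputed, and donors restricted to fully observed rows. The function name <code>knn_impute</code> and the synthetic data are illustrative assumptions.</p>

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaNs in X using the mean of the k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    # Normalize each column with statistics that ignore missing entries.
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    Z = (X - mu) / sd
    complete = ~np.isnan(X).any(axis=1)
    donors_Z, donors_X = Z[complete], X[complete]
    filled = X.copy()
    for i in np.flatnonzero(~complete):
        observed = ~np.isnan(X[i])
        # Euclidean distance to every donor, using only the observed columns.
        diff = donors_Z[:, observed] - Z[i, observed]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # Impute each missing column with the mean over the k nearest donors.
        filled[i, ~observed] = donors_X[nearest][:, ~observed].mean(axis=0)
    return filled

# Synthetic data (made up): three correlated columns, gaps in the second one.
rng = np.random.default_rng(3)
base = rng.normal(0, 1, size=(200, 1))
X_true = np.hstack([
    base,
    2 * base + rng.normal(0, 0.3, size=(200, 1)),
    -base + rng.normal(0, 0.3, size=(200, 1)),
])
X_obs = X_true.copy()
rows = rng.choice(200, size=30, replace=False)
X_obs[rows, 1] = np.nan

X_filled = knn_impute(X_obs, k=5)
print("mean absolute error:", np.abs(X_filled[rows, 1] - X_true[rows, 1]).mean())
```

<p>Rerunning the example with different values of <code>k</code> is one simple way to see the trade-off described above on your own data.</p>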
</div>
</div>
<div class="vert vert-5" data-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@70b26f83c7b149159e6a85159485f158">
<div class="xblock xblock-public_view xblock-public_view-problem xmodule_display xmodule_ProblemBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@70b26f83c7b149159e6a85159485f158" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="problem" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="True" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "Problem"}
</script>
<div id="problem_70b26f83c7b149159e6a85159485f158" class="problems-wrapper" role="group"
aria-labelledby="70b26f83c7b149159e6a85159485f158-problem-title"
data-problem-id="block-v1:MITx+HST.953x+3T2020+type@problem+block@70b26f83c7b149159e6a85159485f158" data-url="/courses/course-v1:MITx+HST.953x+3T2020/xblock/block-v1:MITx+HST.953x+3T2020+type@problem+block@70b26f83c7b149159e6a85159485f158/handler/xmodule_handler"
data-problem-score="0"
data-problem-total-possible="1"
data-attempts-used="0"
data-content="
<h3 class="hd hd-3 problem-header" id="70b26f83c7b149159e6a85159485f158-problem-title" aria-describedby="block-v1:MITx+HST.953x+3T2020+type@problem+block@70b26f83c7b149159e6a85159485f158-problem-progress" tabindex="-1">
Multiple Choice
</h3>
<div class="problem-progress" id="block-v1:MITx+HST.953x+3T2020+type@problem+block@70b26f83c7b149159e6a85159485f158-problem-progress"></div>
<div class="problem">
<div>
<div class="wrapper-problem-response" tabindex="-1" aria-label="Question 1" role="group"><div class="choicegroup capa_inputtype" id="inputtype_70b26f83c7b149159e6a85159485f158_2_1">
<fieldset aria-describedby="status_70b26f83c7b149159e6a85159485f158_2_1">
<legend id="70b26f83c7b149159e6a85159485f158_2_1-legend" class="response-fieldset-legend field-group-hd">How might multiple-value imputation differ from K-nearest neighbors?</legend>
<div class="field">
<input type="radio" name="input_70b26f83c7b149159e6a85159485f158_2_1" id="input_70b26f83c7b149159e6a85159485f158_2_1_choice_0" class="field-input input-radio" value="choice_0"/><label id="70b26f83c7b149159e6a85159485f158_2_1-choice_0-label" for="input_70b26f83c7b149159e6a85159485f158_2_1_choice_0" class="response-label field-label label-inline" aria-describedby="status_70b26f83c7b149159e6a85159485f158_2_1"> Multiple value imputation uses any and all the methods, averaging them together, while K-nearest neighbors looks to identify unique clusters of data to base imputations on.
</label>
</div>
<div class="field">
<input type="radio" name="input_70b26f83c7b149159e6a85159485f158_2_1" id="input_70b26f83c7b149159e6a85159485f158_2_1_choice_1" class="field-input input-radio" value="choice_1"/><label id="70b26f83c7b149159e6a85159485f158_2_1-choice_1-label" for="input_70b26f83c7b149159e6a85159485f158_2_1_choice_1" class="response-label field-label label-inline" aria-describedby="status_70b26f83c7b149159e6a85159485f158_2_1"> Multiple value imputation does not impute a single value but can impute all of your missing data, while K-nearest neighbors instead imputes data only K-elements away from the data you currently have.
</label>
</div>
<span id="answer_70b26f83c7b149159e6a85159485f158_2_1"/>
</fieldset>
<div class="indicator-container">
<span class="status unanswered" id="status_70b26f83c7b149159e6a85159485f158_2_1" data-tooltip="Not yet answered.">
<span class="sr">unanswered</span><span class="status-icon" aria-hidden="true"/>
</span>
</div>
</div></div>
</div>
<div class="action">
<input type="hidden" name="problem_id" value="Multiple Choice" />
<div class="submit-attempt-container">
<button type="button" class="submit btn-brand" data-submitting="Submitting" data-value="Submit" data-should-enable-submit-button="True" aria-describedby="submission_feedback_70b26f83c7b149159e6a85159485f158" >
<span class="submit-label">Submit</span>
</button>
<div class="submission-feedback" id="submission_feedback_70b26f83c7b149159e6a85159485f158">
<span class="sr">Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button.</span>
</div>
</div>
<div class="problem-action-buttons-wrapper">
</div>
</div>
<div class="notification warning notification-gentle-alert
is-hidden"
tabindex="-1">
<span class="icon fa fa-exclamation-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="70b26f83c7b149159e6a85159485f158-problem-title">
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification warning notification-save
is-hidden"
tabindex="-1">
<span class="icon fa fa-save" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="70b26f83c7b149159e6a85159485f158-problem-title">None
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
<div class="notification general notification-show-answer
is-hidden"
tabindex="-1">
<span class="icon fa fa-info-circle" aria-hidden="true"></span>
<span class="notification-message" aria-describedby="70b26f83c7b149159e6a85159485f158-problem-title">Answers are displayed within the problem
</span>
<div class="notification-btn-wrapper">
<button type="button" class="btn btn-default btn-small notification-btn review-btn sr">Review</button>
</div>
</div>
</div>
"
data-graded="False">
<p class="loading-spinner">
<i class="fa fa-spinner fa-pulse fa-2x fa-fw"></i>
<span class="sr">Loading…</span>
</p>
</div>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@dae3c092e51245fd8aabed3895eff516" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Choosing the Best Method</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@c6556eb6e78740ffb589a9a16b45dc41">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@c6556eb6e78740ffb589a9a16b45dc41" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>Different imputation methods are expected to perform differently on various datasets. We describe here a generic and simple method that can be used to evaluate the performance of various imputation methods on your own dataset, to help select the most appropriate method. Of note, this simple approach does not test the effect of deletion methods. A more complex approach is described in the case study below, in which the performance of a predictive model is tested on the dataset completed by various imputation methods.</p>
<p>Here is how to proceed:</p>
<p>1. Use a sample of your own dataset that does not contain any missing data (this will serve as the ground truth).</p>
<p>2. Introduce increasing proportions of missing data at random (e.g. 5-50 % in 5 % increments).</p>
<p>3. Reconstruct the missing data using the various methods.</p>
<p>4. Compute the sum of squared errors between the reconstructed and the original data, for each method and each proportion of missing data.</p>
<p>5. Repeat steps 1-4 a number of times (10 times, for example) and compute the average performance of each method, i.e. the average sum of squared errors (SSE).</p>
<p>6. Plot the average SSE versus proportion of missing data (1 plot per imputation method), similarly to the example shown in Fig. 1.</p>
<p>7. Choose the method that performs best at the level of missingness found in your dataset. For example, if 10 % of your data were missing, you would pick k-NN, whereas at 40 % linear regression performs better (made-up data, for illustrative purposes only).</p>
<p>Fig. 1 - Average SSE between original and reconstructed data, for various levels of missingness and two imputation methods (data only for illustrative purposes).</p>
<p><img src="/assets/courseware/v1/0f619c5b8c41f97006c96dac94a14feb/asset-v1:MITx+HST.953x+3T2020+type@asset+block/ChoiceMethod.jpg" alt="" width="298" height="269" /></p>
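<p>The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce Fig. 1: the "age" values are synthetic, only mean and median imputation are compared, the complete sample is kept fixed across repeats, and all function names are hypothetical. It uses only the Python standard library.</p>

```python
import random
import statistics

def mask_at_random(values, proportion, rng):
    """Return a copy of values with the given proportion set to None."""
    n_missing = round(proportion * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def impute_constant(values, statistic):
    """Fill None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

def sse(original, reconstructed):
    """Sum of squared errors between ground truth and reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))

def average_sse(ground_truth, methods, proportions, n_repeats=10, seed=0):
    """Average SSE for each method at each proportion of missing data."""
    rng = random.Random(seed)
    results = {name: [] for name in methods}
    for proportion in proportions:
        totals = {name: 0.0 for name in methods}
        for _ in range(n_repeats):
            masked = mask_at_random(ground_truth, proportion, rng)
            for name, impute in methods.items():
                totals[name] += sse(ground_truth, impute(masked))
        for name in methods:
            results[name].append(totals[name] / n_repeats)
    return results

# Hypothetical complete sample (step 1); real values would come from your dataset.
data_rng = random.Random(42)
ages = [data_rng.gauss(65, 15) for _ in range(200)]
proportions = [p / 100 for p in range(5, 55, 5)]  # 5 % to 50 % in 5 % steps
methods = {
    "mean": lambda v: impute_constant(v, statistics.fmean),
    "median": lambda v: impute_constant(v, statistics.median),
}
results = average_sse(ages, methods, proportions)
for name, curve in results.items():
    print(name, [round(x) for x in curve])
```

<p>Each list printed at the end is one curve of average SSE versus proportion of missing data, which can then be plotted and compared as in step 6.</p>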
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@44023eedf3e44a609d3b93d3bc002d78" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Case Study: Application of Imputation Methods</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@2786c9ff6095414fb61bdb8d6b465f6d">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@2786c9ff6095414fb61bdb8d6b465f6d" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>In this section, various imputation methods will be applied to two "real world" clinical datasets used in a study that investigated the effect of inserting an indwelling arterial catheter (IAC) in patients with respiratory failure. One dataset includes the patients who received an IAC (IAC group) and the other the patients who did not (non-IAC group). Each dataset is subdivided into 2 classes, with class 1 corresponding to patients who died within 28 days and class 0 to survivors. The proportion of missing data and potential reasons for missingness are discussed first.</p>
<p></p>
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@e8c66d6be1064c63913fc895371f3ab7">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@e8c66d6be1064c63913fc895371f3ab7" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Proportion of Missing Data and Possible Reasons for Missingness</h3>
<p></p>
<p>The proportions of missing data in some of the variables of the datasets are presented in Table 1. A subset of 26 variables was considered for testing the different imputation methods; these were selected on the assumption that missing data occurring in these variables is recoverable. Since IACs are mainly used for continuous hemodynamic monitoring and for arterial blood sampling for blood gas analysis, we can expect a higher percentage of missing data in blood gas-related variables in the non-IAC group. We can also expect patient diagnoses to often explain the absence of specific laboratory results: if a certain test is not ordered because it will most likely provide no clinical insight, a missing value will occur, and it is fair to assume that such a value lies within the normal range.</p>
<p>Table 1 - Missing data in some of the variables of the IAC and non-IAC datasets.</p>
<p><img src="/assets/courseware/v1/32342c318c74a5d13ebdb0125ca64d50/asset-v1:MITx+HST.953x+3T2020+type@asset+block/MissingnessExample.jpg" alt="" width="354" height="267" /></p>
<p>In both cases, the fact that data is missing carries information about the response, so the data is MNAR. Body mass index (BMI) has a relatively high percentage of missing data. Assuming that this variable is calculated automatically from the weight and height of patients, we can conclude that this data is MAR: BMI cannot be calculated because the height and/or weight are missing. If the weight is missing because someone forgot to enter it into the system, then it is MCAR.</p>
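<p>As a small illustration of the BMI argument, the sketch below computes the proportion of missing values per variable and derives BMI from weight and height, so that BMI is missing whenever either input is missing. The records and variable names are hypothetical and are not taken from the IAC datasets.</p>

```python
def missing_proportion(records, variable):
    """Fraction of records in which the variable is None."""
    return sum(r[variable] is None for r in records) / len(records)

def add_bmi(record):
    """Derive BMI (kg/m^2); it is missing whenever weight or height is missing."""
    weight, height = record["weight_kg"], record["height_m"]
    bmi = None if weight is None or height is None else weight / height ** 2
    return {**record, "bmi": bmi}

# Hypothetical records; the variable names are illustrative only.
records = [
    {"weight_kg": 80.0, "height_m": 1.80},
    {"weight_kg": None, "height_m": 1.65},
    {"weight_kg": 72.0, "height_m": None},
    {"weight_kg": 64.0, "height_m": 1.60},
]
records = [add_bmi(r) for r in records]
for variable in ("weight_kg", "height_m", "bmi"):
    print(variable, missing_proportion(records, variable))
```

<p>In this toy example weight and height are each missing in a quarter of the records, while the derived BMI is missing in half of them, because its missingness is fully determined by the missingness of its inputs.</p>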
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@57cf2b33fc834193bfbb7ae4a973ee22" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Case Study: Univariate Missingness Analysis</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@516a5cd2d36741a0a329e55ba335c76f">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@516a5cd2d36741a0a329e55ba335c76f" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>In this section, the specific influence of each imputation method will be explored for the variable age, using all the other variables. Two different levels of missingness (20 and 40 %) were artificially introduced in the datasets. The original dataset represents the ground truth, to which the imputed datasets were compared using frequency histograms.</p>
<p></p>
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@25668b99d417464dbcf000cc37e43f5f">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@25668b99d417464dbcf000cc37e43f5f" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Complete-Case Analysis</h3>
<p>The complete-case analysis method discards every observation that has at least one missing value. Here, the distribution of the "imputed" dataset equals that of the original dataset minus the observations that have a missing value in the variable age. An example of the distribution of the variable age in the IAC group is shown in Fig. 1.</p>
<p>Fig. 1 - Histogram of variable age in the IAC group before and after a univariate complete case method.</p>
<p><img src="/assets/courseware/v1/7a28f5ce6a9c3a04bebd850fcf625bf9/asset-v1:MITx+HST.953x+3T2020+type@asset+block/CompleteCase.jpg" alt="" width="455" height="176" /></p>
<p>This method is only usable when the percentage of missing data is small. It requires no assumption about the distribution of the missing data, other than that the complete cases should be representative of the original population, which is difficult to prove.</p>
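<p>A minimal sketch of complete-case analysis is shown below, mainly to make the loss of data visible. The records and variable names are hypothetical.</p>

```python
def complete_cases(records):
    """Keep only the records that have no missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical records with illustrative variable names.
records = [
    {"age": 71, "sofa": 6, "died_28d": 0},
    {"age": None, "sofa": 4, "died_28d": 0},
    {"age": 83, "sofa": None, "died_28d": 1},
    {"age": 58, "sofa": 9, "died_28d": 1},
]
kept = complete_cases(records)
retained = len(kept) / len(records)
print(f"{len(kept)} of {len(records)} records kept ({retained:.0%})")
```

<p>Here only two of the four records survive, even though each variable is missing in at most one record. With many variables, the fraction of complete cases can shrink quickly.</p>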
</div>
</div>
<div class="vert vert-2" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@deacdbc851894fb5845c03b71bfd3adf">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@deacdbc851894fb5845c03b71bfd3adf" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Single Value Imputation</h3>
<p>Mean and median imputation are very crude techniques that ignore the relationship between age and the other variables and introduce a heavy bias towards the mean/median values. These simple methods help us to understand the biasing effect, which is evident in the examples in Fig. 2.</p>
<p></p>
<p>Fig. 2 - Histogram of variable age in the IAC group before (original) and after (imputed) mean for univariate imputation.</p>
<p><img src="/assets/courseware/v1/bc058f4e7ba957cac1d2a32d98c49527/asset-v1:MITx+HST.953x+3T2020+type@asset+block/MeanMedian.jpg" alt="" width="450" height="176" /></p>
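<p>The biasing effect can also be seen numerically. The sketch below, which uses synthetic "age" values and hypothetical function names, removes 40 % of the values at random, fills them with the mean or the median of the observed values, and compares the standard deviation before and after imputation.</p>

```python
import random
import statistics

def impute_constant(values, statistic):
    """Fill None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

rng = random.Random(0)
ages = [rng.gauss(65, 15) for _ in range(500)]  # synthetic "age" values
missing = set(rng.sample(range(len(ages)), 200))  # 40 % missing at random
masked = [None if i in missing else a for i, a in enumerate(ages)]

imputed_mean = impute_constant(masked, statistics.fmean)
imputed_median = impute_constant(masked, statistics.median)
print("original std:", round(statistics.pstdev(ages), 1))
print("mean-imputed std:", round(statistics.pstdev(imputed_mean), 1))
print("median-imputed std:", round(statistics.pstdev(imputed_median), 1))
```

<p>All imputed entries share a single value, which produces the spike at the center of the histogram and shrinks the spread of the variable.</p>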
<p><strong>Linear Regression Imputation</strong></p>
<p>The linear regression method imputes most of the data at the center of the distribution (example in Fig. 3). The extremes of the distribution are not well modelled and are largely ignored. This is due to two features of the technique: first, the assumption that a linear regression is a good fit to the data, and second, the assumption that the missing data lie on the regression line, bending reality to fit the deterministic nature of the model. Compared to mean/median imputation, linear regression does assume a relationship between the variables; however, it overestimates this relationship by assuming that the missing points lie exactly on the regression line. In effect, the model assumes that the percentage of variance explained is 100 %, and thus it underestimates variability.</p>
<p>Fig. 3 - Histogram of the variable age in the IAC group before (original) and after (imputed) linear regression for univariate imputation.</p>
<p><img src="/assets/courseware/v1/0c22d7e1bca4020c6d906406c2760d07/asset-v1:MITx+HST.953x+3T2020+type@asset+block/LRimputation.jpg" alt="" width="448" height="175" /></p>
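<p>The following sketch illustrates deterministic regression imputation in its simplest form. As a simplifying assumption it uses a single made-up predictor rather than all the other variables of the dataset, and the data are synthetic; the helper names are hypothetical.</p>

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def regression_impute(x, y):
    """Replace None in y by the value predicted from x (no noise added)."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

# Synthetic data: "age" loosely related to one made-up predictor in [0, 1].
rng = random.Random(1)
predictor = [rng.uniform(0, 1) for _ in range(500)]
age = [50 + 30 * p + rng.gauss(0, 12) for p in predictor]
missing = set(rng.sample(range(len(age)), 200))  # 40 % missing at random
masked = [None if i in missing else a for i, a in enumerate(age)]

imputed = regression_impute(predictor, masked)
print("original std:", round(statistics.pstdev(age), 1))
print("imputed std:", round(statistics.pstdev(imputed), 1))
```

<p>Because every imputed point lies exactly on the fitted line, the imputed variable has a smaller spread than the original one, which is the underestimation of variability described above.</p>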
<p><strong>Stochastic Linear Regression Imputation</strong></p>
<p>Stochastic linear regression is an attempt to loosen the deterministic assumption of linear regression by adding a random error term to each predicted value. In this case, the distribution of the imputed data fits the original data better than with the previous methods (Fig. 4). However, this method can introduce impossible values, such as a negative age. It is a first step towards modeling the uncertainty present in the dataset, and it represents a trade-off between the precision of the imputed values and the uncertainty introduced by the missing data.</p>
<p>Fig. 4 - Histogram of variable age in the IAC group before (original) and after (imputed) stochastic linear regression for univariate imputation.</p>
<p><img src="/assets/courseware/v1/c36cc0124e509973d56c46f185179ef5/asset-v1:MITx+HST.953x+3T2020+type@asset+block/StochasticImputation.jpg" alt="" width="439" height="172" /></p>
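<p>A minimal sketch of the stochastic variant follows. It assumes a single made-up predictor, synthetic data, and Gaussian noise whose standard deviation is estimated from the residuals of the fit; these are illustrative choices and the function name is hypothetical.</p>

```python
import random
import statistics

def stochastic_regression_impute(x, y, rng):
    """Impute None in y as the regression prediction plus Gaussian noise."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.fmean(p[0] for p in obs)
    mean_y = statistics.fmean(p[1] for p in obs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs) / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation of the fit (two parameters were estimated).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in obs)
    sigma = (rss / (len(obs) - 2)) ** 0.5
    return [
        intercept + slope * xi + rng.gauss(0, sigma) if yi is None else yi
        for xi, yi in zip(x, y)
    ]

# Synthetic data: "age" loosely related to one made-up predictor in [0, 1].
rng = random.Random(3)
predictor = [rng.uniform(0, 1) for _ in range(500)]
age = [50 + 30 * p + rng.gauss(0, 12) for p in predictor]
missing = set(rng.sample(range(len(age)), 200))  # 40 % missing at random
masked = [None if i in missing else a for i, a in enumerate(age)]

imputed = stochastic_regression_impute(predictor, masked, rng)
print("original std:", round(statistics.pstdev(age), 1))
print("imputed std:", round(statistics.pstdev(imputed), 1))
```

<p>The added noise restores most of the spread of the variable. Since the draws are unconstrained, nothing prevents an imputed value from falling outside the plausible range, so a check on the imputed values is advisable.</p>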
<p><strong>K-Nearest Neighbors</strong></p>
<p>We limit the demonstration to the case where k = 1. At the opposite extreme, where all neighbors are used without weights, this method converges to mean imputation. In our particular dataset, this method introduces a strong bias towards the central value, as depicted in Fig. 5. The reason is that almost half of the variables are binary, and these end up weighing much more heavily on the distances than the continuous variables (whose values are always less than 1, due to the unit normalization performed in data pre-processing).</p>
<p>Fig. 5 - Histogram of variable age in the IAC group before (original) and after (imputed) KNN for univariate imputation.</p>
<p><img src="/assets/courseware/v1/9e09404edd0ac8f163b170a1b70d2fa4/asset-v1:MITx+HST.953x+3T2020+type@asset+block/knn.jpg" alt="" width="449" height="174" /></p>
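<p>The effect of binary variables on the distances can be reproduced with a toy example. The sketch below implements nearest-neighbor imputation with k = 1 and the Euclidean distance; the records, the variable names and the choice of distance are assumptions made for illustration.</p>

```python
import math

def nearest_neighbor_impute(records, target):
    """Fill a missing target with the value of the closest complete record (k = 1)."""
    features = [k for k in records[0] if k != target]
    donors = [r for r in records if r[target] is not None]

    def distance(a, b):
        return math.dist([a[f] for f in features], [b[f] for f in features])

    imputed = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: distance(r, d))
            r = {**r, target: donor[target]}  # copy, so the input is not modified
        imputed.append(r)
    return imputed

# Hypothetical records: one binary feature and one continuous feature scaled to [0, 1].
records = [
    {"age": 0.30, "female": 1, "heart_rate": 0.40},
    {"age": 0.80, "female": 0, "heart_rate": 0.45},
    {"age": 0.55, "female": 1, "heart_rate": 0.90},
    {"age": None, "female": 0, "heart_rate": 0.85},
]
result = nearest_neighbor_impute(records, "age")
print(result[-1]["age"])
```

<p>The last record receives the age of the second record (0.8), which matches it on the binary variable, although the third record has an almost identical heart rate. A difference of 1 in a binary variable outweighs any difference between continuous variables scaled to the unit interval.</p>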
<p></p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@5217da5dfe494d299c527303819b5eee" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Case Study: Multiple Imputation and Multivariate Missingness Analysis</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@df3cc8a74b35456ab15284d16ad74057">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@df3cc8a74b35456ab15284d16ad74057" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>Multiple imputation with linear regression and with multivariate normal regression are extensions of the single imputation methods of the same name. They use sampling to create multiple different datasets that represent different possibilities of what the original dataset might have been. These methods model the uncertainty present in the missing values better and usually have more robust statistical properties and results. We chose to work with 10 datasets, which were averaged so that the graphical representation would be comparable to that of the previous methods.</p>
<p><strong>Multivariate normal regression</strong></p>
<p>Multiple imputation with the multivariate normal distribution gave more importance to the values at the center of the distribution (Fig. 1). The main assumption of this method is that the data follow a multivariate normal distribution, which is not entirely true for this dataset because it contains numerous binary variables. The multiple imputation method enhances the modeling of uncertainty by adding bootstrap sampling to the expectation-maximization algorithm, which gives better predictions of the missing data by considering multiple possibilities for the original data. When the data are averaged for the histogram representation, some of that richness is lost. Nonetheless, the quality of the regression is clear when compared to the previous methods.</p>
<p>Fig. 1 - Histogram of variable age in the IAC group before (original) and after (imputed) multiple imputation multivariate normal regression for univariate imputation.</p>
<p><img src="/assets/courseware/v1/6899e006f2f6875741d4dd61661689b6/asset-v1:MITx+HST.953x+3T2020+type@asset+block/MultiLR.jpg" alt="" width="426" height="165" /></p>
<p><strong>Linear regression</strong></p>
<p>The multiple imputation linear regression method uses all the variables except the target variable (age) to estimate the missing data of the target. The data is modeled using linear regression and Gibbs sampling. This is by far the most accurate imputation method for this particular dataset, as demonstrated in Fig. 2.</p>
<p>Fig. 2 - Histogram of variable age in the IAC group before (original) and after (imputed) multiple imputation generalized regression for univariate imputation.</p>
<p><img src="/assets/courseware/v1/8b54f5b77e877f8e38e4e039a8c90dcd/asset-v1:MITx+HST.953x+3T2020+type@asset+block/GeneralLR.jpg" alt="" width="445" height="175" /></p>
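<p>The idea of creating several plausible datasets can be sketched as follows. This is a simplified stand-in and not the Gibbs sampling or expectation-maximization procedures used in the case study: each of the 10 completed datasets is obtained by refitting a linear regression on a bootstrap resample of the observed data and then imputing with added Gaussian noise. The single predictor, the synthetic data and the function names are assumptions.</p>

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept, slope and residual standard deviation."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, (rss / (len(x) - 2)) ** 0.5

def multiple_impute(x, y, n_datasets=10, seed=0):
    """Return n_datasets completed copies of y, each from a bootstrap refit."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    completed = []
    for _ in range(n_datasets):
        # Resample the observed pairs so the fitted parameters vary between copies.
        boot = rng.choices(observed, k=len(observed))
        intercept, slope, sigma = fit_line([b[0] for b in boot], [b[1] for b in boot])
        completed.append([
            intercept + slope * xi + rng.gauss(0, sigma) if yi is None else yi
            for xi, yi in zip(x, y)
        ])
    return completed

# Synthetic data: "age" loosely related to one made-up predictor in [0, 1].
rng = random.Random(2)
predictor = [rng.uniform(0, 1) for _ in range(300)]
age = [50 + 30 * p + rng.gauss(0, 12) for p in predictor]
missing = set(rng.sample(range(len(age)), 60))  # 20 % missing at random
masked = [None if i in missing else a for i, a in enumerate(age)]

datasets = multiple_impute(predictor, masked)
# Average the 10 copies entry by entry, as was done here for the histograms.
averaged = [statistics.fmean(column) for column in zip(*datasets)]
# The spread across copies reflects the uncertainty about each missing value.
spread = [statistics.pstdev(column) for column in zip(*datasets)]
print("mean spread of imputed entries:",
      round(statistics.fmean(spread[i] for i in missing), 1))
```

<p>Observed entries are identical in all copies, so their spread is zero, while each missing entry receives 10 different candidate values. Averaging the copies is convenient for plotting but discards this spread, which is the loss of richness mentioned above.</p>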
<p></p>
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@43c1eca0b668477e97d34f8bc32e6e32" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Case Study: Imputation Methods on Mortality Prediction</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@922375458dd34d4ea483c8811a99ae96">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@922375458dd34d4ea483c8811a99ae96" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>This test aims to assess the generalization capabilities of the models constructed using imputed data and checks their performance by comparing them to the original data. All the methods described previously were used to reconstruct a sample of both IAC and non-IAC datasets, with increasing proportions of missing data at random, first only on the variable age (univariate), then on all the variables in the dataset (multivariate). All the methods were compared against a reference logistic regression that was fitted with the original data without missingness. The results of the area under the receiver operating characteristic curve (AUC) were averaged over a 10-fold cross-validation.</p>
<p><strong>Univariate missingness</strong></p>
<p>The variable "age" was selected for the assessment of univariate missingness. The performance of the regression models with different imputation methods is presented in Fig. 1.</p>
<p>Fig. 1 - Mean AUC performance of the logistic regression models modeled with different imputation methods for different degrees of univariate missingness of the variable "age".</p>
<p><img src="/assets/courseware/v1/66c0e2529414963d2135f3b21eb483f3/asset-v1:MITx+HST.953x+3T2020+type@asset+block/Missingness1.jpg" alt="" width="544" height="241" /></p>
<p>Among univariate techniques, the methods that performed the best on both datasets were:</p>
<ul>
<li>linear regression;</li>
<li>multivariate normal distribution;</li>
<li>one-nearest neighbor algorithm.</li>
</ul>
<p>In the case of univariate missingness, the nearest neighbor algorithm proves to be a good estimator when several complete observations exist, as is the case here. As the missingness increased, the simpler methods introduced more bias in the modeling of the datasets.</p>
<p><strong>Multivariate missingness</strong></p>
<p>The quality of the imputation methods was also evaluated in the presence of multivariate missingness with a uniform probability in all variables as presented in Fig. 2.</p>
<p>Fig. 2 - Mean AUC of the logistic regression models for different degrees of multivariate missingness.</p>
<p><img src="/assets/courseware/v1/336bf627d306583de52f7c5bcd97bd03/asset-v1:MITx+HST.953x+3T2020+type@asset+block/Missingness2.jpg" alt="" width="533" height="236" /></p>
<p>Overall, the methods performed reasonably well even with 80 % missingness in every variable. The reason is that almost half of the variables are binary and, because of their relationship with the output, reconstructing them from the most frequent value in each class is usually the best guess. The decrease in AUC was due to a decrease in sensitivity, as the specificity values remained more or less unchanged as missingness increased. The method that performed best overall was multiple imputation linear regression.</p>
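<p>For reference, the three metrics discussed here can be computed as in the sketch below. It covers only the metrics, not the logistic regression or the 10-fold cross-validation of the case study; the labels, the predicted probabilities and the 0.5 decision threshold are made up for illustration.</p>

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def sensitivity_specificity(labels, scores, threshold=0.5):
    """Fraction of class 1 and of class 0 that is classified correctly."""
    true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    true_neg = sum(1 for y, s in zip(labels, scores) if y == 0 and threshold > s)
    return true_pos / labels.count(1), true_neg / labels.count(0)

# Made-up labels (1 = died within 28 days) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
print("AUC:", auc(labels, scores))
print("sensitivity, specificity:", sensitivity_specificity(labels, scores))
```

<p>In this toy example the positive with the low score (0.3) is what reduces both the AUC and the sensitivity, while the specificity depends only on how the class 0 cases are scored.</p>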
</div>
</div>
</div>
</div>
<div class="xblock xblock-public_view xblock-public_view-vertical" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@vertical+block@c875e6d6776d4a778b09a29ed1c3df43" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="vertical" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="VerticalStudentView">
<h2 class="hd hd-2 unit-title">Conclusion</h2>
<div class="vert-mod">
<div class="vert vert-0" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@2ee4ac50269646ee9bf6788b50c6e6d1">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@2ee4ac50269646ee9bf6788b50c6e6d1" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<p>In this section, we gave an overview of how to handle missing data within a dataset. As we've seen, there is a wide variety of methods, each with its pros and cons, for approaching this task. </p>
<p>Practically speaking, the degree to which the choice of method affects your final result depends on how sensitive your underlying dataset is to fluctuations and change, as well as on what you ultimately aim to use the data for. As with many concepts in data science, the difficulty lies less in applying a method to the problem at hand than in deciding which method to use. </p>
<p>The various ways of handling missing data that we've described here are tools for your toolkit; be careful to choose the best tool for the job at hand. </p>
</div>
</div>
<div class="vert vert-1" data-id="block-v1:MITx+HST.953x+3T2020+type@html+block@f1fcf61ba02849faa49eaded0c166282">
<div class="xblock xblock-public_view xblock-public_view-html xmodule_display xmodule_HtmlBlock" data-usage-id="block-v1:MITx+HST.953x+3T2020+type@html+block@f1fcf61ba02849faa49eaded0c166282" data-graded="False" data-runtime-class="LmsRuntime" data-course-id="course-v1:MITx+HST.953x+3T2020" data-block-type="html" data-request-token="98c5e878015d11ef8f73026cc65ec0d9" data-has-score="False" data-runtime-version="1" data-init="XBlockToXModuleShim">
<script type="json/xblock-args" class="xblock-json-init-args">
{"xmodule-type": "HTMLModule"}
</script>
<h3>Key Takeaways</h3>
<p> </p>
<p>Always evaluate the reasons for missingness:</p>
<ul>
<li>Is it MCAR/MAR/MNAR?</li>
<li>What is the proportion of missing data per variable and per record?</li>
<li>Multiple imputation approaches generally perform better than other methods. </li>
<li>Evaluation tools must be used to tailor the imputation methods to a particular dataset.</li>
</ul>
</div>
</div>
</div>
</div>
© All Rights Reserved