<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Papadopoulos Lab</title>
<link>https://papadopoulos-lab.github.io/blog.html</link>
<atom:link href="https://papadopoulos-lab.github.io/blog.xml" rel="self" type="application/rss+xml"/>
<description>A collection of thoughts.
</description>
<generator>quarto-1.7.32</generator>
<lastBuildDate>Thu, 13 Apr 2023 22:00:00 GMT</lastBuildDate>
<item>
  <title>Sparse Inference</title>
  <dc:creator>Richard Aubrey White</dc:creator>
  <link>https://papadopoulos-lab.github.io/post/2023-04-14-sparse-inference/sparse-inference.html</link>
  <description><![CDATA[ 





<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Machine learning and statistical modeling are two important tools in data science that are used to make predictions and infer relationships between variables, respectively. Models for prediction aim to estimate the relationship between inputs and outputs in order to make accurate predictions about new, unseen data. In contrast, models for inference aim to understand the underlying relationships between variables in the data, often in the form of identifying causal relationships.</p>
<p>In this blog post, we will use a simulated dataset to explore models for inference, applying penalized regression to perform feature selection on a binary outcome. Penalized regression is particularly useful when the number of predictors (i.e.&nbsp;independent variables) is large relative to the sample size, or when only a few of the predictors are expected to be associated with the outcome.</p>
<p>We will first investigate a frequentist two-stage approach: fitting a LASSO regression via <code>glmnet</code> and then using the <code>selectiveInference</code> package to perform post-selection inference, which adjusts for the bias introduced by the selection process. We will then investigate a Bayesian approach that places a Laplace prior on the coefficients, the Bayesian counterpart of the LASSO penalty.</p>
</section>
<section id="simulating-a-dataset" class="level2">
<h2 class="anchored" data-anchor-id="simulating-a-dataset">Simulating a Dataset</h2>
<p>We will simulate a dataset with <code>n = 5000</code> people and <code>p = 50</code> variables, where only three of the 50 variables will have an association with the binary outcome, and they will have odds ratios of 4, 3, and 2. The remaining variables will have no association with the outcome.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(data.table)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(magrittr)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb1-6">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span></span>
<span id="cb1-7">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb1-8">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> n)</span>
<span id="cb1-9">beta <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">47</span>))</span>
<span id="cb1-10">prob <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plogis</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> beta)</span>
<span id="cb1-11">y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(n, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, prob)</span>
<span id="cb1-12"></span>
<span id="cb1-13">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(y,x))</span>
<span id="cb1-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(data) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"y"</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"x"</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ncol</span>(x)))</span>
<span id="cb1-15"></span>
<span id="cb1-16">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">model.matrix</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data)[,<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb1-17">y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>y</span></code></pre></div>
</div>
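<p>As a quick sanity check on the simulation, the sketch below re-creates the data with the same seed and parameters and fits an unpenalized logistic regression to the three truly associated predictors only. This is an "oracle" check that assumes we already know which predictors matter, so it is not part of the analysis itself; the names <code>prevalence</code>, <code>check_fit</code> and <code>odds_ratios</code> are illustrative.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># re-create the simulated data with the same seed and parameters as above
set.seed(123)
n &lt;- 5000
p &lt;- 50
x &lt;- matrix(rnorm(n * p), nrow = n)
beta &lt;- c(log(4), log(3), log(2), rep(0, 47))
y &lt;- rbinom(n, 1, plogis(x %*% beta))

# outcome prevalence: with no intercept and symmetric predictors it should be close to 0.5
prevalence &lt;- mean(y)
prevalence

# oracle check: unpenalized logistic regression on the three true predictors only
check_fit &lt;- glm(y ~ x[, 1:3], family = binomial())
odds_ratios &lt;- exp(coef(check_fit)[-1])

# the estimates should be in the neighbourhood of the simulated odds ratios 4, 3 and 2
round(odds_ratios, 2)</code></pre></div>
</div>
<p>If the estimated odds ratios were far from 4, 3, and 2, that would point to a mistake in the data-generating code rather than in any of the later modelling steps.</p>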
</section>
<section id="lasso-regression" class="level2">
<h2 class="anchored" data-anchor-id="lasso-regression">LASSO Regression</h2>
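<p>We will now fit a LASSO regression model using the <code>glmnet</code> package in R, choosing the penalty by cross-validation with <code>cv.glmnet</code>. LASSO is a popular method for feature selection in high-dimensional data, where the number of predictors <code>p</code> is much larger than the number of observations <code>n</code>. It is also useful in settings like ours, where <code>n</code> is larger than <code>p</code> but most predictors are expected to have no association with the outcome.</p>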
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get standard deviation of X, because we will need to standardize/scale it outside of glmnet</span></span>
<span id="cb2-2">sds <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(x, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, sd)</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># standardize x</span></span>
<span id="cb2-5">x_scaled <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale</span>(x,<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>,<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># run glmnet</span></span>
<span id="cb2-8">cfit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> glmnet<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cv.glmnet</span>(x_scaled,y,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">standardize=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"binomial"</span>)</span>
<span id="cb2-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(cfit)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>51 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) 0.008303975
x1          1.081745656
x2          0.821477348
x3          0.452041341
x4          .          
x5          .          
x6          .          
x7          .          
x8          .          
x9          .          
x10         .          
x11         .          
x12         .          
x13         .          
x14         .          
x15         .          
x16         .          
x17         .          
x18         .          
x19         .          
x20         .          
x21         .          
x22         .          
x23         .          
x24         .          
x25         .          
x26         .          
x27         .          
x28         .          
x29         .          
x30         .          
x31         .          
x32         .          
x33         .          
x34         .          
x35         .          
x36         .          
x37         .          
x38         .          
x39         .          
x40         .          
x41         .          
x42         .          
x43         .          
x44         .          
x45         .          
x46         .          
x47         .          
x48         .          
x49         .          
x50         .          </code></pre>
</div>
</div>
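<p>When <code>coef()</code> is called on a <code>cv.glmnet</code> object without an <code>s</code> argument, it reports the coefficients at <code>lambda.1se</code>, the more heavily penalized of the two cross-validated choices. The sketch below, on a small separate simulation whose object names (<code>x_demo</code>, <code>y_demo</code>, <code>cv_demo</code>) are illustrative, compares that default with <code>lambda.min</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(glmnet)

# small illustrative dataset: 10 standardized predictors, only the first two matter
set.seed(1)
n_demo &lt;- 1000
x_demo &lt;- scale(matrix(rnorm(n_demo * 10), nrow = n_demo))
y_demo &lt;- rbinom(n_demo, 1, plogis(x_demo[, 1:2] %*% c(1, 0.5)))

cv_demo &lt;- cv.glmnet(x_demo, y_demo, family = "binomial", standardize = FALSE)

# coefficients at the two cross-validated penalties
beta_1se &lt;- as.matrix(coef(cv_demo, s = "lambda.1se"))
beta_min &lt;- as.matrix(coef(cv_demo, s = "lambda.min"))

# lambda.1se is never smaller than lambda.min
c(lambda_min = cv_demo$lambda.min, lambda_1se = cv_demo$lambda.1se)

# number of non-zero coefficients (intercept excluded) under each choice
c(min = sum(beta_min[-1, 1] != 0), one_se = sum(beta_1se[-1, 1] != 0))</code></pre></div>
</div>
<p>Because <code>cv.glmnet</code> assigns cross-validation folds at random, the exact penalties and the selected set can change with the seed.</p>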
</section>
<section id="no-confidence-intervals" class="level2 callout-info">
<h2 class="anchored" data-anchor-id="no-confidence-intervals">No confidence intervals</h2>
<p>Note that LASSO regression on its own does not provide confidence intervals or p-values, only coefficient estimates.</p>
</section>
<section id="inference" class="level3">
<h3 class="anchored" data-anchor-id="inference">Inference</h3>
<p>LASSO regression is a popular method for variable selection in high-dimensional datasets. It shrinks some coefficients to zero, allowing us to select only a subset of variables that have the strongest association with the outcome. However, LASSO does not provide confidence intervals or p-values for the selected variables.</p>
<p>The reason for this is that LASSO performs variable selection by penalizing the likelihood function, not by explicitly testing the significance of each variable. Therefore, we cannot use traditional methods to compute confidence intervals or p-values. Instead, we need to use methods that are specifically designed for post-selection inference.</p>
<p>One such method is provided in the R package <code>selectiveInference</code>. The function <code>fixedLassoInf</code> provides confidence intervals and p-values for the LASSO-selected variables at a fixed value of the penalty <code>lambda</code>, accounting for the fact that variable selection was performed. The overall procedure has two stages. In the first stage, LASSO selects a subset of variables. In the second stage, <code>selectiveInference</code> performs inference on the selected variables, adjusting for the selection procedure.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Remember to account for decisions taken at all stages of your modelling process.
</div>
</div>
<div class="callout-body-container callout-body">
<p>It is important to use <code>selectiveInference</code> rather than naively using simpler (but incorrect) methods that do not take into account the two-stage process.</p>
<p>A simpler (but incorrect) method involves first selecting variables with LASSO and then fitting a traditional logistic regression on the selected variables. This can lead to biased estimates because the second-stage regression treats the selected variables as if they had been chosen in advance, ignoring the fact that the same data were already used to choose them. As a result, the p-values tend to be too small and the confidence intervals too narrow, giving over-optimistic estimates of the significance of the selected variables.</p>
<p>By using <code>selectiveInference</code>, we can account for the selection process and obtain p-values and confidence intervals that remain valid conditional on the variables that were selected.</p>
</div>
</div>
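<p>For contrast, the naive two-stage procedure described in the warning above might look like the sketch below. It is shown only to make the pitfall concrete and should not be used for inference. The data are a small pure-noise simulation in which the outcome is unrelated to every predictor, and all object names (<code>x_demo</code>, <code>selected</code>, <code>naive_fit</code>) are illustrative.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(glmnet)

# pure-noise data: the outcome is independent of all 30 predictors
set.seed(2)
n_demo &lt;- 500
p_demo &lt;- 30
x_demo &lt;- scale(matrix(rnorm(n_demo * p_demo), nrow = n_demo))
colnames(x_demo) &lt;- paste0("x", seq_len(p_demo))
y_demo &lt;- rbinom(n_demo, 1, 0.5)

# stage 1: LASSO selection (lambda.min is used so that noise variables are more likely to enter)
cv_demo &lt;- cv.glmnet(x_demo, y_demo, family = "binomial", standardize = FALSE)
beta_hat &lt;- as.matrix(coef(cv_demo, s = "lambda.min"))[-1, 1]
selected &lt;- names(beta_hat)[beta_hat != 0]

# stage 2 (naive): ordinary logistic regression on the selected columns only,
# which ignores that the same data were used to pick those columns
if (length(selected) &gt; 0) {
  naive_fit &lt;- glm(y_demo ~ x_demo[, selected, drop = FALSE], family = binomial())
  naive_p &lt;- summary(naive_fit)$coefficients[-1, "Pr(&gt;|z|)"]
  print(round(naive_p, 3))
} else {
  message("No variables were selected in this draw")
}</code></pre></div>
</div>
<p>Any variable that survives stage 1 here does so by chance, yet the stage 2 p-values are computed as if the variable had been specified before looking at the data. Repeating this over many seeds is one way to see how often the naive p-values fall below 0.05 when no true association exists.</p>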
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute fixed lambda p-values and selection intervals</span></span>
<span id="cb4-2">out <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> selectiveInference<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fixedLassoInf</span>(</span>
<span id="cb4-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> x_scaled,</span>
<span id="cb4-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> y,</span>
<span id="cb4-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">beta =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(cfit),</span>
<span id="cb4-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lambda =</span> cfit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>lambda<span class="fl" style="color: #AD0000;
background-color: null;
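font-style: inherit;">.1</span>se, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NB: glmnet's lambda refers to a loss scaled by 1/n; the selectiveInference documentation defines lambda on the unscaled loss</span></span>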
<span id="cb4-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,</span>
<span id="cb4-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"binomial"</span></span>
<span id="cb4-9">)</span>
<span id="cb4-10"></span>
<span id="cb4-11">retval <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb4-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">var =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>vars),</span>
<span id="cb4-13">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Odds_Ratio =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>coef0<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>sds[out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>vars]),</span>
<span id="cb4-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">LowConf =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ci[,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>sds[out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>vars]),</span>
<span id="cb4-15">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">UpperConf =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ci[,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>sds[out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>vars]),</span>
<span id="cb4-16">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pval =</span> out<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pv</span>
<span id="cb4-17">)</span>
<span id="cb4-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(retval) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span></span>
<span id="cb4-19">retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>var[retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pval <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*"</span>, retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>var[retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pval <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>])</span>
<span id="cb4-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(retval) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb4-21">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Variable"</span>,</span>
<span id="cb4-22">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Odds ratio"</span>,</span>
<span id="cb4-23">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Conf int 2.5%"</span>,</span>
<span id="cb4-24">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Conf int 97.5%"</span>,</span>
<span id="cb4-25">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pvalue"</span></span>
<span id="cb4-26">)</span>
<span id="cb4-27">retval</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  Variable Odds ratio Conf int 2.5% Conf int 97.5% Pvalue
1      *x1   3.761571      3.471404       4.073621      0
2      *x2   2.793509      2.594666       3.007784      0
3      *x3   1.848946      1.727840       1.979062      0</code></pre>
</div>
</div>
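<p>The table above divides each coefficient by the corresponding entry of <code>sds</code> before exponentiating. That step converts a log odds ratio per standard deviation (the scale of <code>x_scaled</code>) back to a log odds ratio per unit of the original variable. A minimal sketch of the same back-transformation, using an ordinary unpenalized <code>glm</code> on a small simulation with illustrative names:</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(3)
n_demo &lt;- 2000
x_raw &lt;- rnorm(n_demo, mean = 10, sd = 4)    # predictor on its original scale
y_demo &lt;- rbinom(n_demo, 1, plogis(-2 + 0.2 * x_raw))

x_std &lt;- as.numeric(scale(x_raw))            # centred and divided by sd(x_raw)

fit_raw &lt;- glm(y_demo ~ x_raw, family = binomial())
fit_std &lt;- glm(y_demo ~ x_std, family = binomial())

# coefficient per standard deviation, divided by the sd, gives the coefficient per unit
slope_raw &lt;- unname(coef(fit_raw)["x_raw"])
slope_back &lt;- unname(coef(fit_std)["x_std"]) / sd(x_raw)

# the two odds ratios agree up to numerical tolerance
c(per_unit = exp(slope_raw), back_transformed = exp(slope_back))</code></pre></div>
</div>
<p>The same division applies to the interval endpoints, because rescaling a predictor rescales its coefficient and its interval by the same factor.</p>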
</section>
<section id="bayesian-logistic-regression-using-rstanarm" class="level2">
<h2 class="anchored" data-anchor-id="bayesian-logistic-regression-using-rstanarm">Bayesian Logistic Regression using <code>rstanarm</code></h2>
<p>Another way to perform inference on a logistic regression model with feature selection is through Bayesian methods. In particular, we can use the <code>rstanarm</code> R package to fit a Bayesian logistic regression model with a Laplace prior.</p>
</section>
<section id="laplace-prior" class="level2 callout-info">
<h2 class="anchored" data-anchor-id="laplace-prior">Laplace prior</h2>
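<p>The Laplace prior promotes sparsity by concentrating prior probability mass on coefficient values close to zero while keeping heavier tails than a normal prior. It is closely related to LASSO regression: the LASSO estimate is the posterior mode under independent Laplace priors on the coefficients <span class="citation" data-cites="Tibshirani1996">(Tibshirani 1996)</span>. Unlike LASSO, the posterior summaries reported below do not set any coefficient to exactly zero.</p>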
</section>
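<p>To illustrate how a Laplace prior distributes its mass, the sketch below compares a Laplace density with a normal density of the same variance. <code>dlaplace</code> is a small helper defined here for illustration (it is not a base R function), and the scale <code>b = 1</code> is an arbitrary choice. The negative log of the Laplace density is proportional to the absolute value of the coefficient plus a constant, which is where the connection to the LASSO penalty comes from.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># density of a Laplace (double-exponential) distribution with location 0 and scale `scale`
dlaplace &lt;- function(x, scale = 1) {
  exp(-abs(x) / scale) / (2 * scale)
}

b &lt;- 1
sd_matched &lt;- sqrt(2) * b   # a Laplace(0, b) variable has variance 2 * b^2

# evaluate both densities at the centre, in the middle, and in the tail
grid &lt;- c(0, 0.5, 1, 2, 4, 6)
comparison &lt;- data.frame(
  x = grid,
  laplace = dlaplace(grid, scale = b),
  normal = dnorm(grid, mean = 0, sd = sd_matched)
)
comparison</code></pre></div>
</div>
<p>With matched variances, the Laplace density is higher than the normal density both at zero and far out in the tail, and lower in between. This is the sense in which the prior pulls small coefficients towards zero while leaving large ones comparatively untouched.</p>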
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">options</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mc.cores =</span> parallel<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">detectCores</span>())</span>
<span id="cb6-2"></span>
<span id="cb6-3">fit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> rstanarm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stan_glm</span>(</span>
<span id="cb6-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> .,</span>
<span id="cb6-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data,</span>
<span id="cb6-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">binomial</span>(),</span>
<span id="cb6-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> rstanarm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">laplace</span>(),</span>
<span id="cb6-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">chains =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb6-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iter =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>,</span>
<span id="cb6-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb6-11">)</span>
<span id="cb6-12"></span>
<span id="cb6-13">retval <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb6-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">var =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(fit)),</span>
<span id="cb6-15">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Odds_Ratio =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(fit)),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb6-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(rstanarm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_interval</span>(fit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prob =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>)),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb6-17">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pvalue_equivalent =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(bayestestR<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pd_to_p</span>(bayestestR<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">p_direction</span>(fit)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pd),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb6-18">)</span>
<span id="cb6-19"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(retval) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span></span>
<span id="cb6-20">retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>var[retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pvalue_equivalent <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*"</span>, retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>var[retval<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pvalue_equivalent <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>])</span>
<span id="cb6-21"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(retval) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb6-22">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Variable"</span>,</span>
<span id="cb6-23">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Odds ratio"</span>,</span>
<span id="cb6-24">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cred int 5%"</span>,</span>
<span id="cb6-25">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cred int 95%"</span>,</span>
<span id="cb6-26">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pvalue equivalent"</span></span>
<span id="cb6-27">)</span>
<span id="cb6-28">retval</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>      Variable Odds ratio Cred int 5% Cred int 95% Pvalue equivalent
1  (Intercept)      1.009       0.949        1.073              0.81
2          *x1      4.049       3.744        4.385              0.00
3          *x2      2.968       2.767        3.193              0.00
4          *x3      1.932       1.815        2.058              0.00
5           x4      0.953       0.899        1.011              0.18
6           x5      1.018       0.961        1.079              0.62
7           x6      0.999       0.942        1.060              0.98
8           x7      0.984       0.927        1.044              0.66
9           x8      0.974       0.918        1.034              0.46
10          x9      0.961       0.904        1.019              0.26
11         x10      0.998       0.941        1.058              0.96
12         x11      1.042       0.983        1.104              0.26
13         x12      0.962       0.908        1.020              0.29
14         x13      1.039       0.979        1.103              0.29
15         x14      1.004       0.944        1.066              0.91
16         x15      0.998       0.940        1.059              0.96
17         x16      0.960       0.905        1.018              0.25
18         x17      1.031       0.972        1.095              0.39
19         x18      0.958       0.901        1.018              0.25
20         x19      0.993       0.936        1.054              0.84
21         x20      1.007       0.949        1.069              0.86
22         x21      0.984       0.930        1.042              0.65
23         x22      1.042       0.984        1.106              0.25
24         x23      0.958       0.902        1.017              0.24
25         x24      0.996       0.938        1.057              0.92
26         x25      1.040       0.980        1.105              0.28
27         x26      0.980       0.925        1.039              0.57
28         x27      1.038       0.980        1.101              0.29
29         x28      1.013       0.958        1.072              0.71
30         x29      1.024       0.964        1.085              0.52
31         x30      1.007       0.951        1.068              0.84
32        *x31      0.904       0.851        0.961              0.00
33         x32      1.026       0.968        1.085              0.47
34         x33      1.032       0.972        1.097              0.39
35         x34      1.001       0.943        1.063              0.99
36         x35      1.007       0.950        1.065              0.85
37         x36      1.049       0.989        1.113              0.18
38         x37      1.023       0.965        1.086              0.52
39         x38      1.060       1.000        1.123              0.10
40         x39      1.022       0.961        1.085              0.56
41         x40      1.030       0.969        1.093              0.41
42         x41      0.985       0.926        1.045              0.67
43         x42      1.007       0.949        1.067              0.85
44         x43      1.026       0.967        1.089              0.47
45         x44      0.947       0.892        1.005              0.13
46         x45      1.045       0.983        1.109              0.24
47         x46      1.012       0.953        1.075              0.74
48         x47      1.005       0.947        1.066              0.90
49         x48      0.954       0.899        1.011              0.18
50         x49      0.960       0.904        1.017              0.25
51         x50      0.969       0.912        1.027              0.38</code></pre>
</div>
</div>
<section id="p-value-equivalent" class="level2 callout-info">
<h2 class="anchored" data-anchor-id="p-value-equivalent">P-value equivalent</h2>
<p>Probability of Direction (PoD) and p-values are both statistical measures used to assess whether an effect exists <span class="citation" data-cites="Makowski2019bayestestR">Makowski et al. (2019)</span>. PoD is the proportion of posterior draws from a Bayesian model that lie in a given direction (for example, above zero for a coefficient, or above one for an odds ratio). A PoD close to 1 (or, for the opposite direction, close to 0) indicates strong evidence that the effect has a definite sign, while a PoD near 0.5 indicates that the direction of the effect is uncertain.</p>
<p>Similarly, a p-value measures the probability of obtaining a test statistic as extreme as or more extreme than the observed value, assuming that the null hypothesis is true. A low p-value indicates that data this extreme would be unlikely if the null hypothesis were true, providing evidence against the null hypothesis.</p>
<p>To convert PoD to a p-value equivalent, one approach is to use the following formula:</p>
<p><code>p-value = 2 * min(PoD, 1 - PoD)</code></p>
<p>This formula assumes a two-tailed test of the null hypothesis that the effect size is equal to zero. The resulting value is not a frequentist p-value in the strict sense, but it can be read on the same scale: a small value means that nearly all of the posterior draws lie on one side of zero.</p>
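<p>As a minimal sketch of this conversion, the following R code computes a PoD and its p-value equivalent from a vector of posterior draws. The draws are simulated here purely for illustration, and the object names (<code>draws</code>, <code>pod</code>, <code>pvalue_equivalent</code>) are hypothetical; in practice the draws would come from the fitted Bayesian model. The <code>bayestestR</code> package cited above also provides <code>p_direction()</code> and <code>pd_to_p()</code> for related calculations.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical posterior draws of a log-odds coefficient (simulated for illustration only)
set.seed(1)
draws &lt;- rnorm(4000, mean = 0.5, sd = 0.3)

# PoD: proportion of posterior draws above zero
pod &lt;- mean(draws &gt; 0)

# Two-sided p-value equivalent, following the formula above
pvalue_equivalent &lt;- 2 * min(pod, 1 - pod)

c(pod = pod, pvalue_equivalent = pvalue_equivalent)</code></pre></div>
</div>
<p>Taking the minimum of <code>pod</code> and <code>1 - pod</code> means the same code works whether most of the posterior mass lies above or below zero.</p>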
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>This post discussed the limitations of using LASSO (Least Absolute Shrinkage and Selection Operator) models for statistical inference, particularly in situations where the number of predictors (i.e.&nbsp;independent variables) is much larger than the sample size. In these cases, the coefficients estimated by LASSO can be highly variable, which can lead to incorrect or unreliable conclusions.</p>
<p>One proposed solution to this problem is to use a two-stage inference approach, where LASSO is first used to select a subset of predictors, and then a separate statistical method (such as ordinary least squares) is used to estimate the coefficients for the selected predictors. However, this two-stage approach can also have limitations, such as a loss of power in the second stage and increased computational complexity.</p>
<p>In contrast, Bayesian statistics offers a one-stage inference approach that can provide more reliable and interpretable results in complex modeling situations. It allows prior knowledge and uncertainty to be incorporated into the model, which can help to reduce variability and improve accuracy. Bayesian methods also provide a framework for model comparison and selection, which can help to identify the most appropriate model for a given dataset.</p>
<p>Overall, while LASSO models can be useful in certain situations, their limitations in high-dimensional data settings highlight the advantages of Bayesian statistics for reliable and interpretable statistical inference.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-Makowski2019Indices" class="csl-entry">
Makowski, Dominique, Matthew S Ben-Shachar, Simon H A Chen, and Daniel Lüdecke. 2019. <span>“Indices of Effect Existence and Significance in the Bayesian Framework.”</span> <em>Frontiers in Psychology</em> 10: 2767. <a href="https://doi.org/10.3389/fpsyg.2019.02767">https://doi.org/10.3389/fpsyg.2019.02767</a>.
</div>
<div id="ref-Makowski2019bayestestR" class="csl-entry">
Makowski, Dominique, Matthew S Ben-Shachar, and Daniel Lüdecke. 2019. <span>“bayestestR: Describing Effects and Their Uncertainty, Existence and Significance Within the Bayesian Framework.”</span> <em>Journal of Open Source Software</em> 4 (40): 1541. <a href="https://doi.org/10.21105/joss.01541">https://doi.org/10.21105/joss.01541</a>.
</div>
<div id="ref-Tibshirani1996" class="csl-entry">
Tibshirani, Robert. 1996. <span>“Regression Shrinkage and Selection via the Lasso.”</span> <em>Journal of the Royal Statistical Society. Series B (Methodological)</em> 58 (1): 267–88. <a href="http://www.jstor.org/stable/2346178">http://www.jstor.org/stable/2346178</a>.
</div>
</div></section></div> ]]></description>
  <guid>https://papadopoulos-lab.github.io/post/2023-04-14-sparse-inference/sparse-inference.html</guid>
  <pubDate>Thu, 13 Apr 2023 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Sampling Bias</title>
  <dc:creator>Richard Aubrey White</dc:creator>
  <link>https://papadopoulos-lab.github.io/post/2023-02-23-sampling-bias/sampling-bias.html</link>
  <description><![CDATA[ 





<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(data.table)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(magrittr)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span></code></pre></div>
</div>
<section id="what-is-sampling-bias" class="level2">
<h2 class="anchored" data-anchor-id="what-is-sampling-bias">What is Sampling Bias?</h2>
<p>Sampling bias occurs when the sample used in a study does not accurately represent the population being studied. This can happen in a number of ways, such as through selection bias, survivorship bias, or measurement bias. When sampling bias is present, it can lead to inaccurate results, incorrect estimates of associations between variables, and incorrect conclusions. This, in turn, can have an impact on public policy decisions, research funding, and clinical practice.</p>
</section>
<section id="types-of-sampling-bias" class="level2">
<h2 class="anchored" data-anchor-id="types-of-sampling-bias">Types of Sampling Bias</h2>
<p>There are several types of sampling bias, including:</p>
<section id="selection-bias" class="level3">
<h3 class="anchored" data-anchor-id="selection-bias">1. Selection Bias</h3>
<p>Selection bias occurs when the selection of study participants is not random or representative of the larger population. This can happen when certain groups are excluded or overrepresented, leading to inaccurate conclusions about the study population.</p>
<p>For example, if a study only recruits participants from a single geographic region, the results may not be generalizable to the larger population. Similarly, if a study only recruits individuals with a certain health condition, the results may not accurately reflect the general population.</p>
</section>
<section id="survivorship-bias" class="level3">
<h3 class="anchored" data-anchor-id="survivorship-bias">2. Survivorship Bias</h3>
<p>Survivorship bias occurs when only the surviving members of a population are included in a study. This can lead to inaccurate conclusions about the population, as those who did not survive may have had different characteristics or experiences.</p>
<p>For example, if a study only includes individuals who survived a specific disease, the results may not be generalizable to everyone who had the disease, including those who did not survive.</p>
</section>
<section id="measurement-bias" class="level3">
<h3 class="anchored" data-anchor-id="measurement-bias">3. Measurement Bias</h3>
<p>Measurement bias occurs when the measurement instruments or techniques used in a study are inaccurate or unreliable. This can result in inaccurate data and misinterpretation of results.</p>
<p>For example, if a study relies on self-reported data, individuals may underreport or overreport certain behaviors, leading to inaccurate conclusions about the study population. Similarly, if a study uses different measurement techniques for different groups, the results may not be comparable and may lead to inaccurate conclusions.</p>
</section>
</section>
<section id="example-of-sampling-bias-in-a-study" class="level2">
<h2 class="anchored" data-anchor-id="example-of-sampling-bias-in-a-study">Example of Sampling Bias in a Study</h2>
<p>To better understand the impact of sampling bias on study results, let’s take a look at an example.</p>
<p>Suppose we want to study the relationship between smoking and lung function. We know that in our city there are 100,000 people, about 20,000 of whom are smokers. We recruit 5,000 people into our study, selecting each smoker with five times the probability of each non-smoker, so that smokers make up just over half of the sample (oversampling the smokers, a type of selection bias). We also collect data on how frequently they exercise, whether they have good genes for lung function, and whether they frequently wear hats.</p>
<p>We now want to overcome our selection bias when assessing the association between the outcome of lung function and the exposures of exercise, good genes, and the frequency of hat wearing.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb2-2"></span>
<span id="cb2-3">d <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.table</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">id =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100000</span>)</span>
<span id="cb2-4">d[, is_smoker <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(.N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)]</span>
<span id="cb2-5">d[, probability_of_exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(is_smoker<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>TRUE, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>)]</span>
<span id="cb2-6">d[, exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(.N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, probability_of_exercises_frequently)]</span>
<span id="cb2-7">d[, has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(.N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)]</span>
<span id="cb2-8">d[, wears_hats_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(.N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)]</span>
<span id="cb2-9"></span>
<span id="cb2-10">d[, lung_function <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> is_smoker <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(.N, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)]</span>
<span id="cb2-11"></span>
<span id="cb2-12">d[, probability_of_selection_uniform <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.N]</span>
<span id="cb2-13"></span>
<span id="cb2-14">d[, probability_of_selection_oversample_smoker <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(is_smoker<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>TRUE, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)]</span>
<span id="cb2-15">d[, probability_of_selection_oversample_smoker <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> probability_of_selection_oversample_smoker<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(probability_of_selection_oversample_smoker)]</span>
<span id="cb2-16"></span>
<span id="cb2-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We have a dataset with oversampled smokers</span></span>
<span id="cb2-18">d_oversampled_smokers <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> d[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>.N, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prob =</span> probability_of_selection_oversample_smoker)]</span>
<span id="cb2-19">(weight_smoker <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(d<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>is_smoker)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(d_oversampled_smokers<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>is_smoker))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0.3700516</code></pre>
</div>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">(weight_non_smoker <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>d<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>is_smoker)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>d_oversampled_smokers<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>is_smoker))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1.747289</code></pre>
</div>
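<p>The two weights printed above can be checked against a back-of-the-envelope calculation. The sketch below assumes exact population shares of 20% and 80%, and assumes that the shares in the sample are proportional to the relative selection odds of 5 to 1; the object names are hypothetical.</p>
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Population shares and relative selection odds used in the simulation
pop_share &lt;- c(smoker = 0.2, non_smoker = 0.8)
rel_odds &lt;- c(smoker = 5, non_smoker = 1)

# Idealized share of each group in the sample
sample_share &lt;- pop_share * rel_odds / sum(pop_share * rel_odds)

# Weight = population share / sample share
expected_weights &lt;- pop_share / sample_share
expected_weights</code></pre></div>
<p>Under these assumptions the weights are 0.36 for smokers and 1.8 for non-smokers, close to the 0.370 and 1.747 obtained above. The small differences likely arise from random variation and from <code>sample()</code> drawing without replacement, which gradually depletes the more heavily sampled group.</p>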
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">d_oversampled_smokers[, weights <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(is_smoker<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>TRUE, weight_smoker, weight_non_smoker)]</span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The real associations:</span></span>
<span id="cb6-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># is_smoker: -10 (also associated with exercises_frequently!)</span></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># exercises_frequently: +5 (also associated with is_smoker!)</span></span>
<span id="cb6-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># has_good_genes: +8 (only associated with outcome, not with other exposures)</span></span>
<span id="cb6-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wears_hats_frequently: 0 (not associated with outcome nor other exposures)</span></span>
<span id="cb6-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(lung_function <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> is_smoker <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> wears_hats_frequently, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>d))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = lung_function ~ is_smoker + exercises_frequently + 
    has_good_genes + wears_hats_frequently, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6549  -2.0112   0.0055   2.0045  12.1835 

Coefficients:
                       Estimate Std. Error  t value Pr(&gt;|t|)    
(Intercept)            30.02288    0.01421 2112.751   &lt;2e-16 ***
is_smoker             -10.05071    0.02427 -414.150   &lt;2e-16 ***
exercises_frequently    4.99532    0.02250  221.992   &lt;2e-16 ***
has_good_genes          7.98073    0.02355  338.831   &lt;2e-16 ***
wears_hats_frequently  -0.01537    0.02360   -0.651    0.515    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.989 on 99995 degrees of freedom
Multiple R-squared:  0.7975,    Adjusted R-squared:  0.7975 
F-statistic: 9.847e+04 on 4 and 99995 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># When we run the model in the full data, excluding is_smoker, we get the following associations:</span></span>
<span id="cb8-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># exercises_frequently: +7.2 (biased from association with is_smoker)</span></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># has_good_genes: +8 (not biased)</span></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wears_hats_frequently: 0 (not biased)</span></span>
<span id="cb8-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(lung_function <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> wears_hats_frequently, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>d))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = lung_function ~ exercises_frequently + has_good_genes + 
    wears_hats_frequently, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.056  -2.670   0.928   3.445  14.532 

Coefficients:
                       Estimate Std. Error  t value Pr(&gt;|t|)    
(Intercept)           2.747e+01  2.109e-02 1302.327   &lt;2e-16 ***
exercises_frequently  7.172e+00  3.605e-02  198.941   &lt;2e-16 ***
has_good_genes        7.958e+00  3.881e-02  205.044   &lt;2e-16 ***
wears_hats_frequently 4.393e-04  3.888e-02    0.011    0.991    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.926 on 99996 degrees of freedom
Multiple R-squared:  0.4502,    Adjusted R-squared:  0.4502 
F-statistic: 2.73e+04 on 3 and 99996 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
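<p>The estimate of roughly 7.2 for <code>exercises_frequently</code> is what the omitted-variable bias formula predicts. As a rough check, assuming the simulation parameters above and ignoring the other covariates (which are independent of smoking and exercise), the expected coefficient is the true effect of exercise plus the effect of smoking multiplied by the difference in smoking prevalence between frequent and infrequent exercisers. The object names are hypothetical.</p>
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Parameters taken from the simulation above
p_smoker &lt;- 0.2
p_exercise_given_smoker &lt;- 0.05
p_exercise_given_non_smoker &lt;- 0.3
effect_smoker &lt;- -10
effect_exercise &lt;- 5

# Overall probability of exercising frequently
p_exercise &lt;- p_smoker * p_exercise_given_smoker +
  (1 - p_smoker) * p_exercise_given_non_smoker

# Smoking prevalence among frequent and infrequent exercisers (Bayes' rule)
p_smoker_given_exercise &lt;- p_smoker * p_exercise_given_smoker / p_exercise
p_smoker_given_no_exercise &lt;- p_smoker * (1 - p_exercise_given_smoker) / (1 - p_exercise)

# Omitted-variable bias: effect of smoking times the difference in prevalence
bias &lt;- effect_smoker * (p_smoker_given_exercise - p_smoker_given_no_exercise)
expected_coefficient &lt;- effect_exercise + bias
expected_coefficient</code></pre></div>
<p>This gives about 7.13, in line with the 7.17 estimated from the simulated population: exercisers are less likely to smoke, so part of the benefit of not smoking is attributed to exercise when <code>is_smoker</code> is left out.</p>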
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># When we run the model in the biased data, with oversampling of smokers (smoking is also associated with the outcome):</span></span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># exercises_frequently: +9.8 (biased from association with is_smoker and the biased sampling)</span></span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># has_good_genes: +7.6 (not biased)</span></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wears_hats_frequently: +0.3 (not biased)</span></span>
<span id="cb10-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(lung_function <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> wears_hats_frequently, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>d_oversampled_smokers))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = lung_function ~ exercises_frequently + has_good_genes + 
    wears_hats_frequently, data = d_oversampled_smokers)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.4131  -4.3613  -0.6571   4.4586  17.3734 

Coefficients:
                      Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)            23.7436     0.1014 234.191   &lt;2e-16 ***
exercises_frequently    9.8631     0.2140  46.095   &lt;2e-16 ***
has_good_genes          7.6164     0.1967  38.726   &lt;2e-16 ***
wears_hats_frequently   0.3363     0.1905   1.765   0.0776 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.501 on 4996 degrees of freedom
Multiple R-squared:  0.4221,    Adjusted R-squared:  0.4218 
F-statistic:  1217 on 3 and 4996 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run the model in the biased data, with weights:</span></span>
<span id="cb12-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># exercises_frequently: +7.4 (biased from association with is_smoker)</span></span>
<span id="cb12-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># has_good_genes: +7.6 (not biased)</span></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wears_hats_frequently: +0.3 (not biased)</span></span>
<span id="cb12-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(lung_function <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> wears_hats_frequently, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>d_oversampled_smokers, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weights =</span> weights))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = lung_function ~ exercises_frequently + has_good_genes + 
    wears_hats_frequently, data = d_oversampled_smokers, weights = weights)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-12.537  -4.885  -2.767   2.077  18.233 

Coefficients:
                      Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)           27.33919    0.09322 293.291   &lt;2e-16 ***
exercises_frequently   7.40799    0.16055  46.140   &lt;2e-16 ***
has_good_genes         7.60104    0.17490  43.459   &lt;2e-16 ***
wears_hats_frequently  0.26672    0.16734   1.594    0.111    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.859 on 4996 degrees of freedom
Multiple R-squared:   0.45, Adjusted R-squared:  0.4497 
F-statistic:  1363 on 3 and 4996 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run the model in the biased data, with is_smoker:</span></span>
<span id="cb14-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># is_smoker: -9.9 (not biased)</span></span>
<span id="cb14-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># exercises_frequently: +5.3 (not biased)</span></span>
<span id="cb14-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># has_good_genes: +7.8 (not biased)</span></span>
<span id="cb14-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wears_hats_frequently: +0.2 (not biased)</span></span>
<span id="cb14-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(lung_function <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> is_smoker <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> exercises_frequently <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> has_good_genes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> wears_hats_frequently, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>d_oversampled_smokers))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = lung_function ~ is_smoker + exercises_frequently + 
    has_good_genes + wears_hats_frequently, data = d_oversampled_smokers)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6202  -1.9903  -0.0699   1.9855  11.1052 

Coefficients:
                      Estimate Std. Error  t value Pr(&gt;|t|)    
(Intercept)           29.83110    0.07786  383.149   &lt;2e-16 ***
is_smoker             -9.88043    0.08978 -110.051   &lt;2e-16 ***
exercises_frequently   5.25052    0.12300   42.688   &lt;2e-16 ***
has_good_genes         7.79706    0.10630   73.350   &lt;2e-16 ***
wears_hats_frequently  0.15569    0.10296    1.512    0.131    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.973 on 4995 degrees of freedom
Multiple R-squared:  0.8313,    Adjusted R-squared:  0.8311 
F-statistic:  6152 on 4 and 4995 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Bias introduced by a non-representative sample can be corrected by either:</p>
<ul>
<li>Using sample weights</li>
<li>Including the sampling variables as covariates in the regression model</li>
</ul>
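<p>The two options can be compared side by side in a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the simulated population, the effect sizes, the sampling probabilities, and the names <code>pop</code>, <code>d</code>, <code>p_sampled</code> and <code>w</code> are hypothetical and are not the data or weights used earlier in this post. It assumes the sampling probabilities are known, so that each sampled row can be weighted by the inverse of its probability of being sampled.</p>
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical population (all numbers are assumptions for illustration).
set.seed(4)
n = 100000
is_smoker = rbinom(n, 1, 0.2)
# Smokers exercise less often, so exercise is associated with smoking.
exercises_frequently = rbinom(n, 1, ifelse(is_smoker == 1, 0.1, 0.4))
lung_function = 30 - 10 * is_smoker + 5 * exercises_frequently + rnorm(n, sd = 3)
pop = data.frame(is_smoker, exercises_frequently, lung_function)

# Biased sampling: smokers are four times as likely to be sampled.
p_sampled = ifelse(pop$is_smoker == 1, 0.20, 0.05)
d = pop[rbinom(n, 1, p_sampled) == 1, ]

# Option 1: inverse-probability sample weights.
d$w = 1 / ifelse(d$is_smoker == 1, 0.20, 0.05)

# Coefficient of exercises_frequently under each approach.
term = "exercises_frequently"
b_population = coef(lm(lung_function ~ exercises_frequently, data = pop))[term]
b_unweighted = coef(lm(lung_function ~ exercises_frequently, data = d))[term]
b_weighted = coef(lm(lung_function ~ exercises_frequently, data = d, weights = w))[term]
# Option 2: include the sampling variable as a covariate.
b_adjusted = coef(lm(lung_function ~ is_smoker + exercises_frequently, data = d))[term]

round(c(population = unname(b_population),
        unweighted = unname(b_unweighted),
        weighted = unname(b_weighted),
        adjusted = unname(b_adjusted)), 2)</code></pre></div>
<p>In this sketch the two corrections answer slightly different questions. The weighted fit should approximate what the same model would give in the full population, which still mixes the effect of exercise with that of smoking because <code>is_smoker</code> is left out. The adjusted fit should approximate the effect of exercise with smoking held fixed (simulated here as 5).</p>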


</section>

 ]]></description>
  <guid>https://papadopoulos-lab.github.io/post/2023-02-23-sampling-bias/sampling-bias.html</guid>
  <pubDate>Wed, 22 Feb 2023 23:00:00 GMT</pubDate>
</item>
</channel>
</rss>
