Sort of OK first draft
JohnnyDoorn committed Oct 19, 2023
1 parent 01f8814 commit be95d5f
Showing 58 changed files with 384 additions and 58 deletions.
4 changes: 2 additions & 2 deletions 02-models.qmd
@@ -2,9 +2,9 @@

Before we dive into the analysis of the beer tasting experiment, we need to define some key components. First of all, the concept of a *statistical model*. A statistical model is a combination of a general statistical model (e.g., the binomial model) and a statement about a parameter value that together describe a certain phenomenon. For instance, we can model the flipping of a fair coin with the binomial model, where the probability parameter $\theta$ ("theta") is set to $0.5$. Or, we can model the height of Dutch men (in cm) with a normal model, where the location parameter $\mu = 183$ and the dispersion parameter $\sigma = 5$. A statistical model can therefore also be seen as a hypothesis: a specific statement about the value of the model parameters.
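
To make these two example models concrete, here is a small R sketch (an editorial illustration, not part of the chapter's own code); the specific outcomes being predicted (7 heads out of 10 flips, and a height of 190 cm) are arbitrary choices:

```r
# Binomial model of a fair coin: theta = 0.5
theta <- 0.5
dbinom(7, size = 10, prob = theta)   # predicted probability of 7 heads in 10 flips, ~0.12

# Normal model of the height of Dutch men (in cm): mu = 183, sigma = 5
mu    <- 183
sigma <- 5
dnorm(190, mean = mu, sd = sigma)    # predicted density at a height of 190 cm, ~0.03
```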

-## Models Make Predictions
+## Models Make Predictions {#sec-models-make-predictions}

-An essential property of a statistical model is that it can make predictions about the real world. We can use the accuracy of these predictions to gauge the quality/plausibility of a model, relative to another model. For instance, Sarah thinks that the probability of heads in a coin flip is 50% (i.e., $H_S: \theta = 0.5$), while Paul claims that the coin has been tampered with, and that the probability of heads is 80% (i.e., $H_P: \theta = 0.8$). Here, Sarah and Paul postulate different models/hypotheses. They are both [binomial models](https://en.wikipedia.org/wiki/Binomial_distribution), which is the general statistical model for describing a series of chance-based events with a binary outcome (e.g., coin flip, red/black in roulette, whether a random person from the population has a certain disease or not, or someone identifying the alcholic beer). Where Sarah and Paul differ, however, is their claim about the specific value of the $\theta$ parameter. In the remainder of this text, we will be refering to model to mean such a combination of general statistical model, and claim about the value of the model parameter (i.e., hypothesis).
+An essential property of a statistical model is that it can make predictions about the real world. We can use the accuracy of these predictions to gauge the quality/plausibility of a model, relative to another model. For instance, Sarah thinks that the probability of heads in a coin flip is 50% (i.e., $H_S: \theta = 0.5$), while Paul claims that the coin has been tampered with, and that the probability of heads is 80% (i.e., $H_P: \theta = 0.8$). Here, Sarah and Paul postulate different models/hypotheses. They are both [binomial models](https://en.wikipedia.org/wiki/Binomial_distribution), which is the general statistical model for describing a series of chance-based events with a binary outcome (e.g., coin flip, red/black in roulette, whether a random person from the population has a certain disease or not, or someone identifying the alcoholic beer). Where Sarah and Paul differ, however, is their claim about the specific value of the $\theta$ parameter. In the remainder of this text, we will be referring to a *model* as such a combination of a general statistical model and a claim about the value of the model parameter (i.e., a hypothesis).
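
As a rough sketch of how these predictions can be compared in R (the outcome of 8 heads in 10 flips is borrowed from the estimation chapter purely for illustration):

```r
heads <- 8
flips <- 10

p_sarah <- dbinom(heads, flips, prob = 0.5)  # H_S: theta = 0.5, ~0.044
p_paul  <- dbinom(heads, flips, prob = 0.8)  # H_P: theta = 0.8, ~0.302

# Paul's model predicted this outcome roughly 7 times better than Sarah's
p_paul / p_sarah
```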

```{r two-models-binomial, fig.cap='Two models for a coin toss. The arrows indicate what each of the models postulate: both postulate a single value for theta.', fig.align='center', out.width='90%', echo = FALSE, cache=FALSE}
#| label: fig-two-models-binomial
14 changes: 6 additions & 8 deletions 03-estimation.qmd
@@ -2,15 +2,13 @@

In the previous chapter, we saw that people/models can have different beliefs/hypotheses about a phenomenon. Sarah was quite certain that the true probability of the coin landing heads was $0.5$, whereas David believed that the coin was biased towards heads, assigning more mass to higher values of $\theta$. We saw how we can use observed data to test which of the models predicted those data best, using the Bayes factor. In this chapter, we will look at how individual models update their beliefs as a result of observing the data. In doing so, this chapter will illustrate the core Bayesian concept of starting with **prior** knowledge/beliefs, updating this knowledge with observed data, and ending up with **posterior** knowledge/beliefs about a parameter.

-The following formula reflects this process:
-\begin{align}
+The following formula reflects this process: \begin{align}
\label{eq-binomial-estimation}
\underbrace{ p(\theta \mid \text{data})}_{\substack{\text{Posterior beliefs}}} \,\,\, = \,\,\,
\underbrace{ p(\theta)}_{\substack{\text{Prior beliefs} }}
\,\,\,\, \times
\overbrace{\underbrace{\frac{p( \text{data} \mid \theta)}{p( \text{data})}}}^{\substack{\text{Prediction for specific }\theta }}_{\substack{\text{Average prediction} \\\text{across all } \theta's}}.
-\end{align}
-We have prior beliefs, which are updated by an **updating factor**, to form posterior beliefs. The updating factor indicates for each possible value of $\theta$, how well it predicted the observed data, relative to all other possible values of $\theta$. If this is still sounding rather vague, don't worry - in this section we will demonstrate how this updating factor operates. However, before we discuss the updating factor, we go back to the start, and discuss the concept of prior knowledge/beliefs.
+\end{align} We have prior beliefs, which are updated by an **updating factor**, to form posterior beliefs. The updating factor indicates, for each possible value of $\theta$, how well it predicted the observed data relative to all other possible values of $\theta$. If this still sounds rather vague, don't worry - in this section we will demonstrate how this updating factor operates. However, before we discuss the updating factor, we go back to the start and discuss the concept of prior knowledge/beliefs.
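
The formula can be illustrated with a small grid-style R sketch (the discrete grid of $\theta$ values, the uniform prior, and the outcome of 8 heads in 10 flips are all illustrative assumptions here):

```r
theta <- seq(0, 1, by = 0.1)             # candidate values of theta
prior <- rep(1, length(theta))           # prior beliefs (uniform here)
prior <- prior / sum(prior)

heads <- 8
flips <- 10
likelihood <- dbinom(heads, flips, theta)     # p(data | theta) for each value

marginal  <- sum(prior * likelihood)          # p(data): prior-weighted average prediction
posterior <- prior * likelihood / marginal    # posterior beliefs; these sum to 1

round(cbind(theta, prior, posterior), 3)
```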

```{r, echo = FALSE}
tBetFun <- function(x, shape1 = 1, shape2 = 1, side = "pos") {
@@ -73,9 +71,9 @@ As you can see, the likelihood is the greatest for $\theta = 0.8$. This makes se

It is important to note that the likelihood is **not** a probability distribution: the area under it does not equal 1, so we cannot use it to make probabilistic statements about the parameter (for that we use the posterior distribution, discussed at the end of this section).
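
A quick numerical check of this point (using the illustrative outcome of 8 heads in 10 flips; the Beta(9, 3) density serves only as an example of a proper distribution on [0, 1]):

```r
theta <- seq(0, 1, length.out = 1001)
width <- theta[2] - theta[1]

sum(dbinom(8, 10, theta)) * width   # area under the likelihood curve: about 0.09
sum(dbeta(theta, 9, 3))   * width   # area under a proper density: about 1
```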

-### The Marginal Likelihood
+### The Marginal Likelihood {#sec-marginal-likelihood}

-Now we can take a look at the second part of the updating factor: $p( \text{data})$. This part is known as the **marginal likelihood**. In contrast to the first part, the marginal likelihood is a single number. Namely, it is the average of all the likelihoods, where the likelihood of each value is weighted by the prior belief placed on that value by the model. This is the same procedure as in the previous chapter (see @sec-more-models). In fact, for all of the models, the marginal likelihood was indicated by the yellow bar: it is the likelihood of the observed data, weighted by each model's specific beliefs. In the case of Alex' model, the prior belief is equal across all values, so the marginal likelihood is "simply" the average likelihood.[^03-estimation-3] In this case, the marginal likelihood is equal to `r round(1/11, 4)` - precisely the height of the yellow bar in @fig-uninformed-model-binomial-prediction.
+Now we can take a look at the second part of the updating factor: $p( \text{data})$. This part is known as the **marginal likelihood**. In contrast to the first part, the marginal likelihood is a single number. Namely, it is the average of all the likelihoods, where the likelihood of each value is weighted by the prior belief placed on that value by the model. This is the same procedure as in the previous chapter (see @sec-more-models). In fact, for all of the models, the marginal likelihood was indicated by the yellow bar: it is the likelihood of the observed data, weighted by each model's specific beliefs. In the case of Alex's model, the prior belief is equal across all values, so the marginal likelihood is "simply" the average likelihood.[^03-estimation-3] In this case, the marginal likelihood is equal to `r round(1/11, 4)` - precisely the height of the yellow bar in @fig-uninformed-model-binomial-prediction.
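
As a sketch of that computation for Alex (assuming his uninformed prior is the continuous uniform distribution on [0, 1], i.e., a Beta(1, 1) distribution):

```r
# Marginal likelihood: the likelihood averaged over the prior,
# p(data) = integral of p(theta) * p(data | theta) over theta
marg_lik_alex <- integrate(function(theta) dbeta(theta, 1, 1) * dbinom(8, 10, theta),
                           lower = 0, upper = 1)$value

round(marg_lik_alex, 4)   # 0.0909, i.e. 1/11: the height of the yellow bar
```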

[^03-estimation-3]: Technical sidenote: *integrating* the likelihood over $\theta$ gives the marginal likelihood for Alex.

@@ -186,7 +184,7 @@ Since Sarah's prior density for any value that is not 0.5, is equal to 0, that m

[^03-estimation-6]: I feel like there is a real-life analogue to be found here somewhere...

-Let's define the marginal likelihood once more: Sarah has prior beliefs about $\theta$, reflected by the prior mass in @fig-sarah-model-binomial. Each of these values has a certain match with the observed data (i.e., the likelihood). The marginal likelihood is then the average of all those likelihoods, weighted by the prior mass assigned. This weighting by prior mass makes each model's marginal likelihood different from each other, because each has their own unique prior beliefs. If we were to look at the updating factor for Sarah, we would see that Sarah's marginal likelihood is simply equal to the likelihood of the data for $\theta = 0.5$ (i.e, $P(\text{data} \mid \theta = 0.5) =$ `r round(dbinom(8, 10, 0.5), 4)`) because that is the only value that Sarah assigned any prior mass to.
+Let's define the marginal likelihood once more: Sarah has prior beliefs about $\theta$, reflected by the prior mass in @fig-sarah-model-binomial. Each of these values has a certain match with the observed data (i.e., the likelihood). The marginal likelihood is then the average of all those likelihoods, weighted by the prior mass assigned. This weighting by prior mass makes the models' marginal likelihoods differ from one another, because each model has its own unique prior beliefs. If we were to look at the updating factor for Sarah, we would see that Sarah's marginal likelihood is simply equal to the likelihood of the data for $\theta = 0.5$ (i.e., $P(\text{data} \mid \theta = 0.5) =$ `r round(dbinom(8, 10, 0.5), 4)`) because that is the only value that Sarah assigned any prior mass to.

### David's Learning Process

@@ -218,7 +216,7 @@ samps <- c(0:5, rbinom(1e5, 10, samps))
davMarLik <- (table(samps)/1e5)[9]
```

-Just as before, we start with the prior beliefs, update these with the updating factor (which values in the model predicted the data better/worse than average?), to form posterior beliefs. The difference between the different models lies in their different starting points (prior beliefs), but also their updating factor differs. Specifically, the likelihood stays the same (@fig-likelihood-binomial-8-heads), but the marginal likelihood differs. As we see in @fig-two-models-binomial-onesided-predictions, David's marginal likelihood for the outcome of 8 heads is approximately `r round(davMarLik, 3)`.
+Just as before, we start with the prior beliefs, update these with the updating factor (which values in the model predicted the data better/worse than average?), to form posterior beliefs. The models differ not only in their starting points (prior beliefs), but also in their updating factors. Specifically, the likelihood stays the same (@fig-likelihood-binomial-8-heads), but the marginal likelihood differs. As we see in @fig-two-models-binomial-onesided-predictions, David's marginal likelihood for the outcome of 8 heads is approximately `r round(davMarLik, 3)`.
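
The chunk above approximates David's marginal likelihood by simulating outcomes from his prior. A self-contained sketch of that general recipe is given below; the one-sided uniform prior on [0.5, 1] is an assumption made here for illustration, not necessarily David's actual prior.

```r
set.seed(2023)
n_sims <- 1e5

# Draw theta values from an assumed one-sided prior (more mass on heads)
theta_draws <- runif(n_sims, min = 0.5, max = 1)

# Simulate an experiment of 10 flips for each drawn theta
simulated_heads <- rbinom(n_sims, size = 10, prob = theta_draws)

# The marginal likelihood of "8 heads" is approximated by the proportion
# of simulated experiments that produced exactly that outcome
mean(simulated_heads == 8)
```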

In the previous chapter we saw how we can use the ratio of the marginal likelihoods of two models (i.e., the Bayes factor) to do model comparison. In this chapter we focus on individual models and how they estimate/learn about the parameter. In that context we use the marginal likelihood to see which parameter values in the model predicted the data better or worse than average, and thus which values receive a boost or penalty in plausibility. The figure below illustrates this mechanism - note that the likelihood is exactly the same as for the other models, but the marginal likelihood (indicated by the purple bar) is different.
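
A small numerical sketch of both uses of the marginal likelihood (reusing Alex's marginal likelihood of 1/11 and Sarah's point-null value from earlier in this chapter; the model pairing and the specific $\theta$ values are illustrative):

```r
heads <- 8
flips <- 10

ml_alex  <- 1 / 11                        # Alex's marginal likelihood (uniform prior)
ml_sarah <- dbinom(heads, flips, 0.5)     # Sarah's marginal likelihood (point prior at 0.5)

# Between models: the Bayes factor is the ratio of marginal likelihoods
ml_alex / ml_sarah                        # about 2.1 in favor of Alex's model

# Within a model: values that predicted the data better than average get a boost
dbinom(heads, flips, 0.9) / ml_alex       # > 1: boost for theta = 0.9
dbinom(heads, flips, 0.3) / ml_alex       # < 1: penalty for theta = 0.3
```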

2 changes: 1 addition & 1 deletion 05-more-tests.qmd
@@ -17,7 +17,7 @@ The first question, "do people find alcoholic beer tastier?", concerns the diffe

<!-- ``` -->

-### Prior Distribution
+### Prior Distribution {#sec-prior-ttest}

The prior distribution is always defined on the same domain as the parameter of interest. For the proportion, this was the convenient domain of \[0, 1\], which allowed the use of the beta distribution, including the uniform distribution as the uninformed prior distribution. The domain of $\delta$ instead goes from $-\infty$ to $\infty$, so its prior distribution has to match that domain. For a null hypothesis this does not matter so much, since the null hypothesis generally posits a single value (e.g., 0, stating there is no difference between the groups). However, for the alternative hypothesis it now becomes tricky to have a uniform distribution on the whole domain of $\delta$. Since the domain is infinitely large, the density of a uniform distribution between $-\infty$ and $\infty$ would need to be infinitely small, which is not very practical. Instead, what is generally done is to use a probability distribution that is spread out a little less (although still a lot more than a point null hypothesis). One such distribution is the [Cauchy distribution](https://en.wikipedia.org/wiki/Cauchy_distribution), which is a $t$ distribution with a single degree of freedom. The width of the Cauchy distribution is determined by the **Cauchy scale parameter**. Below, several examples are given:
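
A minimal R sketch of the Cauchy/$t$ relationship described above (the scale value of 0.707 is just one illustrative choice):

```r
# A Cauchy prior with scale r is a t distribution with 1 degree of freedom,
# stretched by the scale parameter r
dcauchy_via_t <- function(delta, r) dt(delta / r, df = 1) / r

delta <- 1      # an arbitrary effect size at which to evaluate the densities
r     <- 0.707  # illustrative Cauchy scale parameter

dcauchy(delta, location = 0, scale = r)   # built-in Cauchy density
dcauchy_via_t(delta, r)                   # identical value via the t(1) density
```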

