Differences between prior distribution and prior predictive distribution?
While studying Bayesian statistics, I am having trouble understanding the difference between the prior distribution and the prior predictive distribution. The prior distribution itself is more or less fine to understand, but I find the purpose of the prior predictive distribution vague, and I don't see how it differs from the prior distribution.
machine-learning bayesian inference data-mining hierarchical-bayesian
asked 7 hours ago by Changhee Kang (new contributor)
2 Answers
Let $Y$ be a random variable representing the (maybe future) data. We have a (parametric) model for $Y$ with $Y \sim f(y \mid \theta), \quad \theta \in \Theta$, where $\Theta$ is the parameter space. Then we have a prior distribution represented by $\pi(\theta)$. Given an observation of $Y$, the posterior distribution of $\theta$ is
$$
f(\theta \mid y) = \frac{f(y \mid \theta)\, \pi(\theta)}{\int_\Theta f(y \mid \theta)\, \pi(\theta)\; d\theta}.
$$
The prior predictive distribution of $Y$ is then the (modeled) distribution of $Y$ marginalized over the prior, that is, integrated over $\pi(\theta)$:
$$
f(y) = \int_\Theta f(y \mid \theta)\, \pi(\theta)\; d\theta,
$$
that is, the denominator in Bayes' theorem above. This is also called the preposterior distribution of $Y$. It tells you what data (that is, which values of $Y$) you expect to see before learning more about $\theta$. This has many uses, for instance in the design of experiments; for examples, see Experimental Design on Testing Proportions or Intersections of chemistry and statistics.
Another use is as a way to understand the prior distribution better. Say you are interested in modeling the variation in the weight of elephants, and your prior distribution leads to a prior predictive with substantial probability above 20 tons. Then you might want to rethink: even the largest elephants seldom weigh more than 6 tons, so substantial probability above 20 tons seems wrong. One interesting paper in this direction is Gelman (which does not use this terminology ...).
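As a concrete illustration of such a prior predictive check, here is a minimal Monte Carlo sketch in Python; the particular prior and sampling distribution (and all the numbers in them) are hypothetical choices made up for this example, not something taken from the elephant story above.

```python
import numpy as np

# Hypothetical model for an elephant's weight Y (in tonnes):
#   theta ~ Normal(4, 2)        (prior on the mean weight)
#   Y | theta ~ Normal(theta, 1)
rng = np.random.default_rng(0)
n_draws = 100_000

theta = rng.normal(loc=4.0, scale=2.0, size=n_draws)  # draws from the prior pi(theta)
y = rng.normal(loc=theta, scale=1.0)                   # one Y per theta: draws from f(y)

# The y's are samples from the prior predictive f(y) = integral of f(y|theta) pi(theta) dtheta.
print("P(Y > 20 tonnes) ~", (y > 20).mean())
print("P(Y >  6 tonnes) ~", (y > 6).mean())
# If these probabilities look implausible for real elephants, the prior
# (not only the likelihood) needs rethinking.
```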
Finally, preposterior concepts are typically not useful with uninformative priors; they require that the prior modeling be taken seriously. One example is the following: let $Y \sim \mathcal{N}(\theta, 1)$ with the flat prior $\pi(\theta) = 1$. Then the prior predictive of $Y$ is
$$
f(y) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac12 (y - \theta)^2}\; d\theta = 1,
$$
so it is itself uniform (and improper), and hence not very useful.
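For what it's worth, here is a quick numerical check of that last integral (a sketch using scipy's quadrature; nothing here is specific to the answer beyond the flat-prior normal model):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# For any fixed y, integrate the N(y | theta, 1) density over theta:
# the result is 1 regardless of y, so the "prior predictive" is flat in y.
for y in (-3.0, 0.0, 7.5):
    val, _ = quad(lambda theta: stats.norm.pdf(y, loc=theta, scale=1.0),
                  -np.inf, np.inf)
    print(f"y = {y:5.1f}:  integral = {val:.6f}")
```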
answered 5 hours ago by kjetil b halvorsen, edited 4 hours ago by Christoph Hanck
Predictive here means predictive for observations: the prior distribution is a distribution for the parameters, whereas the prior predictive distribution is a distribution for the observations.
If $X$ denotes an observation and we use the model (or likelihood) $p(x \mid \theta)$, then a prior distribution is a distribution for $\theta$, for example $p_\beta(\theta)$, where $\beta$ is a set of hyperparameters. Note that there is no conditioning on $\beta$, so the hyperparameters are considered fixed; this is not the case in hierarchical models, but that is not the point here.
The prior predictive distribution is the distribution of $X$ "averaged" over $\theta$:
$$
p_\beta(x) = \int p(x \mid \theta)\, p_\beta(\theta)\, d\theta.
$$
This distribution is prior as it does not rely on any observations.
We can also define in the same way the posterior predictive distribution: if we have a sample $X = (X_1, \dots, X_n)$, the posterior predictive distribution is
\begin{align*}
p_\beta(x \mid X) &= \int p(x \mid X, \theta)\, p_\beta(\theta \mid X)\, d\theta \\
&= \int p(x \mid \theta)\, p_\beta(\theta \mid X)\, d\theta,
\end{align*}
using that a new observation $x$ is independent of $X$ given $\theta$. Thus the posterior predictive distribution is constructed in the same way as the prior predictive distribution, but whereas the prior predictive weights the model with $p_\beta(\theta)$, the posterior predictive weights it with $p_\beta(\theta \mid X)$, that is, with our "updated" knowledge about $\theta$.
Example: Beta-Binomial
Suppose our model is $X \mid \theta \sim \mathrm{Bin}(n_1, \theta)$, i.e. $P(X = x \mid \theta) = \binom{n_1}{x} \theta^x (1-\theta)^{n_1 - x}$.
We assume a beta prior distribution for $\theta$, $\beta(a, b)$, where $(a, b)$ is the set of hyperparameters.
Then the prior predictive distribution of $X$ is the beta-binomial distribution with parameters $(n_1, a, b)$. This discrete distribution gives the probability of $k$ successes out of $n_1$ trials given hyperparameters $(a, b)$ on the probability of success.
Now suppose we observe $n_1$ draws $(x_1, \dots, x_{n_1})$ with $x$ successes.
Since the binomial and beta distributions are conjugate, we have
\begin{align*}
p(\theta \mid X = x) &\propto \theta^x (1-\theta)^{n_1 - x} \times \theta^{a-1} (1-\theta)^{b-1} \\
&\propto \theta^{a + x - 1} (1-\theta)^{n_1 + b - x - 1} \\
&\propto \beta(a + x,\, n_1 + b - x).
\end{align*}
Thus $\theta \mid x$ also follows a beta distribution, and the predictive distribution of a new observation, $p(\tilde{x} \mid x, a, b)$, is again beta-binomial, but this time with parameters $(a + x,\, b + n_1 - x)$ rather than $(a, b)$.
So, with a $\beta(a, b)$ prior distribution and a $\mathrm{Bin}(n_1, \theta)$ likelihood, if we observe $x$ successes out of $n_1$ trials, the posterior predictive distribution is a beta-binomial with parameters $(n_2, a + x, b + n_1 - x)$. Note that $n_2$ and $n_1$ play different roles, since the posterior predictive answers the question:
Given my current knowledge of $\theta$ after observing $x$ successes out of $n_1$ trials, i.e. $\theta \mid x \sim \beta(a + x,\, b + n_1 - x)$, what is the probability of observing $k$ successes out of $n_2$ additional trials?
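For concreteness, here is a small sketch of this beta-binomial setup in Python using scipy.stats.betabinom; the particular values of $a$, $b$, $n_1$, $n_2$ and $x$ are made up for illustration.

```python
from scipy import stats

a, b = 2, 2        # hyperparameters of the Beta(a, b) prior on theta (illustrative)
n1, n2 = 10, 10    # observed trials and future trials (illustrative)
x = 7              # observed successes out of n1 (illustrative)

# Prior predictive for the n1 observed trials: BetaBinomial(n1, a, b)
prior_pred = stats.betabinom(n1, a, b)
# Posterior predictive for n2 additional trials: BetaBinomial(n2, a + x, b + n1 - x)
post_pred = stats.betabinom(n2, a + x, b + n1 - x)

print("prior predictive     P(7 successes in n1 trials):", prior_pred.pmf(7))
print("posterior predictive P(7 successes in n2 trials):", post_pred.pmf(7))
# The posterior predictive puts more mass near 7 because the observed 7/10
# successes have shifted the weighting over theta from the prior to the posterior.
```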
I hope this is useful and clear.
answered 5 hours ago by winperikle
Yeap, I believe I have understood what you have explained here. Thank you very much.
– Changhee Kang, 3 hours ago