(UPDATE: edits added to the model section, correcting a bonehead mistake. There is no stratification of model families by treatment, because the treatment is a categorical variable in the model...)
There's been an interesting (to dorks like me) discussion going on over at Chris Masterjohn's blog "Mother Nature Obeyed" concerning statistical frameworks, and the meaning of that ever-elusive p-value:
Specifically, we've been discussing the peculiar results of one of the first human diet trials that substituted modern processed seed oils for traditional animal fats: the LA Veterans Administration trial. This trial is often cited in reviews and meta-analyses to adduce support for the hypothesis that polyunsaturated-fat-rich vegetable oils reduce the risk of cardiovascular mortality. In fact, as Masterjohn and others have pointed out, the actual results are more interesting.
(More details can be found here: http://www.westonaprice.org/know-your-fats/good-fats-bad-fats-separating-fact-from-fiction)
There were two treatment groups: an intervention that substituted vegetable oils, and a control that stayed with traditional animal fats (e.g. butter). Significantly, the study population had a mean age above 60 years, and the trial ran for a relatively long time, 8 years. The researchers tracked total mortality, cardiovascular mortality, and cancer mortality. The cancer data were only possible because of the age of the cohorts, and this is evidently a unique aspect of this study.
In brief, the total mortality data were a wash, the cardiovascular mortality showed a benefit to vegetable oils, but the cancer data showed a benefit to animal fats!
But the plot thickens. It turns out the randomization was not as effective as it should have been at controlling for smoking rates, so the animal fat group was saddled with a much larger share of heavy and moderate smokers. In essence, this has led Masterjohn to (quite reasonably) conclude that, absent confounding, the animal fats cohort would probably not have had a higher CHD risk, and would have had an even lower relative cancer risk than was observed!
The study authors concluded that the effect of treatment (animal fat versus vegetable oil) on cancer risk "only" attained a p-value of 0.06 (above the arbitrary threshold of p=0.05 for avoiding type I error, i.e. incorrectly rejecting a true null hypothesis), and was thus "not significant". But the effect size was decent (20% or so), and there's that pesky issue of confounding, confound it!
So, the question is, how seriously should we take this p-value? Are there alternative approaches to exploring these data that avoid the pesky arbitrariness?
This led to a discussion about interval estimation, its relation to p-values, and the differing interpretations that frequentists and Bayesians bring to the table.
P-values and Confidence Intervals
In brief, frequentist statistical hypothesis testing is all about rejecting or failing to reject a null hypothesis. It asks for the probability of the data given the null hypothesis, P(D|H0). The null hypothesis specifies no effect (slope = 0 in a linear regression, for example). Rejecting the null hypothesis doesn't mean that your alternative hypotheses are necessarily true. Frequentist intervals ("confidence intervals") are constructed so that a p% interval, over a long run of hypothetical repeated experiments, contains the "true" parameter value p% of the time. It is incorrect to say that you are p% certain that a particular interval contains the true parameter value, since it either does or doesn't. This is hugely counter-intuitive.
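The long-run reading of a confidence interval is easy to demonstrate with a small simulation (all numbers here are made up for illustration, not data from the trial): repeat a hypothetical experiment many times, build a nominal 95% interval each time, and count how often the interval covers the true mean.

```python
import math
import random

random.seed(42)

TRUE_MEAN = 5.0   # hypothetical "true" parameter the intervals are chasing
TRUE_SD = 2.0
N = 30            # sample size per experiment
TRIALS = 2000     # the "long run" of hypothetical repeated experiments

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = sum(sample) / N
    # sample standard deviation
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (N - 1))
    # approximate 95% interval using the normal critical value 1.96
    half_width = 1.96 * sd / math.sqrt(N)
    if mean - half_width <= TRUE_MEAN <= mean + half_width:
        covered += 1

coverage = covered / TRIALS
print(f"Empirical coverage of nominal 95% CIs: {coverage:.3f}")
```

The coverage property belongs to the *procedure* over many repetitions; any single interval either caught the true mean or it didn't.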
In contrast, Bayesian hypothesis testing first specifies a prior (either informative or non-informative), then calculates a posterior distribution or probability for a specific hypothesis, P(Hi|D). This is a more direct assessment of a particular quantitative hypothesis than the frequentist framework of rejecting a null hypothesis. It is correct to assert p% confidence in the hypothesis, given the data. Likewise, Bayesian intervals (credible intervals, credible sets, etc.) are expressions of confidence that the underlying parameter (which Bayesians treat as a random variable, not a fixed constant) has a value in that range *based on these data*, not on a hypothetical long run of experiments.
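Here's a minimal sketch of the Bayesian version, using a grid approximation of the posterior for a single event rate under a flat prior. The counts are purely hypothetical, not the LA trial's. The credible interval that falls out is a direct probability statement about the parameter, given these data.

```python
import math

# Hypothetical counts, for illustration only: k events among n subjects
k, n = 12, 100

# Grid of candidate values for the event rate theta
grid = [i / 1000 for i in range(1, 1000)]

# Flat prior, so the unnormalized posterior is just the binomial likelihood
post = [math.comb(n, k) * t**k * (1 - t)**(n - k) for t in grid]
total = sum(post)
post = [p / total for p in post]

# Posterior mean
pmean = sum(t * p for t, p in zip(grid, post))

# 95% equal-tailed credible interval from the cumulative posterior
cum, lower, upper = 0.0, None, None
for t, p in zip(grid, post):
    cum += p
    if lower is None and cum >= 0.025:
        lower = t
    if upper is None and cum >= 0.975:
        upper = t

print(f"Posterior mean: {pmean:.3f}")
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Unlike the confidence interval above, this interval licenses the statement "given these data, there is a 95% probability that theta lies in this range."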
It is very important to note that choosing a non-informative prior usually makes a Bayesian analysis essentially identical to its frequentist counterpart. It can do all the same things, but is often computationally clunky, and there are still philosophical differences in interpretation, as suggested above.
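One way to see the correspondence: with a flat prior, the posterior is proportional to the likelihood, so the posterior mode coincides with the frequentist maximum-likelihood estimate k/n. A tiny grid sketch (counts are hypothetical):

```python
# With a flat prior, posterior mode == frequentist MLE (k/n)
k, n = 12, 100                                   # hypothetical counts
grid = [i / 1000 for i in range(1, 1000)]
post = [t**k * (1 - t)**(n - k) for t in grid]   # posterior ∝ likelihood

mode = grid[post.index(max(post))]               # posterior mode on the grid
print(mode, k / n)
```

The numerical answers match; only the interpretation (random parameter versus long-run procedure) differs.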
In the end, I don't think these differences in approach mean much for how to interpret the p-value of 0.06 the study authors reported. Since this study is unique, there's little basis for an informative prior with which to re-run an ANOVA-style hypothesis test.
But I think there's another way (although I'm admittedly way out of my depth in speculative land). My background is in ecology and agriculture, not medicine or nutrition, although the fields interest me.
Bayesian Modeling- An Alternative Proposal
In the comments section of the above post (http://www.westonaprice.org/blogs/cmasterjohn/2012/07/17/im-95-confident-this-is-a-good-definition-of-a-p-value/) I outlined an alternative approach to these data. Rather than quibbling with the p-value the study authors reported, I would build a Bayesian statistical model, use Bayesian model selection to identify the best candidates, and estimate credible sets for the parameter values. This approach is not an alternative to hypothesis testing (which was already done), but a complementary way of understanding and getting information out of the data.
"I would use a Bayesian statistical model that has both informative
and uninformative priors (be warned that I’m way out of my depth here,
my only familiarity with Bayesian inference is in ecological studies):
I would build a generalized linear model with three predictor
variables: veggie-oil/animal-fat (categorical), age (categorical, probably) and
smoking rate (continuous). The response variable is cancer rate. Since
the treatment enters as a categorical variable, a single family of models
covers both arms.
Rate = B0 + B1*(veggie-oil vs. animal fat) + B2*(smoking) + B3*(age) + interaction terms
You could use frequentist model selection and parameter estimation at
this point. However, using a Bayesian approach you would specify priors
for each parameter. Since it seems there should be robust data for the effects of smoking
and age on cancer risk, it makes sense to incorporate them as Bayesian
priors (rather than treating this study as a de novo universe for cancer
risk attributable to smoking and age). Might the NIH have such data?
I would choose a non-informative Bayesian prior for the
animal-fat/veggie-oil parameter. This is where the absence of previous
studies comes into the picture.
Anyway, I would evaluate these data with Bayesian model selection and
generate credible sets for the parameters. Rather than simple
hypothesis testing we'll: 1) select the "best" models for cancer risk,
and 2) build credible sets for the parameters of interest to explore
their biological meaning. P-values would never enter into this
analysis. After all, we already have a p-value (p=0.06), and it's
arbitrary to say that's not significant but p=0.05 is! I think this
statistical model would be far more informative."
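To make the single-model point concrete, here is a sketch of the linear predictor with the treatment coded as a 0/1 indicator (identity link, for simplicity). All coefficient values are made-up placeholders, not estimates from the trial, and b13 stands in for a hypothetical treatment-by-age interaction.

```python
# Sketch of the proposed GLM's linear predictor. Treatment is one
# categorical (0/1) predictor inside a single model, so there is one
# model family, not one per arm. Coefficients are illustrative only.

def cancer_rate(treatment_veg_oil, smoking, age_group,
                b0=0.05, b1=-0.01, b2=0.002, b3=0.01, b13=0.0):
    """Linear predictor: rate = b0 + b1*treatment + b2*smoking + b3*age + interaction.

    treatment_veg_oil: 1 for the vegetable-oil arm, 0 for the animal-fat arm
    smoking: a continuous smoking-rate measure (e.g. cigarettes/day)
    age_group: a coded age category (e.g. 0, 1, 2)
    b13: treatment-by-age interaction coefficient
    """
    return (b0
            + b1 * treatment_veg_oil
            + b2 * smoking
            + b3 * age_group
            + b13 * treatment_veg_oil * age_group)

# Both arms are scored by the same model; only the indicator changes.
veg = cancer_rate(1, 10, 2)
fat = cancer_rate(0, 10, 2)
print(veg, fat)
```

This is why the original "two sets of models" phrasing was a mistake: switching arms just flips one indicator variable, it doesn't change which model is fit.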
Model Selection and Interpretation
Using information-theoretic methods (Bayesian analogues of AIC, such as BIC or DIC), we'll discover which models best predict cancer risk in these cohorts. If none of the best models include a particular parameter, it's probably safe to say that parameter isn't really meaningful in this study. It would also be interesting to note whether any of the best models contain interaction terms, like, say, (oil) x (age).
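The selection logic can be sketched with plain AIC on simulated toy data (the Bayesian analogues work on the same penalized-fit idea): fit competing models, penalize each by its parameter count, and prefer the lowest score. Everything below is fabricated purely to show the mechanics; here the response truly depends on x1 and not on x2, so selection should favor the x1 model.

```python
import math
import random

random.seed(1)

# Toy data: y depends on x1 but not on x2 (both entirely simulated)
n = 50
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 0.8 * a + random.gauss(0, 1.0) for a in x1]

def rss_simple(x, y):
    """Ordinary least squares for y = a + b*x; returns the residual sum of squares."""
    m = len(y)
    mx, my = sum(x) / m, sum(y) / m
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def aic(rss, n, k):
    # Gaussian AIC up to an additive constant: n*ln(RSS/n) + 2k
    return n * math.log(rss / n) + 2 * k

aic1 = aic(rss_simple(x1, y), n, 3)   # model with the relevant predictor
aic2 = aic(rss_simple(x2, y), n, 3)   # model with the irrelevant predictor
print(aic1, aic2)
```

The model containing the genuinely predictive variable gets the lower score; a predictor that never appears in any low-scoring model is, as argued above, probably not meaningful.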
The credible sets around the parameters will tell us the plausible range of effect sizes, and since we're being Bayesian, we're p% sure of it!
Finally, an advantage of the Bayesian approach is that we should be able to integrate high-quality data from, say, the NIH on the influence of smoking rate and age on cancer for cohorts like those in the LA Veterans study, and not pretend that this study is an isolated universe for sampling cancer risk factors.
For the vegetable-oil/animal-fat parameter, we'd use an uninformative prior, because there's no justification for anything else. Nevertheless, the results of our model (the posterior for that factor) could become priors for future studies of similar design...
This is the beauty and the danger of Bayesian priors.
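The mechanics of "today's posterior becomes tomorrow's prior" are easiest to see in a conjugate setting. A sketch with a Beta prior on an event rate, using entirely hypothetical counts:

```python
# Conjugate Beta-Binomial updating: the posterior from one study can
# serve as the informative prior for the next. All counts are
# hypothetical, purely to illustrate the mechanics.

def update(alpha, beta, events, n):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + events, beta + (n - events)

# Study 1: start from a flat (uninformative) Beta(1, 1) prior
a, b = update(1, 1, events=12, n=100)
print(f"Posterior after study 1: Beta({a}, {b}), mean {a / (a + b):.3f}")

# Study 2 of similar design: reuse that posterior as an informative prior
a2, b2 = update(a, b, events=9, n=80)
print(f"Posterior after study 2: Beta({a2}, {b2}), mean {a2 / (a2 + b2):.3f}")
```

The beauty is that evidence accumulates across studies; the danger is that a badly chosen prior drags every subsequent posterior along with it.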
Before doing the modeling, I'm not sure what to expect. It may be that some of my assumptions are unwarranted. The devil is always in the details. At any rate, this kind of modeling is an interesting complement to ANOVA-style hypothesis testing.
My probably over-hasty guess is that, under model selection, these data would show a protective effect for the animal fats. The inference would certainly be controversial! I'm certain that no one at Harvard Public Health would believe it ;)