
Bayesian Statistics: The three cultures

(statmodeling.stat.columbia.edu)
309 points by luu | 2 comments
tfehring ◴[] No.41081746[source]
The author is claiming that Bayesians vary along two axes: (1) whether they generally try to inform their priors with their knowledge or beliefs about the world, and (2) whether they iterate on the functional form of the model based on its goodness-of-fit and the reasonableness and utility of its outputs. He then labels 3 of the 4 resulting combinations as follows:

    ┌───────────────┬───────────┬──────────────┐
    │               │ iteration │ no iteration │
    ├───────────────┼───────────┼──────────────┤
    │ informative   │ pragmatic │ subjective   │
    │ uninformative │     -     │ objective    │
    └───────────────┴───────────┴──────────────┘
My main disagreement with this model is the empty bottom-left cell (uninformative priors, with iteration) - in fact, I think that's where most self-labeled Bayesians in industry fall:

- Iterating on the functional form of the model (and therefore the assumed underlying data generating process) is generally considered obviously good and necessary, in my experience.

- Priors are usually uninformative or weakly informative, partly because data is often big enough to overwhelm the prior.
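
A minimal sketch of that second point, not from the thread itself: with a conjugate Beta-Binomial model, a flat Beta(1, 1) prior and a weakly informative Beta(2, 2) prior give essentially the same posterior once the sample is large. The numbers are purely illustrative.

    # Beta-Binomial conjugacy: posterior is Beta(a + successes, b + failures).
    # With n = 10,000 the choice between these two priors barely matters.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, true_p = 10_000, 0.3
    successes = rng.binomial(n, true_p)

    for a, b, label in [(1, 1, "flat Beta(1, 1)"), (2, 2, "weak Beta(2, 2)")]:
        post = stats.beta(a + successes, b + n - successes)
        print(f"{label}: posterior mean = {post.mean():.4f}")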

The need for iteration seems so obvious to me that the entire "no iteration" column feels like a straw man. But the author, who knows far more academic statisticians than I do, explicitly says that he had the same belief and "was shocked to learn that statisticians didn’t think this way."

replies(3): >>41081867 #>>41082105 #>>41084103 #
klysm ◴[] No.41081867[source]
The no-iteration culture is very real, and I don’t think it exists for particularly bad reasons. We iterate on models to make them better, by some definition of better. It’s no secret that scientific work is subject to rather perverse incentives around significance thresholds and positive results. Publish or perish. Perverse incentives lead to perverse statistics.

The iteration itself is sometimes viewed directly as a problem. The “garden of forking paths”, where the analysis depends on the data, is viewed as a direct cause of some of the statistical and epistemological crises in science today.
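
A rough illustration of the forking-paths problem (my own sketch, not the commenter's): both groups below are drawn from the same distribution, yet an analyst who tries several data-dependent analyses and reports the smallest p-value will cross p < 0.05 far more often than 5% of the time. The specific "paths" are made up for the example.

    # Simulate many null datasets; on each, try several analyses and keep the
    # smallest p-value. The resulting false-positive rate is well above 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, hits = 2_000, 0
    for _ in range(n_sims):
        a = rng.normal(size=100)
        b = rng.normal(size=100)          # same distribution: no real effect
        p_values = [
            stats.ttest_ind(a, b).pvalue,              # plain t-test
            stats.ttest_ind(a[a > a.min()], b).pvalue, # "drop that one outlier"
            stats.ttest_ind(a[:50], b[:50]).pvalue,    # "the first half looked cleaner"
            stats.mannwhitneyu(a, b).pvalue,           # "maybe a rank test is more appropriate"
        ]
        hits += min(p_values) < 0.05
    print(f"false-positive rate with forking paths: {hits / n_sims:.3f}")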

Iteration itself isn’t inherently bad. It’s just that the objective function usually isn’t what we want from a scientific perspective.

I suspect that, to those actually doing scientific work, iterating on their models feels like doing something unfaithful.

Furthermore, I believe a lot of these issues are strongly related to the flawed epistemological framework on which many scientific fields seem to have converged: p < 0.05 means it’s true, otherwise it’s false.

edit:

Perhaps another way to characterize this discomfort is by the number of degrees of freedom that the analyst controls. In a Bayesian context, where we are picking priors either from belief or from previous data, the analyst has a _lot_ of control over how the results come out the other end.
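
To make that concrete, here is a small sketch of my own (the priors and data are invented for illustration): with only 10 observations, the choice of prior moves the posterior mean of a proportion substantially, so two analysts with the same data can reach different conclusions.

    # Same 7-out-of-10 data, three different Beta priors, three different answers.
    from scipy import stats

    successes, n = 7, 10
    priors = {"flat Beta(1, 1)": (1, 1),
              "skeptical Beta(2, 20)": (2, 20),
              "optimistic Beta(20, 2)": (20, 2)}
    for label, (a, b) in priors.items():
        post = stats.beta(a + successes, b + n - successes)
        lo, hi = post.ppf([0.025, 0.975])
        print(f"{label}: mean {post.mean():.2f}, 95% interval ({lo:.2f}, {hi:.2f})")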

I think this is why fields have trended towards a set of ‘standard’ tests instead of building good statistical models. These take most of the knobs out of the analyst’s hands and are generally more conservative.

replies(3): >>41081904 #>>41082486 #>>41082720 #
j7ake ◴[] No.41082486[source]
Iteration is necessary for any analysis. To safeguard yourself against overfitting, be sure to have a hold-out dataset that isn’t touched until the very end.
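
A minimal sketch of that workflow (the dataset and model here are arbitrary placeholders): iterate freely against a validation split, and touch the hold-out test set exactly once for the final estimate.

    # Lock away a test set before any modeling decisions; select a model on the
    # validation split; report generalization error from the untouched test set.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1_000) > 0).astype(int)

    X_work, X_test, y_work, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_work, y_work, test_size=0.25, random_state=0)

    # iterate on model choices using the validation set only
    fits = {c: LogisticRegression(C=c, max_iter=1_000).fit(X_train, y_train)
            for c in (0.01, 0.1, 1.0, 10.0)}
    best_c = max(fits, key=lambda c: accuracy_score(y_val, fits[c].predict(X_val)))

    # the test set is used once, at the end
    print("held-out accuracy:", accuracy_score(y_test, fits[best_c].predict(X_test)))
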
replies(1): >>41084424 #
1. laichzeit0 ◴[] No.41084424[source]
What about automated predictive modeling pipelines? In other words, I want the best possible point estimates on future data only. I’d think that, regardless of the model selection process, I’d want to re-estimate the parameters on the entire dataset before I deploy it, so as not to “waste” data. I.e., I want to use the hold-out test data in the final model. Is this valid?
replies(1): >>41085909 #
2. disgruntledphd2 ◴[] No.41085909[source]
> What about automated predictive modeling pipelines? In other words, I want the best possible point estimates on future data only. I’d think that, regardless of the model selection process, I’d want to re-estimate the parameters on the entire dataset before I deploy it, so as not to “waste” data. I.e., I want to use the hold-out test data in the final model. Is this valid?

Personally, I think that as long as you’re generating data constantly (through some kind of software/hardware process), you’d be well served to keep your sets pure and fit the final model only on data that wasn’t used in the original modeling process. This is often wildly impractical (and probably controversial even within the field), but it’s safer.

(If you train on the entire internet, this may not be possible either.)
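
A rough sketch of the "keep your sets pure" option (the data-generating function and model are placeholders I made up, not anything from the thread): model selection runs entirely on historical data, and the deployed model is then fit on data generated after that selection was frozen.

    # Step 1: choose hyperparameters on historical data only.
    # Step 2: fit the model you actually deploy on fresh data that played no
    #         part in the selection process.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)

    def make_batch(n):
        """Stand-in for the ongoing data-generating process."""
        X = rng.normal(size=(n, 4))
        y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)
        return X, y

    X_hist, y_hist = make_batch(5_000)
    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5).fit(X_hist, y_hist)

    X_fresh, y_fresh = make_batch(5_000)
    deployed = Ridge(alpha=search.best_params_["alpha"]).fit(X_fresh, y_fresh)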