I was reading a psychology paper that promised to change how I thought about willpower.
The study claimed that ego depletion—the idea that willpower is a limited resource that gets exhausted—had been proven through rigorous experiments. Hundreds of studies supported it. It was taught in psychology courses. It was in textbooks.
I built my productivity system around this concept. I scheduled important decisions for the morning. I avoided making choices when I was tired. I believed willpower worked like a muscle that could be depleted.
Then I learned that when other researchers tried to replicate the ego depletion studies, they couldn’t reproduce the results.
The effect disappeared.
This wasn’t an isolated incident. This was part of something much bigger and more troubling: The Replication Crisis.
And it’s not just academic drama—it affects every piece of psychology research you’ve ever read, every self-help book based on “science,” every intervention claimed to work.
Let me explain why psychology research is in crisis, and what it means for how you should think about scientific claims.
What Is the Replication Crisis?
The replication crisis is the alarming discovery that many published research findings cannot be reproduced when other scientists try to repeat the experiments.
The Moment It Became Undeniable
In 2015, a massive collaborative project called the Reproducibility Project: Psychology tried to replicate 100 psychology studies published in top journals.
The results were devastating:
- Only 36% of the studies replicated successfully
- Many effect sizes (how strong the phenomenon is) were much smaller when replicated
- Some famous findings disappeared entirely
Think about that. 64% of published findings in prestigious psychology journals couldn’t be reproduced.
This wasn’t because the replicators were incompetent. The original authors were often involved. The methodology was rigorous.
The original findings were just… wrong.
Not Just Psychology
While psychology has gotten the most attention, the crisis extends to:
Medicine:
- Only 11% of landmark cancer research findings could be replicated (Begley & Ellis, 2012)
- Many drug studies fail to replicate
Economics:
- Replication rates around 60-70%
- Many behavioral economics findings are fragile
Social Sciences Generally:
- Political science, sociology, education research all affected
- Varying replication rates, but all concerning
Even Hard Sciences (to a lesser extent):
- Biology and chemistry have replication issues
- Physics is most reliable, but not immune
But psychology is the poster child for the crisis, and understanding why reveals deep problems in how science is done.
How Did This Happen? The Perverse Incentives of Academic Publishing
The replication crisis isn’t about a few bad apples. It’s a systemic problem created by how academic research is incentivized, published, and rewarded.
Problem #1: Publish or Perish
Academic careers depend almost entirely on publishing papers in prestigious journals.
The pressure:
- Tenure depends on publication count and journal prestige
- Grants depend on publication record
- Status depends on citations
- “Publish or perish” is literal
The consequence:
- Researchers need to produce positive, novel, surprising findings
- Negative results (no effect found) don’t get published
- Replications don’t get published (they’re “not novel”)
- Incentive is to find effects, not to find truth
Problem #2: Publication Bias (The File Drawer Problem)
Journals want to publish exciting, positive results. They reject boring negative findings.
What this creates:
Imagine 20 researchers independently test whether power posing increases confidence.
- 19 find no effect
- 1 finds a positive effect (by chance)
What gets published? The one positive study.
What gets filed away? The 19 negative studies.
The published literature shows “power posing works!” The file drawers contain the truth: “It probably doesn’t.”
The result: The published literature is systematically biased toward false positives.
This is called publication bias or the file drawer problem.
Problem #3: P-Hacking (Data Torture)
In statistics, p < 0.05 is the conventional threshold for “statistically significant.” It means that if there were truly no effect, there would be less than a 5% chance of seeing data this extreme.
The problem: Researchers are incentivized to get that magical p < 0.05 by any means necessary.
How to p-hack (also called data dredging or fishing):
1. Trying multiple analyses until one “works”:
- Test 20 different relationships
- Only report the one that’s significant
- Don’t mention the 19 that weren’t
2. Optional stopping:
- Collect data continuously
- Check for significance repeatedly
- Stop collecting data once you hit p < 0.05
- (If you’d kept going, it might have disappeared)
3. Selective exclusion:
- Remove “outliers” that weaken your effect
- Justify why those data points “don’t count”
- Keep removing until p < 0.05
4. Exploring multiple dependent variables:
- Measure 10 different outcomes
- Report only the ones that show effects
- Ignore the ones that don’t
5. Adding covariates:
- Add control variables one at a time
- See which combination gives you significance
- Report that combination as “the model”
None of these are technically fraud. They’re all “researcher degrees of freedom.”
But together they can inflate the false positive rate from the nominal 5% to as high as 60%.
Statistician Andrew Gelman calls this “the garden of forking paths.”
There are so many decisions to make in data analysis (which participants to exclude, which variables to control for, how to transform data, when to stop collecting) that you can almost always find a path to significance.
It’s not intentional dishonesty. It’s unconscious bias combined with perverse incentives.
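To make that inflation concrete, here’s a minimal simulation sketch in Python. The numbers are invented for illustration (10 outcome measures, 20 participants per group, no true effect anywhere); it only demonstrates the mechanism of “measure many things, report whichever is significant.”

```python
# Sketch: how testing many outcomes inflates false positives.
# Assumption: two groups, NO true effect on any of 10 outcome measures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 5_000   # imaginary labs all running the same null study
n_per_group = 20        # typical small-sample psychology study
n_outcomes = 10         # "measure 10 things, report what works"

false_positive_any = 0
for _ in range(n_experiments):
    # every outcome is pure noise in both groups
    group_a = rng.normal(size=(n_per_group, n_outcomes))
    group_b = rng.normal(size=(n_per_group, n_outcomes))
    p_values = stats.ttest_ind(group_a, group_b, axis=0).pvalue
    if (p_values < 0.05).any():      # report "the" significant outcome
        false_positive_any += 1

print(f"Chance of at least one 'significant' result: "
      f"{false_positive_any / n_experiments:.0%}")
# Expect roughly 40% here, versus the nominal 5% for one pre-specified test.
```

Stack optional stopping and selective exclusion on top of this and the rate climbs toward the 60% figure mentioned above.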
Problem #4: HARKing (Hypothesizing After Results Known)
The scientific method:
- Form hypothesis
- Design experiment to test it
- Collect data
- Analyze results
What actually happens (HARKing):
- Collect data (maybe with vague hypothesis)
- Explore the data
- Find a pattern
- Pretend you predicted it all along
- Write paper as if you had a strong a priori hypothesis
Why this is a problem:
If you explore data looking for patterns, you’ll find them (even in random data). This is called overfitting.
The pattern might be real, or it might be noise. You need new data to test whether it holds up.
But HARKing presents exploratory findings as confirmatory. You’re pretending you tested a specific hypothesis when you actually went fishing for patterns.
This massively increases false positives.
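Here’s a hedged toy demonstration of why exploratory findings need fresh data: search pure noise for the “best” correlation, then test that same pair of variables on new noise. The variable counts and sample sizes are invented for illustration.

```python
# Sketch: HARKing in miniature. Search noise for a pattern, then see
# whether it survives contact with new data. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_people, n_vars = 30, 15          # small sample, many measured variables

exploratory = rng.normal(size=(n_people, n_vars))   # pure noise
# "Explore the data": find the pair of variables with the strongest correlation
best_p, best_pair = 1.0, None
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(exploratory[:, i], exploratory[:, j])
        if p < best_p:
            best_p, best_pair = p, (i, j)

print(f"Best correlation found by fishing: vars {best_pair}, p = {best_p:.3f}")

# Confirmatory test: the SAME pair of variables, in a fresh sample of noise
confirmatory = rng.normal(size=(n_people, n_vars))
i, j = best_pair
r, p = stats.pearsonr(confirmatory[:, i], confirmatory[:, j])
print(f"Same pair in new data: p = {p:.3f}  (usually nothing there)")
```

Writing up the first result as if that pair of variables had been the hypothesis all along is HARKing; the second test is what genuine confirmation looks like.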
Problem #5: Small Sample Sizes
Many psychology studies have laughably small sample sizes:
- 20 participants per condition
- Sometimes as few as 10
Why this is a problem:
1. Low statistical power:
- Small samples can’t reliably detect effects
- Only large effects will be significant
- But real effects are often small
2. Winner’s curse:
- When a small study finds significance, the effect size is probably overestimated
- You got lucky with an extreme sample
- Replication will show smaller (or no) effect
3. Sampling error:
- Small samples are more subject to random variation
- What looks like an effect might just be a weird sample
Example:
You want to know if a coin is fair. You flip it 10 times and get 9 heads.
Conclusion: “The coin is biased toward heads! p < 0.05!”
Reality: Flip it 1,000 times and you get 510 heads. The coin is essentially fair. Your first sample of 10 just happened to land on an extreme run of random variation.
Many psychology studies are like flipping a coin 10 times and declaring profound discoveries.
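If you want to check the coin arithmetic yourself, here’s a short sketch using scipy’s exact binomial test; the counts are just the ones from the example above.

```python
# Sketch: the coin example, with exact binomial tests.
from scipy.stats import binomtest

# Small sample: 9 heads out of 10 flips of a (secretly fair) coin
small = binomtest(9, n=10, p=0.5, alternative="two-sided")
print(f"10 flips, 9 heads:     p = {small.pvalue:.3f}")   # ~0.02: "significant!"

# Large sample: 510 heads out of 1,000 flips
large = binomtest(510, n=1000, p=0.5, alternative="two-sided")
print(f"1000 flips, 510 heads: p = {large.pvalue:.3f}")   # far above 0.05: no evidence of bias
```

The lesson isn’t that small studies are worthless; it’s that a single extreme result from a small sample is weak evidence on its own.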
Problem #6: Lack of Pre-Registration
In medicine, clinical trials are pre-registered. Researchers must:
- State their hypothesis in advance
- Specify their analysis plan
- Register with a public database
This prevents p-hacking and HARKing.
Psychology historically didn’t do this.
Researchers could:
- Try multiple hypotheses
- Explore the data
- Report what “worked”
- Never mention what didn’t
It’s impossible to tell from reading a paper whether the findings are real or the result of data exploration.
This is finally changing (more on this later), but decades of research lack pre-registration.
Famous Examples: Studies That Didn’t Replicate
Let’s look at some high-profile findings that collapsed under replication attempts.
Example 1: Ego Depletion (The Willpower Battery)
The Original Claim:
Willpower is a limited resource. Exerting self-control in one domain (resisting cookies) depletes your ability to exert self-control in another domain (solving hard puzzles).
The Evidence:
Hundreds of studies showed this effect. Meta-analyses confirmed it. Roy Baumeister’s research became famous. Books were written. Strategies were designed around it.
The Replication:
In 2016, a massive pre-registered replication with 2,141 participants across 23 labs found… no evidence of ego depletion.
The effect vanished.
What happened?
Likely a combination of:
- Publication bias (negative results weren’t published)
- Small samples (early studies had few participants)
- P-hacking (flexibility in how “depletion” was measured)
The implications:
I’m not saying ego depletion is definitely false. But the evidence is far weaker than we thought. The effect (if it exists) is much smaller and more context-dependent than claimed.
This was taught as established fact. It was in textbooks. And it might not be real.
Example 2: Power Posing
The Original Claim:
Standing in a powerful pose (hands on hips, chest out) for 2 minutes increases confidence, testosterone, and risk-taking.
The Evidence:
Amy Cuddy’s 2010 study. Became a viral TED talk (60+ million views). Inspired a movement. People power posed before job interviews.
The Replication:
Multiple attempts to replicate found:
- No effect on testosterone
- Weak or no effect on behavior
- Maybe a small effect on self-reported feelings (but that could be placebo)
The aftermath:
One of the original co-authors, Dana Carney, publicly stated she no longer believes the effect is real.
Amy Cuddy stands by the finding (sort of—she’s walked back the hormonal claims).
The truth: Power posing might make you feel slightly more confident (placebo effect?), but the dramatic hormonal and behavioral effects claimed in the original study don’t hold up.
Example 3: Social Priming
The Original Claim:
Subtle environmental cues influence behavior unconsciously.
Famous examples:
Florida Effect (Bargh et al., 1996):
- Participants unscrambled sentences containing words associated with the elderly (Florida, wrinkle, gray, bingo)
- Afterward, they walked slower down the hallway
- Conclusion: Priming “elderly” activated elderly stereotypes, affecting behavior
Money Priming:
- Seeing images of money makes people more selfish and less helpful
Achievement Priming:
- Exposing people to briefcases and fountain pens improves performance on achievement tasks
The Replication:
Most of these effects failed to replicate.
The Florida effect, in particular, has been attempted dozens of times with mostly null results.
What likely happened:
- Small samples (original Florida study: 30 participants)
- Researcher expectations (experimenters might unconsciously influence participants)
- P-hacking and selective reporting
Some priming effects are real (semantic priming in reaction time studies is robust), but the dramatic behavioral effects from subtle cues? Highly questionable.
Example 4: Facial Feedback Hypothesis (Pen-in-Teeth)
The Original Claim:
Forcing your face into a smile (by holding a pen in your teeth) makes you feel happier. Facial expressions don’t just reflect emotions—they cause them.
The Evidence:
Classic study by Strack, Martin, & Stepper (1988). People rated cartoons as funnier when forced to smile.
The Replication:
A massive pre-registered replication (17 labs, 1,894 participants) found… no effect.
People didn’t rate cartoons as funnier when smiling.
The original authors defended the study, suggesting the replication changed important details.
But the core claim—that manipulating facial muscles changes emotional experience in this specific way—is now highly questionable.
Example 5: Growth Mindset
The Original Claim:
Teaching students that intelligence is malleable (growth mindset) rather than fixed improves academic performance, especially for struggling students.
The Evidence:
Carol Dweck’s research became hugely influential. Schools worldwide implemented growth mindset interventions.
The Replication:
Results are mixed:
- Some studies replicate
- Many don’t
- Effect sizes are much smaller than originally claimed
- May only work in specific contexts
- May fade over time
My take:
Growth mindset isn’t wrong—believing you can improve probably helps. But it’s not the silver bullet it was portrayed as.
The educational establishment ran with preliminary findings and scaled interventions before solid replication.
The Deeper Problem: Understanding What Science Actually Is
The replication crisis isn’t just “some studies were wrong.” It reveals fundamental misunderstandings about how science works.
Misunderstanding #1: Published = True
What people think: “It’s a peer-reviewed study in a top journal. It must be true.”
Reality: “It’s a preliminary finding that might be true, might be exaggerated, or might be false. Replication and meta-analysis will clarify.”
Publication is the beginning of scientific evaluation, not the end.
Misunderstanding #2: Statistically Significant = Important
What p < 0.05 actually means: “If there were truly no effect, there’s less than a 5% chance we’d see data this extreme by random chance.”
What people think it means: “There’s a 95% chance this finding is true and important.”
These are not the same.
Problems with p-values:
1. They don’t tell you the probability the hypothesis is true
- P-values assume the null hypothesis is true and ask “how surprising is this data?”
- They don’t tell you how likely the alternative hypothesis is
2. They don’t tell you how big or important the effect is
- p < 0.05 can mean a tiny, meaningless effect with a large sample
- Clinical significance ≠ statistical significance
3. They’re easy to hack
- As we’ve seen, p-values are easily manipulated
The American Statistical Association released a statement in 2016:
“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”
But that’s exactly what we’ve been doing.
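A back-of-the-envelope calculation shows why “p < 0.05” is not “95% likely true.” The prior (10% of tested hypotheses are actually true) and the statistical power (50%) below are illustrative assumptions, not measured values; the point is only that the answer depends heavily on them.

```python
# Sketch: P(hypothesis is true | result is "significant"), via Bayes' rule.
# ASSUMED, illustrative numbers -- not measurements of any real field.
prior_true = 0.10   # fraction of tested hypotheses that are actually true
power      = 0.50   # chance a true effect reaches p < 0.05 (typical small study)
alpha      = 0.05   # chance a null effect reaches p < 0.05 anyway

significant_and_true  = prior_true * power
significant_and_false = (1 - prior_true) * alpha

posterior = significant_and_true / (significant_and_true + significant_and_false)
print(f"P(true | significant) = {posterior:.0%}")   # ~53%, nowhere near 95%
```

With publication bias and p-hacking pushing the effective alpha higher, the real-world number for surprising, one-off findings can be lower still.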
Misunderstanding #3: A Single Study Proves Anything
How science should work:
- Exploratory research generates hypotheses
- Confirmatory research tests hypotheses (pre-registered)
- Replication verifies findings across contexts
- Meta-analysis synthesizes evidence
- Confidence builds gradually
How it actually works:
- Single study published
- Media coverage: “Science says X!”
- Everyone believes X
- No one does replications (not novel enough)
- Original finding might be completely wrong
Science is supposed to be cumulative and self-correcting. But the incentives prevent this from happening.
Misunderstanding #4: Researchers Are Objective
The ideal: Scientists are neutral truth-seekers who follow data wherever it leads.
The reality: Scientists are humans with:
- Career incentives (publish or perish)
- Confirmation bias (we see what we expect)
- Motivated reasoning (we defend our theories)
- Ego investment (our research defines our identity)
This doesn’t make scientists bad people. It makes them human.
But it means we need systems that counteract these biases:
- Pre-registration (commit to analysis before seeing data)
- Open data (others can check your work)
- Replication (verify findings independently)
- Adversarial collaboration (work with critics)
We haven’t had these systems. That’s why we’re in crisis.
What’s Being Done? The Reform Movement
The good news: the scientific community is taking the replication crisis seriously and implementing reforms.
Reform #1: Pre-Registration
Researchers now register their hypotheses, methods, and analysis plans before collecting data.
What this prevents:
- P-hacking
- HARKing
- Selective reporting
Platforms:
- Open Science Framework (OSF)
- AsPredicted
- ClinicalTrials.gov (for clinical research)
This is becoming standard practice in top psychology journals.
Reform #2: Open Data and Open Materials
Researchers share:
- Raw data
- Analysis code
- Experimental materials
- Full methods
What this enables:
- Others can verify analyses
- Researchers can re-analyze with different methods
- Meta-analyses can use original data
- Transparency reveals errors
Requirements: Many journals now require or strongly encourage open data.
Reform #3: Registered Reports
A new publication format:
Stage 1:
- Submit hypothesis and methods before data collection
- Journal reviews and approves
- Acceptance is conditional (if you follow the plan, they’ll publish regardless of results)
Stage 2:
- Conduct research
- Submit results
- Journal publishes (even if results are null)
What this solves:
- Publication bias (negative results get published)
- P-hacking (analysis is locked in advance)
- HARKing (hypothesis is on record)
Adoption: 300+ journals now offer registered reports.
Reform #4: Larger Sample Sizes
Researchers are recognizing that studies with 20 participants per condition are underpowered.
New standards:
- Power analysis before data collection
- Samples of hundreds, not dozens
- Multi-lab collaborations for large-scale data
Example: The Many Labs project runs replications across dozens of labs worldwide with thousands of participants.
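Here’s a minimal sketch of what “power analysis before data collection” can look like: simulate the planned study at different sample sizes and see how often a real but modest effect (Cohen’s d = 0.3, an assumption chosen for illustration) would actually reach significance.

```python
# Sketch: simulation-based power analysis for a two-group comparison.
# Assumed true effect size d = 0.3 (illustrative); alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, alpha, n_sims = 0.3, 0.05, 2_000

for n_per_group in (20, 50, 100, 200, 350):
    hits = 0
    for _ in range(n_sims):
        control   = rng.normal(0.0,    1.0, n_per_group)
        treatment = rng.normal(true_d, 1.0, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            hits += 1
    print(f"n = {n_per_group:>3} per group -> power ~ {hits / n_sims:.0%}")
# With n = 20 per group, power is only about 15%; you need roughly 175-200
# per group to reach the conventional 80% for an effect this size.
```

Running this before collecting data tells you whether your planned study can realistically detect the effect you care about, or whether a “null result” would be uninterpretable.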
Reform #5: Replication Culture
Replications are finally being valued:
- Journals that explicitly welcome replication studies (e.g., Royal Society Open Science)
- Replication studies can be published in top journals
- Career credit for conducting replications
- Understanding that replication is essential, not an attack
Reform #6: Meta-Science
Researchers are studying the research process itself:
- How often do studies replicate?
- What factors predict replication success?
- How can we improve scientific practices?
This is science examining its own methods. It’s essential for progress.
Reform #7: Better Statistical Practices
Movement away from blind reliance on p < 0.05:
- Report effect sizes and confidence intervals (see the sketch after this list)
- Use Bayesian statistics (which estimate how probable a hypothesis is, given the data)
- Consider practical significance, not just statistical significance
- Avoid dichotomous thinking (significant vs. not significant)
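As a sketch of what “report effect sizes and confidence intervals” means in practice, here’s a toy two-group comparison on fabricated data, reporting Cohen’s d and a bootstrap confidence interval alongside the p-value.

```python
# Sketch: report effect size + uncertainty, not just "p < 0.05".
# The data here are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control   = rng.normal(0.0, 1.0, 150)
treatment = rng.normal(0.4, 1.0, 150)   # assumed true effect of d = 0.4

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Bootstrap a 95% CI for the effect size (resample each group with replacement)
boot = [cohens_d(rng.choice(treatment, treatment.size),
                 rng.choice(control, control.size))
        for _ in range(5_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

p = stats.ttest_ind(treatment, control).pvalue
print(f"p = {p:.4f}, Cohen's d = {cohens_d(treatment, control):.2f}, "
      f"95% CI [{lo:.2f}, {hi:.2f}]")
```

A reader can see at a glance both how big the effect is and how uncertain the estimate is, which a bare p-value hides.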
What This Means For You: How to Read Psychology Research
You’re not a researcher, but you read about psychology studies in the news, self-help books, and articles (like this one).
How should you think about scientific claims?
Guideline #1: Be Skeptical of Single Studies
When you see: “New study shows that [surprising finding]!”
Think: “Interesting preliminary finding that needs replication.”
Don’t immediately change your behavior based on one study.
Guideline #2: Look for Replication and Meta-Analyses
Better evidence:
- Multiple independent replications
- Meta-analyses (statistical synthesis of many studies)
- Pre-registered studies
- Large samples
Example:
“Does meditation reduce stress?”
- Single study with 30 participants: Weak evidence
- Meta-analysis of 47 randomized controlled trials with 3,500 participants: Strong evidence
Strength of evidence scales with convergence across studies.
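For the curious, here’s a hedged sketch of the core arithmetic behind a simple fixed-effect meta-analysis: weight each study’s effect estimate by the inverse of its variance and pool. The five “studies” below are invented numbers, not real trials.

```python
# Sketch: fixed-effect meta-analysis via inverse-variance weighting.
# Effect sizes and standard errors below are made up for illustration.
import numpy as np

effects = np.array([0.45, 0.10, 0.30, 0.22, 0.15])   # per-study effect estimates
se      = np.array([0.30, 0.12, 0.20, 0.10, 0.08])   # per-study standard errors

weights = 1 / se**2                       # precise studies count for more
pooled  = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"Pooled effect: {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.2f} to {pooled + 1.96 * pooled_se:.2f})")
```

Real meta-analyses also model between-study heterogeneity (random-effects models) and probe for publication bias, but inverse-variance pooling is the core idea.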
Guideline #3: Check the Sample Size
Red flags:
- “Study of 20 college students shows…”
- “Experiment with 15 participants finds…”
This doesn’t mean the finding is wrong, but it means:
- Treat it as exploratory
- Wait for replication
- Don’t generalize broadly
Guideline #4: Beware Extraordinary Claims
As Carl Sagan said: “Extraordinary claims require extraordinary evidence.”
If a study claims:
- “This simple trick changes your life!”
- “Scientists discover the secret to [success/happiness/health]!”
- “Revolutionary finding upends everything we thought!”
Demand:
- Large samples
- Multiple replications
- Pre-registration
- Plausible mechanism
Most likely:
- Small sample got a lucky result
- P-hacking produced false positive
- Media exaggerated modest finding
Guideline #5: Understand the Difference Between “Significant” and “Large”
A study might report: “Mindfulness training significantly improved focus (p < 0.05)”
Questions to ask:
- How much did it improve? (effect size)
- Is that improvement meaningful in real life?
- How long does it last?
Statistical significance doesn’t equal practical importance.
Guideline #6: Consider the Source
Higher credibility:
- Pre-registered studies
- Registered reports
- Studies with open data
- Multi-lab collaborations
- Recent studies (post-crisis reforms)
Lower credibility:
- Studies from pre-2015 without replication
- Studies that refuse to share data
- Studies with conflicts of interest
- Studies contradicting robust meta-analyses
Guideline #7: Don’t Dismiss All Psychology Research
The replication crisis doesn’t mean: “All psychology is fake!”
It means: “Some published findings are false or exaggerated. We need better methods to distinguish real from false.”
Much psychology research is solid:
- Well-replicated cognitive phenomena
- Robust clinical interventions (CBT, exposure therapy)
- Psychometrics (Big Five personality, IQ)
- Developmental psychology basics
- Social psychology with large effects
The crisis is about improving standards, not abandoning the field.
My Personal Takeaway: Embracing Uncertainty
The replication crisis changed how I think about knowledge.
Before: “Science says X. I believe X. I’ll live accordingly.”
After: “Science suggests X with [low/medium/high] confidence based on [single study/multiple studies/meta-analyses]. I’ll update my beliefs proportionally and remain open to new evidence.”
This is uncomfortable.
We want certainty. We want clear answers. We want experts to tell us what’s true.
But science doesn’t provide certainty. It provides:
- Degrees of confidence
- Provisional conclusions
- Best current evidence
- Probabilistic claims
And that’s okay.
Living with uncertainty is more honest than false confidence.
What I Still Trust
High confidence:
- Core psychological principles with decades of replication (e.g., classical conditioning, cognitive biases that replicate)
- Interventions with large effect sizes in randomized controlled trials (e.g., CBT for anxiety)
- Meta-analyses of robust effects
Medium confidence:
- Recent findings with pre-registration and decent sample sizes
- Effects that replicate across multiple labs
- Findings consistent with related evidence
Low confidence:
- Single studies, especially with small samples
- Surprising findings without replication
- Studies from before the reform era
I update these as new evidence emerges.
What I’ve Changed
I no longer:
- Cite single studies as proof
- Build entire systems around preliminary findings
- Trust flashy claims without checking replication
- Assume published = true
I now:
- Look for meta-analyses and replications
- Check sample sizes
- Prefer pre-registered studies
- Hold claims loosely until confirmed
- Update beliefs with new evidence
This makes me a slower adopter of new ideas. But it makes me right more often.
The Bigger Picture: What Is Science For?
The replication crisis is humbling and important.
It teaches us:
1. Science is a process, not a collection of facts
Science doesn’t “prove” things. It gathers evidence, tests hypotheses, updates models, and gradually converges on truth (hopefully).
2. Science is done by humans with human flaws
Researchers aren’t objective machines. They have biases, incentives, and limitations. Systems must account for this.
3. Science requires skepticism—even of itself
The replication crisis was discovered by scientists questioning their own field. This is science working as it should (eventually).
4. Uncertainty is honest
Admitting “we don’t know yet” or “the evidence is mixed” is more scientific than false confidence.
5. Reforms work, but require collective action
Pre-registration, open data, and replication culture are fixing the crisis. But they require journals, funders, and institutions to change incentives.
Final Thoughts: Trust Science, But Understand Science
Should you trust psychology research?
Yes, but wisely.
- Trust the scientific method (hypothesis → test → replicate → revise)
- Trust the community’s ability to self-correct (eventually)
- Trust strong, replicated, converging evidence
Don’t trust:
- Single flashy studies
- Preliminary findings reported as facts
- Research done under perverse incentives without safeguards
The replication crisis isn’t a reason to reject science. It’s a reason to demand better science.
And as consumers of research—whether you’re reading self-help books, making health decisions, or just trying to understand yourself—you can be part of the solution by:
- Asking critical questions
- Demanding evidence quality
- Accepting uncertainty
- Updating beliefs with new evidence
Science is messy, slow, and uncertain.
But it’s still the best method we have for understanding reality.
How has the replication crisis changed how you think about research? What studies have you believed that turned out to be questionable? I’d love to hear your thoughts.