
Why our calibration framework dropped quadruple-signal stacking entirely

n=10–30 cohorts that look like winners and fail every robustness test. The quietest "alpha" in the literature comes from sample sizes you wouldn't trust for a coin flip.
Published 2026-05-01 · Interactive Market Data Research

If two stacked signals beat one, and three beat two, then four should beat three. The math says r/σ keeps tightening as each independent conditioning event reduces the cohort to higher-quality observations. Some published quant strategies sit on quadruple- and quintuple-stacks. We tried it. The numbers looked elite. We don't publish them.

What we found

Our 12-signal library produces 495 possible 4-of-12 combinations. After event-construction (where each combination requires all four signals to fire on the same cik+fy), the n distribution looks like:

Combinations             Median n   p75 n   p95 n
Pairs (66 total)              403     687   4,148
Triples (220 total)            97     171     402
Quadruples (495 total)         16      34      89

Half of our quadruples have n ≤ 16. The "best" by r/σ all sit between n=8 and n=23. Those are sample sizes you wouldn't accept for a clinical trial of cough syrup, let alone a $50M short.
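
The event-construction step is a set intersection per (cik, fy): a combination only produces an event where every signal in it fired on the same filing. A minimal toy sketch (the four signal names appear later in the post, but the firing sets here are invented for illustration):

```python
from itertools import combinations

# Toy data: signal name -> set of (cik, fy) filings where it fired.
# These firing sets are invented; real firing logic is not shown here.
firings = {
    "capex_spike":               {(1, 2008), (1, 2009), (2, 2009), (3, 2020), (4, 2015)},
    "dso_drift_severe":          {(1, 2008), (1, 2009), (2, 2009), (3, 2020)},
    "accruals_quality_low":      {(1, 2008), (2, 2009), (3, 2020), (5, 2017)},
    "margin_compression_severe": {(1, 2008), (2, 2009), (3, 2020)},
}

def stack_cohort(signal_names):
    """Events where ALL listed signals fired on the same (cik, fy)."""
    sets = [firings[s] for s in signal_names]
    return sorted(set.intersection(*sets))

# Each 4-of-12 combination would be built this way; with a 4-signal toy
# library the only quadruple is the full stack, and n collapses fast.
for combo in combinations(firings, 4):
    print(len(stack_cohort(combo)), stack_cohort(combo))
```

Each added signal can only shrink the intersection, which is why the n distribution above collapses from hundreds of events per pair to a median of 16 per quadruple.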

What "looks like winners" means at n=15

One quadruple from the early sweep: capex_spike + dso_drift_severe + accruals_quality_low + margin_compression_severe. n=15 events. 20-day mean return −12.4%. Standard deviation 8.1%. r/σ −1.53. That's a publishable headline number — looks like institutional alpha.

Then we look at the cohort:

The "alpha" is two crisis windows applied to a narrow industry slice with two duplicate firms. Pull any of those out and r/σ goes to ~−0.4. The quadruple isn't conditioning on independent signals — it's conditioning on a small set of crisis-distressed industrial filings.
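
The fragility is easy to demonstrate with a leave-the-cluster-out check. The returns below are invented, not the actual cohort, but they reproduce the shape: a handful of clustered crisis events carry the headline r/σ, and removing them collapses it.

```python
import statistics

def r_over_sigma(returns):
    """Mean 20-day return divided by its sample standard deviation."""
    return statistics.mean(returns) / statistics.stdev(returns)

# Hypothetical 20-day returns for a 15-event cohort: four crisis-window
# events dominate, the other eleven hover around zero.
crisis_events = [-0.30, -0.28, -0.25, -0.27]
other_events  = [-0.02, 0.01, -0.04, 0.03, -0.01, 0.02,
                 -0.03, 0.00, -0.02, 0.01, -0.05]

print(round(r_over_sigma(crisis_events + other_events), 2))  # headline, all 15
print(round(r_over_sigma(other_events), 2))                  # crisis windows removed
```

The second number is a fraction of the first: the "signal" was mostly the cluster.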

Why this happens systematically

Cohort independence breaks down below n ≈ 50, in three forms:

  1. Same-ticker multi-firing. Companies with chronic stress fire the same signals year after year. We dedupe by (cik, fy, signal_set), but four-signal cohorts often repeat the same 5-10 companies across multiple years. Our concentration cap (no single ticker > 30% of events) eliminates the egregious cases but doesn't catch the "5 companies × 3 fiscal years each" pattern.
  2. Sector concentration. The signals in our library weren't designed to be orthogonal — most of them are sensitive to the same underlying business deterioration patterns. Stacking four forensic signals over-selects for one industry's distress shape (typically industrials or commodity producers). When the cohort becomes 70%+ one sector, the "alpha" is sector beta.
  3. Regime concentration. Forensic signals fire heavily during recessionary periods (2008-2009, 2020-2021). Quadruple stacks compress the cohort into those windows, then claim alpha that is really a recession beta. The effect compounds with sector concentration: industrials in 2009 carry massive shared exposure to the same macro shock.
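
A minimal sketch of the dedupe and concentration-cap logic described in point 1 (the event shape and field names are assumptions), showing concretely why a 30% single-ticker cap never trips on the "5 companies × 3 fiscal years" pattern:

```python
from collections import Counter

def dedupe(events):
    """Keep one event per (cik, fy, signal_set) key."""
    seen, out = set(), []
    for ev in events:
        key = (ev["cik"], ev["fy"], frozenset(ev["signals"]))
        if key not in seen:
            seen.add(key)
            out.append(ev)
    return out

def passes_concentration_cap(events, cap=0.30):
    """No single cik may account for more than `cap` of the cohort."""
    counts = Counter(ev["cik"] for ev in events)
    return max(counts.values()) / len(events) <= cap

# Five companies firing in three fiscal years each: every event survives
# dedupe, each cik is exactly 3/15 = 20% of the cohort, and the cap never
# trips -- even though only five distinct businesses drive the "alpha".
events = dedupe([{"cik": c, "fy": y, "signals": ("a", "b", "c", "d")}
                 for c in range(5) for y in (2008, 2009, 2020)])
print(len(events), passes_concentration_cap(events))
```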

None of these are "dishonesty" — they're statistical inevitabilities that emerge as you stack more conditioning events onto a finite sample. By the time you've conditioned on four signals, the surviving cohort has lost most of its claim to independence.

What we publish instead

We cap at triple-stacking. Triples have a median n of 97 and require validation across multi-period (does the alpha hold pre-2015 vs post-2020?), multi-sector (do at least 4 SIC divisions appear?), and out-of-sample (train/test split at the event level). Our 247 published triples all pass these checks. The quadruple sweep produced only ~30 candidates that passed; on inspection, each was an LBRDA-style duplication (a triple's cohort plus one additional ticker), not real fourth-signal conditioning.
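
The three gates can be sketched mechanically. In this sketch the helper names, the event shape, and the 30% test fraction are assumptions; only the pre-2015/post-2020 split, the 4-division floor, and the event-level split come from the checks named above.

```python
import random

def multi_period_ok(events):
    """Alpha must point the same way pre-2015 and post-2020, not just overall."""
    pre  = [e["ret"] for e in events if e["fy"] < 2015]
    post = [e["ret"] for e in events if e["fy"] > 2020]
    return bool(pre) and bool(post) and sum(pre) < 0 and sum(post) < 0

def multi_sector_ok(events, min_divisions=4):
    """At least four distinct SIC divisions must appear in the cohort."""
    return len({e["sic_division"] for e in events}) >= min_divisions

def train_test_split(events, test_frac=0.3, seed=0):
    """Out-of-sample split at the event level (the fraction is an assumption)."""
    rng = random.Random(seed)
    shuffled = events[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]
```

Splitting at the event level (rather than by ticker or by year) is the weakest of the three gates on its own, which is why the period and sector checks run alongside it.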

The public calibration leaderboard at /api/calibration/triples reflects this: n≥30 minimum, validation flags on every entry, sector/period diversity stats included. No quadruples. The deepest-distress patterns are the ones in the fallen-angel post-mortem — and even those are triples, not quadruples.

Honest take on the literature

Several published quant strategies use 4-, 5-, or even 6-signal stacks with reported r/σ of 1.5+ on n=20-50 cohorts. We don't trust those numbers. Three things to ask before believing a multi-signal stack:

  1. What's the unique-ticker count? If 60%+ of events come from < 10 tickers, the alpha is single-name idiosyncrasy.
  2. What's the sector distribution? If one SIC division accounts for > 50% of events, the alpha is sector beta and won't generalize.
  3. What's the period distribution? If > 40% of events landed in two or fewer recession windows, the alpha is recession beta.
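
The three questions translate directly into a screen. This sketch assumes each event carries cik, fy, and sic_division, and simplifies "two or fewer recession windows" to the two windows named earlier in the post (2008–2009, 2020–2021); the thresholds are the ones stated above.

```python
from collections import Counter

RECESSION_YEARS = {2008, 2009, 2020, 2021}  # the two windows named in the post

def stack_red_flags(events):
    """Run the three robustness checks; returns the list of failures."""
    n = len(events)
    flags = []

    # 1. Unique-ticker count: 60%+ of events from fewer than 10 tickers.
    top = Counter(e["cik"] for e in events).most_common(9)
    if sum(c for _, c in top) / n >= 0.60:
        flags.append("single-name idiosyncrasy")

    # 2. Sector distribution: one SIC division above 50% of events.
    if max(Counter(e["sic_division"] for e in events).values()) / n > 0.50:
        flags.append("sector beta")

    # 3. Period distribution: over 40% of events in the recession windows.
    if sum(e["fy"] in RECESSION_YEARS for e in events) / n > 0.40:
        flags.append("recession beta")

    return flags
```

An empty list is necessary, not sufficient: a stack that clears all three checks can still fail out-of-sample, but one that trips any of them isn't worth testing further.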

Run those three checks against any published stack with n < 50. We've yet to see one survive all three.

The discipline to drop the impressive-but-fake numbers is the product. Our calibration page is honest about what we've tested and what we threw away. The framework is the moat.

For the live triple leaderboard: /api/calibration/triples. For the validation logic: /research/validation-framework-99-99. For the fallen-angel discovery story: /research/fallen-angel-post-mortem.
