Why our calibration framework dropped quadruple-signal stacking entirely
If two stacked signals beat one, and three beat two, then four should beat three. The math says r/σ keeps tightening as each independent conditioning event reduces the cohort to higher-quality observations. Some published quant strategies sit on quadruple- and quintuple-stacks. We tried it. The numbers looked elite. We don't publish them.
What we found
Our 12-signal library produces 495 possible 4-of-12 combinations. After event-construction (where each combination requires all four signals to fire on the same cik+fy), the n distribution looks like:
| Combinations | Median n | p75 n | p95 n |
|---|---|---|---|
| Pairs (66 total) | 403 | 687 | 4,148 |
| Triples (220 total) | 97 | 171 | 402 |
| Quadruples (495 total) | 16 | 34 | 89 |
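The combination counts above are just 12-choose-k, and the n collapse is set intersection: a stack's cohort is the (cik, fy) pairs on which every signal fired. A minimal sketch, with toy event sets that are purely hypothetical (the real event store isn't shown here), illustrates both:

```python
from math import comb

# Counts of k-of-12 stacks for a 12-signal library
for k in (2, 3, 4):
    print(f"{k}-of-12 combinations: {comb(12, k)}")  # 66, 220, 495

# A stack's cohort is the intersection of each signal's event set:
# every signal must fire on the same (cik, fy). These event sets are
# toy data, only to show why n shrinks as k grows.
signal_events = {
    "capex_spike":               {(1001, 2019), (1002, 2020), (1003, 2009), (1005, 2014)},
    "dso_drift_severe":          {(1001, 2019), (1003, 2009), (1004, 2021), (1005, 2014)},
    "accruals_quality_low":      {(1001, 2019), (1003, 2009)},
    "margin_compression_severe": {(1003, 2009)},
}

def stack_cohort(signals, events=signal_events):
    """Events on which every signal in the stack fired (same cik + fy)."""
    return set.intersection(*(events[s] for s in signals))

names = list(signal_events)
for k in (2, 3, 4):
    print(f"{k}-stack cohort n = {len(stack_cohort(names[:k]))}")
```

Each added condition can only remove events, so median n falls monotonically with k; the table above is that arithmetic applied to real filings.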
Half of our quadruples have n ≤ 16. The "best" by r/σ all sit between n=8 and n=23. Those are sample sizes you wouldn't accept for a clinical trial of cough syrup, let alone a $50M short.
What "looks like winners" means at n=15
One quadruple from the early sweep:
```
capex_spike + dso_drift_severe + accruals_quality_low + margin_compression_severe
n = 15 events
20-day mean return: −12.4%
standard deviation: 8.1%
r/σ = −1.53
```
That's a publishable headline number — looks like institutional alpha.
Then we look at the cohort:
- 11 of 15 events were industrials with SIC 3500-3599 (general machinery). One sector, one industry sub-bucket.
- 4 of 15 events were two related-party companies (Owens-Illinois and Owens Corning) over different fiscal years. Effective sample n=11 unique entities.
- 3 events landed in calendar 2009 (post-financial crisis); 2 in calendar 2020 (COVID); the rest scattered. Two narrow regime windows account for 33% of the events.
The "alpha" is two crisis windows applied to a narrow industry slice with two duplicate firms. Pull any of those out and r/σ goes to ~−0.4. The quadruple isn't conditioning on independent signals — it's conditioning on a small set of crisis-distressed industrial filings.
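The sensitivity is easy to demonstrate. The sketch below uses entirely synthetic 20-day returns (the real cohort's returns are not reproduced here); the shape, not the numbers, is the point: a handful of crisis-window events carry the headline r/σ, and excluding them collapses the ratio:

```python
from statistics import mean, stdev

# Synthetic returns for a hypothetical n=15 cohort. Five events sit in
# crisis windows (2009, 2020) and carry the large negative moves; the
# other ten are unremarkable.
crisis_returns = [-0.30, -0.26, -0.28, -0.22, -0.25]
other_returns  = [-0.03, 0.01, -0.05, 0.02, -0.04, 0.00, -0.02, 0.03, -0.06, -0.01]

def r_over_sigma(returns):
    """Mean return divided by its standard deviation."""
    return mean(returns) / stdev(returns)

full = r_over_sigma(crisis_returns + other_returns)
ex_crisis = r_over_sigma(other_returns)
print(f"full cohort r/σ:      {full:.2f}")
print(f"ex-crisis-window r/σ: {ex_crisis:.2f}")
```

On data like this, the "alpha" attenuates sharply the moment the crisis events are held out, which is exactly the behavior we saw on the real quadruple.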
Why this happens systematically
Cohort independence breaks at n < ~50. Three forms:
- Same-ticker multi-firing. Companies with chronic stress fire the same signals year after year. We dedupe by (cik, fy, signal_set), but four-signal cohorts often repeat the same 5-10 companies across multiple years. Our concentration cap (no single ticker > 30% of events) eliminates the egregious cases but doesn't catch the "5 companies × 3 fiscal years each" pattern.
- Sector concentration. The signals in our library weren't designed to be orthogonal — most of them are sensitive to the same underlying business deterioration patterns. Stacking four forensic signals over-selects for one industry's distress shape (typically industrials or commodity producers). When the cohort becomes 70%+ one sector, the "alpha" is sector beta.
- Regime concentration. Forensic signals fire heavily during recessionary periods (2008-2009, 2020-2021). Quadruple stacks compress the cohort into those windows, then claim alpha that is really a recession beta. The effect compounds with sector concentration: industrials in 2009 carry massive shared exposure to the same macro shock.
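The gap in the 30% concentration cap is worth making concrete. A minimal sketch, with a hypothetical cohort shaped exactly like the failure mode above (5 companies × 3 fiscal years each; function names are ours, not the production pipeline's):

```python
from collections import Counter

# Each event is (cik, fiscal_year, sic). Hypothetical cohort: five
# chronically stressed companies, each firing in three fiscal years.
events = [(cik, fy, 3550)
          for cik in (1, 2, 3, 4, 5)
          for fy in (2008, 2009, 2020)]

def max_ticker_share(events):
    """Share of events contributed by the single most frequent cik."""
    counts = Counter(cik for cik, _, _ in events)
    return max(counts.values()) / len(events)

def effective_entities(events):
    """Number of unique companies behind the cohort."""
    return len({cik for cik, _, _ in events})

print(max_ticker_share(events))   # 0.2 — comfortably under the 30% cap
print(effective_entities(events)) # 5  — but only 5 entities back 15 events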
None of this is dishonesty; these are statistical inevitabilities that emerge as you stack more conditions on a finite pool of events. By the time you've conditioned on four signals, the surviving cohort has lost most of its claim to independence.
What we publish instead
We cap at triple-stacking. Triples have a median n of 97 and must validate across periods (does the alpha hold pre-2015 vs post-2020?), across sectors (do at least 4 SIC divisions appear?), and out-of-sample (train/test split at the event level). Our 247 published triples all pass these checks. The quadruple sweep produced ~30 candidates that passed those same checks; on inspection, every one was an LBRDA-style duplication of an existing triple plus one additional ticker, not real conditioning.
The public calibration leaderboard at /api/calibration/triples reflects this: n≥30 minimum, validation flags on every entry, sector/period diversity stats included. No quadruples. The deepest-distress patterns are the ones in the fallen-angel post-mortem — and even those are triples, not quadruples.
Honest take on the literature
Several published quant strategies use 4-, 5-, or even 6-signal stacks with reported r/σ of 1.5+ on n=20-50 cohorts. We don't trust those numbers. Three things to ask before believing a multi-signal stack:
- What's the unique-ticker count? If 60%+ of events come from < 10 tickers, the alpha is single-name idiosyncrasy.
- What's the sector distribution? If one SIC division accounts for > 50% of events, the alpha is sector beta and won't generalize.
- What's the period distribution? If > 40% of events landed in two or fewer recession windows, the alpha is recession beta.
Run those three checks against any published stack with n < 50. We've yet to see one survive all three.
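The three checks reduce to a short screening function. A sketch under assumed inputs (events as (ticker, sic_division, calendar_year) tuples; the recession-window set and all names are our illustration, with thresholds taken from the text above):

```python
from collections import Counter

RECESSION_WINDOWS = {2008, 2009, 2020, 2021}  # assumed window definition

def screen_stack(events):
    """Run the three checks on a stack's event list.
    A True flag means the check FAILED, i.e. the alpha is suspect."""
    n = len(events)
    tickers = Counter(t for t, _, _ in events)
    # 60%+ of events from fewer than 10 tickers -> single-name idiosyncrasy
    top9_share = sum(c for _, c in tickers.most_common(9)) / n
    # one SIC division accounting for > 50% of events -> sector beta
    top_sector_share = Counter(d for _, d, _ in events).most_common(1)[0][1] / n
    # > 40% of events inside recession windows -> recession beta
    recession_share = sum(y in RECESSION_WINDOWS for _, _, y in events) / n
    return {
        "idiosyncratic": top9_share >= 0.60,
        "sector_beta": top_sector_share > 0.50,
        "recession_beta": recession_share > 0.40,
    }

# Hypothetical n=15 cohort shaped like the quadruple above: fails all three.
events = ([("T1", "D", 2009)] * 6 + [("T2", "D", 2020)] * 4 +
          [(f"T{i}", "E", 2015) for i in range(3, 8)])
print(screen_stack(events))
```

Any flag set to True on a published n < 50 stack is reason enough to discount the headline r/σ.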
The discipline to drop the impressive-but-fake numbers is the product. Our calibration page is honest about what we've tested and what we threw away. The framework is the moat.
For the live triple leaderboard: /api/calibration/triples. For the validation logic: /research/validation-framework-99-99. For the fallen-angel discovery story: /research/fallen-angel-post-mortem.
Related signals
capex_spike, fcf_ni_divergence, fcf_turn_negative, margin_compression_severe, accruals_quality_low, dso_drift_severe, beneish_m_score_high
Citations
- Beneish, M.D. (1999). 'The Detection of Earnings Manipulation.' Financial Analysts Journal.
- Cooper, M.J., Gulen, H., & Schill, M.J. (2008). 'Asset Growth and the Cross-Section of Stock Returns.' Journal of Finance.
- Sloan, R.G. (1996). 'Do Stock Prices Fully Reflect Information in Accruals and Cash Flows About Future Earnings?' The Accounting Review.
- Interactive Market Data internal: project_calibration_v2_2026_04_27 (the original triple-stack analysis)
- Interactive Market Data internal: project_signal_combo_2026_04_29 (multi-signal combo predictions)