
Stacking is real, but only honestly: pair vs. triple vs. single signal calibration

What happens when you require two or three forensic signals to fire on the same company in the same fiscal year — and where the math still lies to you
Published 2026-04-30 · Interactive Market Data Research

If a single forensic signal predicts X returns over 20 days, what does requiring two signals — fired on the same company, in the same fiscal year — predict? Conventional wisdom says "more". The honest answer requires more care than the literature gives it.

The single-signal baseline

Our 12-signal library has been calibrated 5 ways: gross stock return, direction-applied PnL, net-of-execution-cost PnL, hit rate, and r/σ (mean-over-stdev). The "best" single short-side signal is capex_spike:

| Horizon | n | Gross stock return | Net PnL (short) |
|---|---|---|---|
| 1d | 2,295 | +0.01% | −0.06% |
| 5d | 2,292 | −0.85% | +0.74% |
| 20d | 2,258 | −1.91% | +2.05% |
| 60d | 2,127 | +0.65% | −0.78% |
| 252d | 2,092 | +14.03% | −14.91% |

Notice the sign change between 20d and 60d. This single signal works as a 20-day short, then mean-reverts hard at year horizons (we wrote about why in the fallen-angel post-mortem). Most people stop at 20d.
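The five calibration views are mechanical once you have the per-event forward returns. A minimal sketch of how they could be computed for one signal cohort; the function name `calibrate` and the `cost_bps` parameter are illustrative, not our production API:

```python
import numpy as np

def calibrate(gross_returns, direction=-1, cost_bps=15):
    """Compute the five calibration views for one signal cohort.

    gross_returns : forward stock returns (e.g. 20d), one per event
    direction     : -1 for a short-side signal, +1 for long
    cost_bps      : assumed round-trip execution cost in basis points
    """
    r = np.asarray(gross_returns, dtype=float)
    pnl = direction * r                    # direction-applied PnL
    pnl_net = pnl - cost_bps / 10_000      # net of execution costs
    return {
        "n": r.size,
        "gross": r.mean(),                 # gross stock return
        "pnl": pnl.mean(),
        "pnl_net": pnl_net.mean(),
        "hit_rate": (pnl > 0).mean(),      # fraction of profitable events
        "r_over_sigma": pnl_net.mean() / pnl_net.std(ddof=1),  # r/σ
    }
```

Note the sign convention: for a short-side signal like capex_spike, a negative gross return shows up as a positive net PnL, which is exactly the flip visible in the table above.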

The pair lift

Now require two signals to fire on the same (cik, fy): capex_spike AND fcf_turn_negative. The intuition: if both fire, the company is more deeply in trouble than if just one fires.

| Cohort | n | 20d gross return | 20d net PnL (short) | r/σ |
|---|---|---|---|---|
| capex_spike alone | 2,258 | −1.91% | +2.05% | +0.10 |
| capex_spike + fcf_turn_negative | ~600 | −2.65% | +3.92% | +0.16 |

The pair nearly doubles the net alpha and lifts r/σ. This is real, but only because the pair is genuinely a different cohort: companies that fire both signals had two independent triggers align in the same fiscal year. That is conditional probability doing exactly what it should.
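Constructing the pair cohort is a set-membership test per company-year. A sketch using pandas, with a toy `events` table whose schema (cik, fy, signal, ret_20d) is illustrative rather than our production layout:

```python
import pandas as pd

# One row per (cik, fy, signal) firing, with the 20d forward return.
# Toy data -- not the production event table.
events = pd.DataFrame({
    "cik":    [101, 101, 202, 303, 303],
    "fy":     [2022, 2022, 2022, 2023, 2023],
    "signal": ["capex_spike", "fcf_turn_negative",
               "capex_spike", "capex_spike", "fcf_turn_negative"],
    "ret_20d": [-0.03, -0.03, -0.01, -0.04, -0.04],
})

required = {"capex_spike", "fcf_turn_negative"}

# Keep (cik, fy) pairs where every required signal fired that year.
fired = events.groupby(["cik", "fy"])["signal"].agg(set)
both = fired[fired.apply(required.issubset)].index

pair_cohort = events.set_index(["cik", "fy"]).loc[both].reset_index()
pair_cohort = pair_cohort.drop_duplicates(["cik", "fy"])  # one event per company-year
```

The `drop_duplicates` on (cik, fy) matters: a company-year that fired both signals should be one event in the cohort, not two.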

The triple — and where the snake oil starts

If two stacked signals beat one, three should beat two. The mathematics are intoxicating: each additional independent signal that genuinely conditions the cohort should drive r/σ further. Some quant-trading literature claims r/σ of −1.7 from triples — the level where you can size a $100M short and have institutional risk officers sign off.

We initially saw exactly that. Our first triple-stack candidate was capex_spike + positive_eps_streak + zombie_alert: n=72, 20d mean −5.89%, r/σ −1.72. Elite numbers.

The problem: 64 of those 72 events were the same company. Liberty Broadband (LBRDA) had triggered the same triple year after year, fiscal period after fiscal period, share class after share class. The "diverse" n=72 cohort contained nine unique tickers, effectively one company.

The lift wasn't from genuine three-way conditioning. It was LBRDA's price action over our sample window, amplified by being counted 64 times. We covered the full forensic story in the fallen-angel post-mortem.
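This failure mode is cheap to detect before any return math runs. A minimal concentration diagnostic; the function name and the (top ticker, top share, unique-ticker count) return shape are our illustration, not a library API:

```python
from collections import Counter

def concentration_report(tickers):
    """Flag cohorts where one ticker dominates the event count.

    Returns (top_ticker, top_share, n_effective), where n_effective is
    the number of unique tickers -- a crude floor on how many
    independent observations the cohort can possibly contain.
    """
    counts = Counter(tickers)
    top, top_n = counts.most_common(1)[0]
    return top, top_n / len(tickers), len(counts)

# The phantom triple: 64 of 72 events were one ticker.
tickers = ["LBRDA"] * 64 + ["A", "B", "C", "D", "E", "F", "G", "H"]
top, share, n_eff = concentration_report(tickers)
```

On the phantom cohort this reports the top ticker holding roughly 89% of events across only nine unique names, which is the red flag the original run never checked.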

The validated triple

After we added a per-cohort concentration cap (no single ticker can exceed 30% of events) and deduped (cik, fy, signal_set) at event construction time, the strongest validated 20-day triple is:

capex_spike + fcf_ni_divergence + fcf_turn_negative

| Metric | Phantom triple (pre-fix) | Validated triple (post-fix) |
|---|---|---|
| n events | 72 claimed | 114 (real) |
| n unique tickers | 9 (effectively 1) | 89 |
| 20d mean return | −5.89% | −3.78% |
| r/σ | −1.72 | −0.39 |
| Validated multi-period | NO | YES |
| Validated multi-sector | NO | YES |
| Validated OOS | NO | YES |

The validated number is less spectacular but it's real. r/σ −0.39 is a tradeable signal — 4× a typical single-signal ratio, with diverse counterparties and statistical robustness. A fund running a $100M short book against this won't watch the backtest collapse on first contact with reality.
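The two fixes named above, dedup on (cik, fy, signal_set) and the 30% per-ticker cap, compose into a short cohort-hygiene pass. A single-pass sketch under assumed column names (ticker, cik, fy, signal_set); the real cap may need to iterate, since down-sampling one ticker changes every other ticker's share:

```python
import pandas as pd

MAX_TICKER_SHARE = 0.30  # no single ticker may exceed 30% of cohort events

def dedupe_and_cap(events: pd.DataFrame) -> pd.DataFrame:
    """Cohort hygiene: dedup company-years, then cap per-ticker share.

    Illustrative single-pass approximation, not the production pipeline.
    Assumes columns: ticker, cik, fy, signal_set.
    """
    # 1. One event per (cik, fy, signal_set): a company-year counts once.
    out = events.drop_duplicates(["cik", "fy", "signal_set"])
    # 2. Concentration cap: down-sample any ticker above the share limit.
    cap = max(1, int(MAX_TICKER_SHARE * len(out)))
    return out.groupby("ticker").head(cap).reset_index(drop=True)
```

Run against a cohort shaped like the phantom triple, step 1 removes the repeated company-year rows and step 2 truncates the dominant ticker to the cap, so the surviving events are as close to independent draws as this data allows.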

The deeper lesson

Most published "signal stacking" results in finance literature are contaminated by one of three subtle errors:

  1. Same-ticker multi-firing. A long-tenured stock with chronic stress fires the same signals year after year. Without dedup by (cik, fy), the cohort inflates with non-independent observations. Standard error formulas assume i.i.d. — they're wrong by 4-10× when this is happening.
  2. Survivor bias at long horizons. Companies that fire 3+ stress signals are deeply impaired. The 30% that go bankrupt or delist disappear from the price panel — their forward returns are NULL, not a big negative. The 70% that survive carry the average, and they survived because they bounced. Multi-stress stacks calibrate as deep-value LONGS at 252d, not shorts.
  3. Calibration on the same cohort used for selection. The cohort generating the alpha is the cohort being tested. Out-of-sample validation is not a checkbox; it's the only calibration that counts.
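The standard-error inflation in error 1 can be made concrete with synthetic numbers: compare the naive i.i.d. standard error against a cluster-aware one that averages within each ticker before measuring spread. The return distributions below are invented to caricature the LBRDA failure mode, not drawn from our data:

```python
import numpy as np

rng = np.random.default_rng(0)

# 72 events, but 64 of them are near-copies of one ticker's outcome
# (an exaggerated version of the same-ticker multi-firing problem).
lbrda_like = rng.normal(-0.06, 0.001, size=64)   # nearly identical events
others = rng.normal(0.0, 0.05, size=8)
returns = np.concatenate([lbrda_like, others])
clusters = np.array([0] * 64 + list(range(1, 9)))  # ticker id per event

# Naive SE treats all 72 events as independent draws.
naive_se = returns.std(ddof=1) / np.sqrt(len(returns))

# Cluster-aware SE: average within each ticker first, then take the
# spread across the (few) independent cluster means.
means = np.array([returns[clusters == c].mean() for c in np.unique(clusters)])
cluster_se = means.std(ddof=1) / np.sqrt(len(means))
```

On this toy setup the cluster-aware error is several times the naive one, which is the mechanism behind the 4-10× understatement: the naive formula divides by √72 when the data only supports something like √9.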

Our validation framework now catches all three: a per-ticker concentration cap, dedup on (cik, fy, signal_set) at event construction, and a mandatory out-of-sample split.
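The guard against error 3 is the simplest of the three to sketch: hold out a slice of fiscal years the stack never saw during selection, and require the held-out half to agree in sign. The split year, minimum-n threshold, and column names here are illustrative choices, not our production configuration:

```python
import pandas as pd

def oos_check(events: pd.DataFrame, split_fy: int = 2018) -> bool:
    """Error-3 guard: the cohort that selected the stack must not be
    the cohort that grades it.

    Splits by fiscal year and requires the held-out half to show the
    same sign of net PnL as the selection half. Illustrative sketch;
    assumes columns fy and pnl_net (positive = profitable direction).
    """
    train = events[events["fy"] < split_fy]["pnl_net"]
    test = events[events["fy"] >= split_fy]["pnl_net"]
    if len(train) < 20 or len(test) < 20:
        return False  # too thin to validate either half
    return bool(train.mean() > 0 and test.mean() > 0)
```

A time split, rather than a random one, is deliberate: random splits leak the same company-years into both halves, which quietly reintroduces error 1.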

What the data actually tells us

We ran the full pair and triple analysis across every co-firing combination in our 12-signal library, yielding 247 validated triples at the 20d horizon. Across that universe, three patterns emerge:

Pattern 1: Short-side stacks work at 20d, fail at year horizons. Every single forensic-short triple in our library is profitable as a short at 20 days net of costs (capex + dso_drift + margin_compression: +3.0% net short / 20d) but reverses badly at 252d (the same triple loses 70%+ as a short — see the deep-value-long story).

Pattern 2: Quality + stress = inflection long. Pair a quality signal (positive_eps_streak, dividend_initiation) with a stress signal (margin_compression_severe, fcf_turn_negative) and the pair calibrates as a long at 20d. The combination identifies high-quality companies in a temporary down-cycle. Top pair: fcf_turn_positive + margin_compression_severe — +52.69% net as a long over 20 days, n=80 events.

Pattern 3: All-stress triples are deep-value LONGS at 252d. This was the surprise. Every top-ranked 252d triple by absolute |pnl_net| is a long: capex_spike + dso_drift_severe + margin_compression_severe returns +70.32% net over 252d, n=51, r/σ +0.20. The forensic literature treats these as bearish flags. The calibration says: buy the stocks.

The honest take on stacking

Pair-wise stacking adds real conditioning information. The pairs work, the math holds, the alpha is genuine for cohorts that survive the concentration filter. r/σ of 0.1 to 0.4 net of costs.

Triple-wise stacking gets you exactly that — slightly more conditioning, slightly tighter r/σ — at the cost of dramatic reductions in n. With our universe (~5,000 tickers × 15 years × 12 signals), most validated triples have n in the 30-150 range. Below n=50 the standard errors get loose enough that "alpha" is mostly sample noise.

We tried quadruple stacking too. It produces n=10-30 cohorts that look like winners but fail every robustness test. We don't publish quadruples because we don't believe them.

The product, as ever, is the validation framework — the thing that told us our first triple was 64 LBRDA copies and not real alpha. Without that, we'd have shipped a phantom signal and built a track record on top of survivor-bias-amplified noise. The same is true at every other shop publishing signal-stacking results: ask them to show their concentration distribution and their out-of-sample split. The ones who can't are publishing fiction.

Try the data yourself: /api/calibration/triples for the validated leaderboard, or pyflo_signal_combo(cik, fy) on the MCP server for a per-company prediction with the best-match stack already selected.


Citations

  1. Cooper, Gulen, Schill (2008). 'Asset Growth and the Cross-Section of Stock Returns.' Journal of Finance.
  2. Sloan, R.G. (1996). 'Do stock prices fully reflect information in accruals and cash flows about future earnings?' The Accounting Review.
  3. Beneish, M.D. (1999). 'The Detection of Earnings Manipulation.' Financial Analysts Journal.
  4. Interactive Market Data internal: project_calibration_v2_2026_04_27 (the original triple-stack analysis)
  5. Interactive Market Data internal: project_signal_combo_2026_04_29 (per-company combo predictions)