Model Lifecycle, Calibration, and Drift
This guide explains how FoodSpec supports production-grade model lifecycle management: calibration diagnostics, drift detection, aging/sunset rules, and champion–challenger promotion.
Calibration Diagnostics
- API: `foodspec.ml.calibration.compute_calibration_diagnostics(y_true, y_proba, n_bins=10)`.
- Metrics returned: calibration slope and intercept, bias, expected and maximum calibration error (ECE/MCE), Brier score, and the reliability curve.
- Recalibration: `recalibrate_classifier(clf, X_cal, y_cal, method="platt"|"isotonic")` wraps scikit-learn calibration.
- When to recalibrate: ECE is elevated (above 0.05), the calibration slope is far from 1, or production drift indicates miscalibration.
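To make the metrics above concrete, the sketch below computes ECE, MCE, the Brier score, and reliability-curve points for a binary classifier from the positive-class probability, using equal-width bins. It is a plain-Python illustration under those assumptions, not FoodSpec's `compute_calibration_diagnostics`: the library may bin differently or use top-label confidence for multiclass problems, and calibration slope/intercept (which require a logistic fit) are omitted. The names `calibration_summary` and `CalibrationSummary` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CalibrationSummary:  # hypothetical container, not a FoodSpec type
    ece: float    # bin-weighted mean |observed rate - mean predicted probability|
    mce: float    # largest gap over all non-empty bins
    brier: float  # mean squared error of the positive-class probability
    curve: list   # (mean predicted prob, observed positive rate, count) per non-empty bin


def calibration_summary(y_true, p_pos, n_bins=10):
    """Binary calibration metrics with equal-width probability bins (sketch)."""
    n = len(y_true)
    if n == 0 or n != len(p_pos):
        raise ValueError("y_true and p_pos must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pos):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))
    ece, mce, curve = 0.0, 0.0, []
    for members in bins:
        if not members:
            continue
        conf = sum(p for _, p in members) / len(members)
        freq = sum(y for y, _ in members) / len(members)
        gap = abs(freq - conf)
        ece += len(members) / n * gap
        mce = max(mce, gap)
        curve.append((conf, freq, len(members)))
    brier = sum((p - y) ** 2 for y, p in zip(y_true, p_pos)) / n
    return CalibrationSummary(ece, mce, brier, curve)
```

With the rule of thumb above, `calibration_summary(y, p).ece > 0.05` on a holdout set would be a signal to recalibrate.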
Drift Monitoring
- API: `foodspec.qc.drift.detect_feature_drift(ref, prod, threshold=0.1)` returns KL divergence and PSI (population stability index) per feature.
- Production drift wrapper: `detect_production_drift(ref_df, prod_df, ref_perf, prod_perf)` → `ProductionDriftReport`.
- Recalibration trigger: `should_recalibrate(report, kl_thresh=0.2, perf_thresh=0.02)` flags when to update calibration or retrain.
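A minimal sketch of per-feature drift scoring, assuming equal-width bins taken from the reference range, a small smoothing constant for empty bins, PSI = Σ (q - p)·ln(q/p), and KL measured as KL(reference || production). The KL direction, binning scheme, and which metric `threshold=0.1` applies to inside `detect_feature_drift` are not stated in this guide, so treat them as assumptions. `feature_drift`, `drift_report`, and `needs_update` are hypothetical names; `needs_update` only mimics the idea of `should_recalibrate` using the thresholds quoted above.

```python
import bisect
import math


def _proportions(values, edges, eps=1e-4):
    """Share of values per bin, smoothed so no bin is exactly zero."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        counts[min(max(i, 0), n_bins - 1)] += 1  # clamp out-of-range values
    props = [c / len(values) + eps for c in counts]
    total = sum(props)
    return [p / total for p in props]


def feature_drift(ref, prod, n_bins=10):
    """Return (psi, kl) for one numeric feature; bins come from the reference range."""
    lo, hi = min(ref), max(ref)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference column
    edges = [lo + k * width for k in range(n_bins + 1)]
    p = _proportions(ref, edges)
    q = _proportions(prod, edges)
    psi = sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # KL(ref || prod)
    return psi, kl


def drift_report(ref_cols, prod_cols, threshold=0.1, n_bins=10):
    """Per-feature PSI/KL for dicts mapping column name -> list of values."""
    report = {}
    for name, ref in ref_cols.items():
        psi, kl = feature_drift(ref, prod_cols[name], n_bins)
        report[name] = {"psi": psi, "kl": kl, "drifted": psi > threshold}
    return report


def needs_update(report, ref_perf, prod_perf, kl_thresh=0.2, perf_thresh=0.02):
    """Hypothetical trigger: any feature's KL above kl_thresh, or a performance drop above perf_thresh."""
    max_kl = max(r["kl"] for r in report.values())
    return max_kl > kl_thresh or (ref_perf - prod_perf) > perf_thresh
```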
Model Aging & Sunset Rules
- Track performance over time with `ModelLifecycleTracker.record_performance(timestamp, metric_value, n_samples, **meta)`.
- Compute aging: `ModelLifecycleTracker.compute_aging_score(current_time=...)` → `ModelAgingScore` (age, decay, trend p-value, recommendation).
- Sunset policy: `SunsetRule(max_age_days=..., min_performance=..., max_decay_rate=..., grace_period_days=7)`, applied inside `compute_aging_score`.
- Typical thresholds: `max_age_days=180`, `min_performance=0.80`, and `max_decay_rate=0.005` per day after burn-in.
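The aging logic can be approximated with a least-squares decay rate (metric lost per day) over performance snapshots, plus checks against the age, minimum-performance, and decay thresholds. This is a hypothetical sketch, not `ModelLifecycleTracker`: `SunsetPolicy` only mirrors the fields of `SunsetRule` with the typical thresholds listed above, the trend p-value is omitted, and treating `grace_period_days` as a burn-in window excluded from the decay fit is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SunsetPolicy:  # hypothetical stand-in mirroring SunsetRule's fields
    max_age_days: float = 180
    min_performance: float = 0.80
    max_decay_rate: float = 0.005  # metric units lost per day
    grace_period_days: float = 7   # assumed: burn-in window ignored by the decay fit


def decay_per_day(snapshots, deployed_at, grace_period_days):
    """Least-squares slope of metric vs. days since deployment, negated so decay > 0."""
    pts = [((t - deployed_at).total_seconds() / 86400, m)
           for t, m in snapshots
           if (t - deployed_at).days >= grace_period_days]
    if len(pts) < 2:
        return 0.0
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        return 0.0
    slope = sum((x - mx) * (y - my) for x, y in pts) / sxx
    return -slope


def sunset_check(snapshots, deployed_at, now, rule=None):
    """Return reasons to retrain or retire; an empty list means keep serving."""
    rule = rule or SunsetPolicy()
    reasons = []
    age = (now - deployed_at).days
    if age > rule.max_age_days:
        reasons.append(f"age {age}d > {rule.max_age_days}d")
    latest = max(snapshots, key=lambda s: s[0])[1]  # metric of the newest snapshot
    if latest < rule.min_performance:
        reasons.append(f"performance {latest:.3f} < {rule.min_performance}")
    decay = decay_per_day(snapshots, deployed_at, rule.grace_period_days)
    if decay > rule.max_decay_rate:
        reasons.append(f"decay {decay:.4f}/day > {rule.max_decay_rate}/day")
    return reasons
```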
Champion–Challenger Promotion
- Compare models: `compare_champion_challenger(champion_scores, challenger_scores, test_type="mcnemar"|"paired_ttest"|"wilcoxon")` → significance, confidence interval, and recommendation.
- Promote: `promote_challenger(comparison, champion_path, challenger_path, force=False)` backs up the champion and swaps the artifacts when the comparison approves promotion.
- Interpretation: ✅ PROMOTE when the challenger is significantly better; ❌ REJECT when the challenger underperforms.
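For intuition about the `"mcnemar"` option, here is an exact (binomial) McNemar test on paired per-sample correctness, assuming both models were scored on the same samples and that their scores can be reduced to right/wrong per sample. It is a sketch, not `compare_champion_challenger`; the decision rule (promote only on a significant improvement, otherwise reject) is an assumption, and FoodSpec may label the inconclusive case differently.

```python
from math import comb


def mcnemar_exact(champion_correct, challenger_correct):
    """Exact two-sided McNemar test on paired per-sample correctness (bools)."""
    pairs = list(zip(champion_correct, challenger_correct))
    b = sum(1 for ch, cl in pairs if cl and not ch)  # only the challenger is right
    c = sum(1 for ch, cl in pairs if ch and not cl)  # only the champion is right
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence either way
    # Under H0 the discordant pairs split as Binomial(n, 0.5); double the smaller tail.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def recommend(champion_correct, challenger_correct, alpha=0.05):
    """Sketch policy: promote only on a significant improvement, else reject."""
    b, c, p = mcnemar_exact(champion_correct, challenger_correct)
    return ("PROMOTE" if p < alpha and b > c else "REJECT"), p
```

Only discordant samples (where exactly one model is right) carry information in McNemar's test, which is why identical predictions give p = 1.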
Acting on QC Flags
- Runtime guard: `evaluate_prediction_qc(probs, drift_score=..., ece=...)` returns `qc_do_not_trust` and textual reasons (confidence, entropy, margin, drift, calibration).
- CLI: `foodspec-predict` now emits `qc_do_not_trust` and `qc_notes` columns when probabilities are available; flagged rows also trigger a warning on stderr.
- Example handling:

```python
# probs: predicted class probabilities for one sample; sample_id identifies it.
# route_to_human_review is a placeholder for your own escalation hook.
do_not_trust, qc_warnings = guard_prediction(probs, min_confidence=0.7)
if do_not_trust:
    route_to_human_review(sample_id, qc_warnings)
```

- Suggested policy: block automation when `qc_do_not_trust=True`, log `qc_notes`, and trigger recalibration if a large share of rows is flagged.
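The runtime checks listed for `evaluate_prediction_qc` can be illustrated with the hypothetical `qc_flags` below, which flags one probability vector for low top-class confidence, high normalized entropy, a small top-1/top-2 margin, and optionally high drift or ECE. Only the 0.7 confidence floor (from the example) and the 0.05 ECE level (from the calibration section) come from this guide; the entropy, margin, and drift thresholds are placeholders, and FoodSpec's actual checks may differ.

```python
import math


def qc_flags(probs, min_confidence=0.7, max_norm_entropy=0.8, min_margin=0.2,
             drift_score=None, max_drift=0.2, ece=None, max_ece=0.05):
    """Return (do_not_trust, reasons) for one probability vector (sketch)."""
    reasons = []
    ranked = sorted(probs, reverse=True)
    if ranked[0] < min_confidence:
        reasons.append(f"low confidence {ranked[0]:.2f}")
    if len(probs) > 1:
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        norm = entropy / math.log(len(probs))  # 1.0 means a uniform distribution
        if norm > max_norm_entropy:
            reasons.append(f"high entropy {norm:.2f}")
        margin = ranked[0] - ranked[1]
        if margin < min_margin:
            reasons.append(f"small margin {margin:.2f}")
    if drift_score is not None and drift_score > max_drift:
        reasons.append(f"input drift {drift_score:.2f}")
    if ece is not None and ece > max_ece:
        reasons.append(f"miscalibrated (ECE {ece:.3f})")
    return bool(reasons), reasons
```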
Recommended Workflow
1) Calibrate a candidate model on holdout data using Platt scaling or isotonic regression.
2) Deploy with drift hooks: log predictions, inputs, and outcomes to feed the drift and aging trackers.
3) Schedule drift checks (daily or weekly) with PSI/KL thresholds; trigger recalibration if drift is detected.
4) Track performance snapshots; if the aging score flags decay or the model's age exceeds the sunset rule, schedule retraining or retirement.
5) Run champion–challenger A/B comparisons on fresh batches and promote the challenger when it is statistically better.
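The monitoring steps of this workflow can be tied together in a scheduled job. The sketch below is a hypothetical `next_action` helper that maps the latest signals to an action; the priority order (aging/sunset before drift/calibration) and the default thresholds (PSI 0.1 and ECE 0.05, taken from earlier sections) are assumptions, not FoodSpec behavior.

```python
def next_action(max_psi: float, ece: float, sunset_reasons: list,
                psi_thresh: float = 0.1, ece_thresh: float = 0.05) -> str:
    """Map the latest monitoring signals to a workflow action (hypothetical policy)."""
    if sunset_reasons:
        # Step 4: the aging/sunset rule fired, so recalibration alone is not enough.
        return "retrain_or_retire"
    if max_psi > psi_thresh or ece > ece_thresh:
        # Step 3: input drift or miscalibration detected.
        return "recalibrate"
    # Nothing fired: keep serving; champion-challenger runs (step 5) handle upgrades.
    return "keep_serving"
```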
Reproducibility
- Use `foodspec.report.methods` to auto-generate manuscript-ready methods text.
- Capture reproducibility status with `foodspec.report.checklist` and the journal presets under `foodspec.report.journals`.