Calibration / Regression Example¶
📋 Standard Header¶
Purpose: Build calibration models to predict continuous quality metrics (mixture fractions, degradation scores) from spectra.
When to Use:
- Predict adulterant concentration in blends
- Estimate degradation score (peroxide value, oxidation index)
- Quantify moisture content or impurity levels
- Calibrate spectral features to reference lab measurements
- Validate spectroscopic methods against gold-standard analyses
Inputs:
- Format: HDF5 or CSV with spectra + reference values
- Required metadata: target_value (continuous quality metric)
- Optional metadata: batch, replicate_id, reference_method
- Wavenumber range: Same as classification workflows (600–1800 cm⁻¹)
- Min samples: 50+ with diverse target values (span full quality range)
Outputs:
- calibration_curve.png — Predicted vs true scatter with diagonal
- residual_plot.png — Residuals vs predicted values
- metrics.json — RMSE, MAE, R², MAPE
- model.pkl — Trained PLS or MLP regressor
- report.md — Calibration performance and prediction uncertainty
Assumptions:
- Target values measured accurately (low reference method error)
- Linear or mildly non-linear relationship between spectra and target
- Training samples span full operational range of target values
- No extrapolation beyond training range (predictions within calibrated domain)
🔬 Minimal Reproducible Example (MRE)¶
import numpy as np
import matplotlib.pyplot as plt
from foodspec.chemometrics.models import make_pls_regression, make_mlp_regressor
from foodspec.chemometrics.validation import compute_regression_metrics
from foodspec.viz.regression import plot_calibration_curve, plot_residual_plot
from foodspec.stats import bootstrap_metric
from sklearn.model_selection import cross_val_predict
# Generate synthetic regression data
np.random.seed(42)
n_samples, n_features = 120, 15
X = np.random.normal(0, 1, size=(n_samples, n_features))
true_coefs = np.random.normal(0.4, 0.2, size=n_features)
y = X @ true_coefs + np.random.normal(0, 0.4, size=n_samples)
print(f"Samples: {n_samples}")
print(f"Features: {n_features}")
print(f"Target range: {y.min():.2f} to {y.max():.2f}")
# PLS Regression (linear baseline)
model_pls = make_pls_regression(n_components=5)
model_pls.fit(X, y)
y_pred_pls = cross_val_predict(model_pls, X, y, cv=5)
metrics_pls = compute_regression_metrics(y, y_pred_pls)
print(f"\nPLS Regression Metrics:")
print(f" RMSE: {metrics_pls['rmse']:.3f}")
print(f" MAE: {metrics_pls['mae']:.3f}")
print(f" R²: {metrics_pls['r2']:.3f}")
print(f" MAPE: {metrics_pls['mape']:.1f}%")
# MLP Regression (non-linear option)
model_mlp = make_mlp_regressor(
    hidden_layer_sizes=(64, 32),
    max_iter=400,
    random_state=0
)
model_mlp.fit(X, y)
y_pred_mlp = cross_val_predict(model_mlp, X, y, cv=5)
metrics_mlp = compute_regression_metrics(y, y_pred_mlp)
print(f"\nMLP Regression Metrics:")
print(f" RMSE: {metrics_mlp['rmse']:.3f}")
print(f" MAE: {metrics_mlp['mae']:.3f}")
print(f" R²: {metrics_mlp['r2']:.3f}")
print(f" MAPE: {metrics_mlp['mape']:.1f}%")
# Bootstrap confidence intervals
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

boot_pls = bootstrap_metric(
    rmse,
    y,
    y_pred_pls,
    n_bootstrap=500,
    random_state=0
)
print(f"\nPLS RMSE Bootstrap CI: [{boot_pls['ci'][0]:.3f}, {boot_pls['ci'][1]:.3f}]")
# Plot calibration curve
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# PLS calibration
plot_calibration_curve(y, y_pred_pls, ax=axes[0])
axes[0].set_title(f"PLS Calibration (R²={metrics_pls['r2']:.3f})")
axes[0].set_xlabel("True Value")
axes[0].set_ylabel("Predicted Value")
# Residual plot
plot_residual_plot(y_pred_pls, y - y_pred_pls, ax=axes[1])
axes[1].set_title("Residual Plot")
axes[1].set_xlabel("Predicted Value")
axes[1].set_ylabel("Residual (True - Predicted)")
axes[1].axhline(0, color='red', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig("calibration_example.png", dpi=150, bbox_inches='tight')
print("\nSaved: calibration_example.png")
Expected Output:
Samples: 120
Features: 15
Target range: -2.87 to 3.45
PLS Regression Metrics:
RMSE: 0.432
MAE: 0.342
R²: 0.887
MAPE: 24.3%
MLP Regression Metrics:
RMSE: 0.398
MAE: 0.315
R²: 0.905
MAPE: 22.1%
PLS RMSE Bootstrap CI: [0.385, 0.483]
Saved: calibration_example.png
✅ Validation & Sanity Checks¶
Success Indicators¶
Calibration Curve:
- ✅ Points cluster tightly around diagonal (y = x)
- ✅ No systematic bias (residuals centered at zero)
- ✅ Prediction error uniform across target range (homoscedastic)
Metrics:
- ✅ R² > 0.85 (85% variance explained)
- ✅ RMSE < 10% of target range
- ✅ MAPE < 15% (mean absolute percentage error)
Residuals (see the sketch below):
- ✅ Residuals normally distributed (Q-Q plot linear)
- ✅ No heteroscedasticity (residual variance constant)
- ✅ No outliers beyond 3 SD
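These residual checks can be scripted directly. A minimal sketch continuing from the MRE above (scipy's shapiro and spearmanr are used here as generic statistical tools, not foodspec APIs):
from scipy import stats

residuals = y - y_pred_pls

# Normality of residuals (Shapiro-Wilk); p > 0.05 suggests no strong departure
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")

# Outliers beyond 3 SD of the residual distribution
z_scores = (residuals - residuals.mean()) / residuals.std(ddof=1)
print(f"Residual outliers (>3 SD): {np.where(np.abs(z_scores) > 3)[0].tolist()}")

# Crude heteroscedasticity check: |residual| should not trend with the prediction
rho, rho_p = stats.spearmanr(y_pred_pls, np.abs(residuals))
print(f"Spearman rho(|residual|, prediction) = {rho:.2f} (p = {rho_p:.3f})")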
Failure Indicators¶
⚠️ Warning Signs:
- Calibration curve shows S-shape (non-linearity)
  - Problem: PLS insufficient; relationship non-linear
  - Fix: Try MLP regressor; add polynomial features; check for saturation effects
- R² < 0.70
  - Problem: Poor predictive power; target not correlated with spectra
  - Fix: Check preprocessing; verify target values correct; increase sample size
- Residuals increase with predicted value (cone shape)
  - Problem: Heteroscedasticity; model uncertainty higher at extremes
  - Fix: Transform target (log); use weighted regression; collect more samples at extremes (see the target-transform sketch after this list)
- Large outliers (residuals > 3 SD)
  - Problem: Reference measurement error; sample mislabeling; matrix effects
  - Fix: Investigate outliers; remove if justified; check sample quality
- MLP RMSE >> PLS RMSE (overfitting)
  - Problem: MLP too complex; insufficient regularization
  - Fix: Reduce hidden layer size; increase dropout; use PLS instead
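For the heteroscedasticity fix above, one way to transform the target is scikit-learn's TransformedTargetRegressor wrapped around the PLS model. A minimal sketch, assuming a non-negative target; the synthetic y from the MRE takes negative values, so a hypothetical y_conc stand-in is used here:
from sklearn.compose import TransformedTargetRegressor

# Hypothetical non-negative target (stand-in for a real concentration vector)
y_conc = np.abs(y)

# PLS fitted on log1p(target); predictions are inverse-transformed automatically
model_log = TransformedTargetRegressor(
    regressor=make_pls_regression(n_components=5),
    func=np.log1p,
    inverse_func=np.expm1,
)
y_pred_log = cross_val_predict(model_log, X, y_conc, cv=5)
print(compute_regression_metrics(y_conc, y_pred_log))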
Quality Thresholds¶
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| R² | 0.75 | 0.88 | 0.95 |
| RMSE (% of range) | < 15% | < 8% | < 5% |
| MAPE | < 20% | < 12% | < 8% |
| Residual Normality (Shapiro p) | > 0.05 | > 0.10 | > 0.20 |
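A small sketch of checking these thresholds programmatically, continuing from the MRE above (the "good" cut-offs are taken from the table):
target_range = y.max() - y.min()
rmse_pct = 100 * metrics_pls["rmse"] / target_range

print(f"RMSE: {metrics_pls['rmse']:.3f} ({rmse_pct:.1f}% of target range)")
print(f"R² >= 0.88 (good): {metrics_pls['r2'] >= 0.88}")
print(f"RMSE < 8% of range (good): {rmse_pct < 8}")
print(f"MAPE < 12% (good): {metrics_pls['mape'] < 12}")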
⚙️ Parameters You Must Justify¶
Critical Parameters¶
1. Regression Method
- Parameter: Model type (PLS, Ridge, MLP)
- Default: PLS (linear baseline)
- When to adjust: Use MLP if calibration curve clearly non-linear; always benchmark against PLS
- Justification: "PLS regression (5 components) used as baseline; captures linear relationships while avoiding overfitting."
2. Number of PLS Components
- Parameter: n_components
- Default: 5–10
- When to adjust: Use cross-validation to choose (see the sketch after this list); increase if underfitting
- Justification: "Five PLS components chosen via cross-validation; captures 92% cumulative variance in X."
3. Cross-Validation Strategy
- Parameter: cv (number of folds)
- Default: 5-fold CV
- Justification: "Five-fold cross-validation used to estimate unbiased prediction error."
4. Target Value Range
- Parameter: Min/max of training targets
- Critical: Must report; predictions outside range unreliable
- Justification: "Calibration valid for target values 0.5–5.0 (training range); extrapolation beyond this range not recommended."
5. Bootstrap Iterations
- Parameter: n_bootstrap (for confidence intervals)
- Default: 500
- Justification: "Bootstrap confidence intervals (500 iterations) quantify prediction uncertainty."
Data and setup¶
- Synthetic spectral features are used here for illustration; replace with real ratios/PCs in practice.
- Preprocessing would normally precede this step (baseline, smoothing, normalization).
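A generic sketch of the kind of preprocessing meant here (smoothing plus SNV normalization), using scipy/NumPy rather than any specific foodspec API; raw_spectra is a hypothetical (n_samples, n_wavenumbers) array and baseline correction is omitted:
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
raw_spectra = rng.normal(0, 1, size=(120, 200))  # hypothetical raw spectra

# Savitzky-Golay smoothing along the wavenumber axis
smoothed = savgol_filter(raw_spectra, window_length=11, polyorder=3, axis=1)

# Standard normal variate (SNV) normalization per spectrum
snv = (smoothed - smoothed.mean(axis=1, keepdims=True)) / smoothed.std(axis=1, keepdims=True)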
Code example (PLS regression + metrics + robustness)¶
import numpy as np
from foodspec.chemometrics.models import make_pls_regression
from foodspec.chemometrics.validation import compute_regression_metrics
from foodspec.stats import bootstrap_metric, permutation_test_metric
rng = np.random.default_rng(42)
n_samples, n_features = 120, 15
X = rng.normal(0, 1, size=(n_samples, n_features))
true_coefs = rng.normal(0.4, 0.2, size=n_features)
y = X @ true_coefs + rng.normal(0, 0.4, size=n_samples)
model = make_pls_regression(n_components=5)
model.fit(X, y)
y_pred = model.predict(X).ravel()
metrics = compute_regression_metrics(y, y_pred)
print(metrics) # RMSE, MAE, R^2
# Optional: MLP regression if non-linear bias persists
from foodspec.chemometrics.models import make_mlp_regressor
mlp = make_mlp_regressor(hidden_layer_sizes=(64, 32), max_iter=400, random_state=0)
mlp.fit(X, y)
y_pred_mlp = mlp.predict(X)
mlp_metrics = compute_regression_metrics(y, y_pred_mlp)
print("MLP metrics:", mlp_metrics)
def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))
boot = bootstrap_metric(rmse, y, y_pred, n_bootstrap=500, random_state=0)
perm = permutation_test_metric(rmse, y, y_pred, n_permutations=500, metric_higher_is_better=False, random_state=0)
print("Bootstrap CI:", boot["ci"], "Permutation p-value:", perm["p_value"])

Figure: Predicted vs true values for a PLS regression on synthetic data. Points close to the diagonal indicate good calibration; systematic deviation signals bias. Generated via docs/examples/ml/generate_regression_calibration_figure.py.
Optionally add uncertainty and agreement plots (non-linear/DL models remain optional: use them only with sufficient data and always benchmark against PLS/linear baselines):
from foodspec.viz import plot_calibration_with_ci, plot_bland_altman
ax = plot_calibration_with_ci(y, y_pred)
ax.figure.savefig("calibration_ci.png", dpi=150)
ax = plot_bland_altman(y, y_pred)
ax.figure.savefig("bland_altman.png", dpi=150)
Reporting¶
- Report RMSE/MAE/R² with confidence intervals (bootstrap) and, if needed, permutation p-values for chance-level checks (a metrics.json sketch follows this list).
- Include predicted-vs-true plots and residual diagnostics for transparency.
- Note preprocessing steps, feature choices (ratios/PCs), model settings (components), and validation design.
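A small sketch of writing such a summary to the metrics.json artifact listed under Outputs, reusing objects from the code example above (the key names are illustrative, not a fixed foodspec schema):
import json

report = {
    "rmse": float(metrics["rmse"]),
    "mae": float(metrics["mae"]),
    "r2": float(metrics["r2"]),
    "rmse_bootstrap_ci": [float(v) for v in boot["ci"]],
    "permutation_p_value": float(perm["p_value"]),
    "n_samples": int(n_samples),
    "pls_components": 5,
}
with open("metrics.json", "w") as fh:
    json.dump(report, fh, indent=2)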
Qualitative & quantitative interpretation¶
- Qualitative: Predicted vs true should cluster around the 1:1 line; residuals should be structureless and homoscedastic.
- Quantitative: Report RMSE/MAE/R² (and adjusted R² if multiple predictors; a worked adjusted-R² sketch follows this list); consider bootstrap CIs and permutation checks for small n. Add CI bands on calibration plots (plot_calibration_with_ci) and, when comparing methods, use Bland–Altman to assess agreement (bias, limits). Link to Metrics & evaluation and Hypothesis testing for supporting stats.
- Reviewer phrasing: “Calibration achieved R² = … and RMSE = …; residuals show no trend with fitted values, suggesting adequate model form.”
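A worked sketch of adjusted R² for this example, using the standard formula R²_adj = 1 - (1 - R²)(n - 1)/(n - p - 1) with the number of features as p (with PLS one may instead count latent components):
n, p = X.shape          # 120 samples, 15 features
r2 = metrics["r2"]      # from compute_regression_metrics above

r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R² = {r2:.3f}, adjusted R² = {r2_adj:.3f}")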
When Results Cannot Be Trusted¶
⚠️ Red flags for calibration/regression workflow:
- Extremely high R² (0.99+) on small dataset without independent validation
  - High R² on training data doesn't guarantee generalization
  - Overfitting: model learns noise, not true relationship
  - Fix: Use cross-validation or hold-out test set; bootstrap confidence intervals on R² (see the overfitting/extrapolation sketch after this list)
- Calibration range too narrow (model trained on 10–20% property range, deployed on 0–50% range)
  - Linear relationships valid only in training range
  - Extrapolation beyond training range produces unreliable predictions
  - Fix: Ensure calibration samples span full operational range; mark and test extrapolation regions separately
- Calibration standards from single source (all "low", "medium", "high" from same batch/supplier)
  - Intra-source variability unknown; model may learn supplier-specific patterns
  - Different sources with same property value may have different spectra
  - Fix: Include multiple sources per property level; validate on independent reference materials
- Residuals show systematic trend with fitted values (heteroscedasticity)
  - Violates homogeneity assumption; confidence intervals unreliable
  - May indicate nonlinear relationship or missing variable
  - Fix: Visualize residuals vs fitted; log-transform if variance increases; consider nonlinear model
- No replication in calibration (each standard measured once)
  - Measurement error unquantified; precision of calibration unknown
  - Single outlier can disproportionately influence fit
  - Fix: Measure each standard ≥3 times; report residual SD; use robust regression
- Calibration model not validated on new samples (RMSE only computed on training data)
  - Training metrics optimistic; test set RMSE is ground truth
  - Real-world performance may be worse
  - Fix: Hold out independent test set; cross-validate; measure RMSE on truly new samples
- Reference method uncertainty not considered (assuming reference measurements are error-free)
  - Reference method has uncertainty; can't achieve better precision than reference
  - Model R² inflated if reference error ignored
  - Fix: Quantify reference method error; report measurement uncertainty for calibration samples
- Instrumental drift over calibration period not checked (calibration measured over weeks with no QC)
  - Drift shifts spectral baselines; affects all samples
  - Calibration may fit drift, not true property relationship
  - Fix: Include QC standards throughout calibration; check for time-dependent drift; recalibrate periodically
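A minimal sketch combining two of these checks, the train-vs-CV gap (overfitting) and a flag for predictions outside the calibrated target range (extrapolation), reusing objects from the code example above; the 1.5x gap threshold and the use of the first five rows as "new" samples are illustrative:
from sklearn.model_selection import cross_val_predict

# 1) Overfitting check: training RMSE vs cross-validated RMSE
rmse_train = rmse(y, model.predict(X).ravel())
rmse_cv = rmse(y, np.asarray(cross_val_predict(model, X, y, cv=5)).ravel())
print(f"Training RMSE: {rmse_train:.3f}, CV RMSE: {rmse_cv:.3f}")
if rmse_cv > 1.5 * rmse_train:
    print("Warning: large train/CV gap, possible overfitting")

# 2) Extrapolation flag: predictions outside the calibrated target range
y_new_pred = model.predict(X[:5]).ravel()  # stand-in for predictions on new samples
lo, hi = y.min(), y.max()
n_outside = int(((y_new_pred < lo) | (y_new_pred > hi)).sum())
print(f"Predictions outside calibrated range [{lo:.2f}, {hi:.2f}]: {n_outside}")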