
Metrics & Evaluation

Metrics tell you how well a workflow answers a scientific or QC question. Different food spectroscopy tasks (authentication, adulteration detection, spoilage monitoring, calibration) demand different metrics and plots. This chapter explains what to use, why, and how to interpret results in practice.

For notation and symbols, see the Glossary.

What?

Defines classification, regression, embedding-structure, and agreement metrics; pairs each with recommended plots. Inputs: true/pred labels or continuous targets (plus scores), embeddings, ratios. Outputs: metric dictionaries, plots (confusion matrix, ROC/PR, calibration, residuals, Bland–Altman), and uncertainty (bootstrapped CI).

Why?

Spectral tasks span authentication (imbalanced), calibration (bias/precision), and structure exploration (PCA/t-SNE). Metrics + plots + uncertainty prevent overclaiming and align results with chemistry/QA/QC expectations.

When?

Use when you need quantitative evidence for model performance, group separation, or agreement (model vs lab).
Limitations: small n → prefer CIs/bootstraps; imbalanced classes → prefer PR/F1 over accuracy; embeddings are qualitative unless paired with silhouette/between-within metrics.

Where? (pipeline)

Upstream: preprocessing → features → models/embeddings.
Downstream: stats tests (ANOVA/Games–Howell on features), reporting (metrics.json, plots).

flowchart LR
  A[Preprocess + features] --> B[Model / Embedding]
  B --> C[Metrics + plots + uncertainty]
  C --> D[Reporting / stats tables]

1. Why evaluation metrics matter

  • Accuracy alone can mislead (class imbalance, asymmetric costs).
  • Authentication/adulteration often prioritizes sensitivity/specificity or precision–recall.
  • Regression/calibration needs error magnitudes (RMSE/MAE) and bias checks (calibration plots).
  • Regulatory/QC contexts require transparent, interpretable metrics and diagnostics.

2. Classification metrics (binary & multi-class)

| Metric | Formula / intuition | When to use | Notes |
| --- | --- | --- | --- |
| Accuracy | (TP+TN) / all | Balanced classes | Misleading under imbalance |
| Precision | TP / (TP+FP) | Rare positives; avoiding false alarms | High precision = few false positives |
| Recall (sensitivity) | TP / (TP+FN) | When missing positives is costly | High recall = few false negatives |
| Specificity | TN / (TN+FP) | Avoiding false positives | Pair with sensitivity |
| F1-score | Harmonic mean of precision and recall | Balancing precision and recall | Use macro averaging for per-class balance |
| Balanced accuracy | Mean of per-class recall | Class imbalance | |
| MCC | Correlation of true vs predicted labels | Robust binary summary | Handles imbalance |

Confusion matrix: counts per true/pred class; shows where errors occur.
ROC curve (FPR vs TPR); AUC summarizes ranking ability.
Precision–Recall curve: better for rare positives.

Math box (key classification metrics)
Specificity \( = \frac{TN}{TN+FP} \); Sensitivity/Recall \( = \frac{TP}{TP+FN} \)
MCC \( = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \)
F1 \( = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \)
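
As a sanity check, the formulas above can be evaluated by hand. The confusion-matrix counts below are made up for illustration:

# Hand-computed check of the formulas above on hypothetical counts
# (TP, FP, FN, TN are made-up numbers for illustration).
import math

TP, FP, FN, TN = 42, 5, 8, 45

sensitivity = TP / (TP + FN)  # recall: 42/50 = 0.84
specificity = TN / (TN + FP)  # 45/50 = 0.90
precision = TP / (TP + FP)    # 42/47 ≈ 0.89
f1 = 2 * precision * sensitivity / (precision + sensitivity)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)
print(f"sens={sensitivity:.2f} spec={specificity:.2f} "
      f"prec={precision:.2f} F1={f1:.2f} MCC={mcc:.2f}")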

Food task → best metrics

| Task | Metrics | Reason |
| --- | --- | --- |
| Authentication | Sensitivity + specificity | False negatives and false positives both matter |
| Adulteration (rare) | Precision–recall, F1_macro | Rare-event imbalance |
| Multi-class oil ID | F1_macro, balanced accuracy | Per-class balance |
| Spoilage detection | ROC–AUC + recall | Missing spoiled samples is critical |

Code example

from foodspec.metrics import compute_classification_metrics
from foodspec.viz import plot_confusion_matrix, plot_roc_curve

res = compute_classification_metrics(y_true, y_pred, labels=class_labels, average="macro", y_scores=y_scores)
print(res["accuracy"], res["f1"])
plot_confusion_matrix(res["confusion_matrix"], class_labels)
if "roc_curve" in res:
    fpr, tpr = res["roc_curve"]
    plot_roc_curve(fpr, tpr, res.get("auc"))

Confusion matrix example

Figure: Synthetic confusion matrix illustrating per-class errors and supports.

Metrics selection flowchart

%%{init: {'theme':'neutral', 'themeVariables': { 'primaryColor': '#e8f0ff', 'secondaryColor': '#fbe9e7', 'lineColor': '#1f3044' }}}%%
flowchart LR
    A([❓ Task type]) --> B[Classification]
    A --> C[Regression]

    B --> B1{{Class imbalance?}}
    B1 -->|No| B2[👍 Accuracy + F1 + Confusion Matrix]
    B1 -->|Yes| B3[⚖️ Precision–Recall, F1, ROC–AUC, Confusion Matrix]

    C --> C1{{Calibration focus?}}
    C1 -->|Yes| C2[📈 RMSE, R², Calibration plot, Residuals]
    C1 -->|No| C3[📊 MAE, MAPE, R²]

For practical examples with real models, see Machine Learning & Deep Learning Models.

3. Regression & calibration metrics

| Metric | Intuition | When to use |
| --- | --- | --- |
| RMSE | Penalizes large errors | Calibration, continuous targets |
| MAE | Robust, median-like | When outliers are present |
| R² | Variance explained | Overall fit |
| MAPE | Relative (%) error | Scale-free comparisons; beware near-zero targets |

Diagnostics: predicted vs true, residual plots, calibration (bias/slope).
Use RMSE/MAE for magnitude; R² for proportion explained; MAPE for relative comparisons.

Math box (key regression metrics)
RMSE \( = \sqrt{\frac{1}{n}\sum (y-\hat y)^2} \); MAE \( = \frac{1}{n}\sum |y-\hat y| \)
\( R^2 = 1 - \frac{\sum (y-\hat y)^2}{\sum (y-\bar y)^2} \); adjusted \( R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1} \)
MAPE \( = \frac{100}{n}\sum \left|\frac{y-\hat y}{y}\right| \) (beware near-zero targets)
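
As a sanity check, these formulas translate directly into NumPy; the toy values below are made up for illustration:

# Direct NumPy translation of the formulas above on a tiny made-up
# calibration set (values are illustrative only).
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])      # reference values
y_hat = np.array([2.1, 3.8, 6.3, 7.7])  # predictions

rmse = np.sqrt(np.mean((y - y_hat) ** 2))
mae = np.mean(np.abs(y - y_hat))
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
mape = 100 * np.mean(np.abs((y - y_hat) / y))  # unsafe if y ≈ 0
print(f"RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.4f} MAPE={mape:.1f}%")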

Code example

from foodspec.metrics import compute_regression_metrics
from foodspec.viz import plot_regression_calibration, plot_residuals

res = compute_regression_metrics(y_true, y_pred)
print(res["rmse"], res["r2"])
plot_regression_calibration(y_true, y_pred)
plot_residuals(y_true, y_pred)

4. Statistical metrics and effect sizes

  • Effect size: Cohen’s d (two groups), eta²/partial eta² (ANOVA); a sketch follows this list.
  • Confidence intervals and standard errors contextualize magnitude.
  • ANOVA outputs (F, p) should be paired with effect sizes and post-hoc tests.
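
A minimal sketch of these effect sizes, assuming NumPy/SciPy and two synthetic feature groups (all values hypothetical):

# Cohen's d for two groups and eta-squared alongside a one-way ANOVA F/p.
import numpy as np
from scipy import stats

g1 = np.random.default_rng(0).normal(1.00, 0.10, 30)  # e.g., ratio in class A
g2 = np.random.default_rng(1).normal(1.12, 0.10, 30)  # e.g., ratio in class B

# Cohen's d with pooled standard deviation
pooled_sd = np.sqrt(((len(g1) - 1) * g1.var(ddof=1) +
                     (len(g2) - 1) * g2.var(ddof=1)) /
                    (len(g1) + len(g2) - 2))
d = (g2.mean() - g1.mean()) / pooled_sd

# Eta-squared = SS_between / SS_total, paired with the ANOVA F and p
f_stat, p_val = stats.f_oneway(g1, g2)
grand = np.concatenate([g1, g2]).mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in (g1, g2))
ss_total = ((np.concatenate([g1, g2]) - grand) ** 2).sum()
print(f"d={d:.2f}, eta2={ss_between / ss_total:.2f}, F={f_stat:.1f}, p={p_val:.2g}")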

5. Quantifying structure in embeddings (PCA/t-SNE)

  • Silhouette score: Measures how similar a sample is to its own class vs other classes (range -1 to 1). Higher = better separation.
  • Between/within scatter ratio: Ratio of average distances between class centroids to average within-class variance. >1 suggests separation; ~1 suggests overlap.
  • F-like statistic & permutation p-value: Between/within is analogous to an ANOVA F on the embedding. A permutation p-value (p_perm) tests whether the observed ratio is larger than random labelings.

Code example

from foodspec.metrics import (
    compute_embedding_silhouette,
    compute_between_within_ratio,
    compute_between_within_stats,
)

# Assume scores from PCA or t-SNE and integer labels
sil = compute_embedding_silhouette(scores[:, :2], labels)
bw = compute_between_within_ratio(scores[:, :2], labels)
stats = compute_between_within_stats(scores[:, :2], labels, n_permutations=200, random_state=0)
print(f"silhouette={sil:.2f}, bw={bw:.2f}, f={stats['f_stat']:.2f}, p_perm={stats['p_perm']:.3f}")

When to use: pair these metrics with PCA/t-SNE plots to avoid over-interpreting visuals; a low silhouette or a high p_perm suggests weak separation. For reproducible visuals (PCA scores/loadings and t-SNE), run python docs/examples/visualization/generate_embedding_figures.py, which saves pca_scores.png, pca_loadings.png, and tsne_scores.png from synthetic spectra.
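
For intuition, a hand-rolled version of these quantities is sketched below using scikit-learn's silhouette_score and a simple centroid-based between/within ratio. This is an illustrative approximation, not necessarily foodspec's exact implementation:

# scores and labels as in the example above (embedding coordinates + classes)
import numpy as np
from sklearn.metrics import silhouette_score

def between_within_ratio(X, labels):
    labels = np.asarray(labels)
    centroids = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
    grand = X.mean(axis=0)
    # Spread of class centroids around the grand mean...
    between = np.mean([np.linalg.norm(c - grand) ** 2 for c in centroids.values()])
    # ...relative to the spread of samples around their own centroid
    within = np.mean([np.linalg.norm(x - centroids[l]) ** 2
                      for x, l in zip(X, labels)])
    return between / within

sil = silhouette_score(scores[:, :2], labels)
print(f"silhouette={sil:.2f}, bw={between_within_ratio(scores[:, :2], labels):.2f}")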

6. Robustness (bootstrap/permutation)

from foodspec.stats.robustness import bootstrap_metric
from foodspec.metrics import compute_classification_metrics

def f1_metric(y_true, y_pred):
    return compute_classification_metrics(y_true, y_pred)["f1"]

obs, ci = bootstrap_metric(f1_metric, y_true, y_pred, n_bootstrap=200, random_state=0)
print(\"Observed F1\", obs, \"CI\", ci)
Bootstraps (or permutation tests) provide uncertainty around metrics; use them when sample sizes are small or decisions are high-stakes.
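
For intuition, a percentile-bootstrap CI can be sketched in plain NumPy; foodspec's bootstrap_metric may differ in details such as stratification:

# Minimal percentile-bootstrap sketch (illustrative, not foodspec's code).
import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, n_bootstrap=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    samples = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        samples.append(metric_fn(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return metric_fn(y_true, y_pred), (lo, hi)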

Bootstrap confidence intervals for metrics

from foodspec.metrics import bootstrap_metric_ci, compute_regression_metrics

rmse_fn = lambda yt, yp: compute_regression_metrics(yt, yp)["rmse"]
ci = bootstrap_metric_ci(y_true, y_pred, rmse_fn, n_bootstrap=500, alpha=0.05, random_state=0)
print(ci["metric"], ci["ci_low"], ci["ci_high"])
Use for accuracy/F1 or regression errors; CIs help communicate uncertainty around performance.

Agreement plots (Bland–Altman)

When comparing methods (e.g., model vs lab reference), use a Bland–Altman plot to show bias and limits of agreement:

from foodspec.viz import plot_bland_altman
ax = plot_bland_altman(y_true, y_pred)

Interpret bias (mean difference) and limits of agreement (mean ± 1.96 SD). Systematic trends in difference vs mean indicate scale/offset issues.
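
If the plotting helper is unavailable, the same quantities can be computed directly; a minimal NumPy/Matplotlib sketch:

# Hand-rolled Bland-Altman sketch (illustrative alternative to foodspec.viz).
import numpy as np
import matplotlib.pyplot as plt

diff = np.asarray(y_pred) - np.asarray(y_true)
mean = (np.asarray(y_pred) + np.asarray(y_true)) / 2
bias, sd = diff.mean(), diff.std(ddof=1)  # bias and spread of differences

plt.scatter(mean, diff, s=12)
for y_line in (bias, bias - 1.96 * sd, bias + 1.96 * sd):  # bias + limits
    plt.axhline(y_line, linestyle="--")
plt.xlabel("Mean of methods")
plt.ylabel("Difference (model - lab)")
plt.show()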

7. Visualization & diagnostic plots

  • Confusion matrix: where classification fails; interpret per-class errors.
  • ROC curve/AUC: ranking ability; use for balanced/varied thresholds.
  • Precision–Recall curve: rare positives/adulteration (see the sketch after this list).
  • Calibration curve & residuals: regression bias and spread.
  • Correlation heatmap: associations among ratios/PCs/quality metrics.
  • PCA scores/loadings: structure and band importance (contextual).
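
For the rare-positive case, a PR curve can also be drawn directly with scikit-learn; y_true and y_scores are the binary labels and classifier scores from the earlier examples:

# Minimal scikit-learn sketch of a precision-recall curve for a rare
# positive class; y_scores are continuous classifier scores.
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

prec, rec, _ = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)  # area under the PR curve

plt.plot(rec, prec)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"PR curve (AP={ap:.2f})")
plt.show()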

8. Metrics in workflows

  • Oil authentication: F1_macro, balanced accuracy; confusion matrix; ROC if binary; ANOVA/Tukey on key ratios.
  • Adulteration (rare): precision–recall, F1, PR curve; confusion matrix with class counts.
  • Heating degradation (regression/time): RMSE/MAE/R²; correlation of ratios vs time; residual plots.
  • Mixtures/calibration: RMSE/MAE/R²/MAPE; predicted vs true; residuals; optional PR for thresholded tasks.
  • Batch QC: confusion matrix for in-/out-of-distribution labels; specificity/sensitivity for QC thresholds. For practical model choices and troubleshooting, see ML & DL models and Common problems & solutions.

9. Practical guidance

  • Always report class supports and per-class metrics (see the sketch after this list).
  • Use PR curves for rare events; ROC for balanced classification.
  • Pair p-values with effect sizes; pair metrics with uncertainty (CI/bootstrap).
  • Show diagnostic plots alongside headline numbers.
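
For per-class metrics with supports, scikit-learn's classification_report is a convenient complement to foodspec's metric dictionaries:

# Per-class precision/recall/F1 with supports in one call.
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=class_labels))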

10. Group-difference validation

When comparing spectral features (ratios/peaks) across groups, complement metrics with statistical tests:

  • MANOVA to test multivariate group differences across several ratios/PCs.
  • Games–Howell post-hoc after ANOVA/Kruskal when variances/group sizes differ; safer for many spectroscopic datasets.

See ANOVA & MANOVA for theory and code.
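
A minimal sketch of such tests, assuming the pingouin package and a tidy DataFrame df with a 'ratio' feature and a 'group' label (both hypothetical names):

# Welch ANOVA + Games-Howell post-hoc via pingouin (an assumption; foodspec's
# own stats helpers may wrap this differently).
import pingouin as pg

aov = pg.welch_anova(data=df, dv="ratio", between="group")
posthoc = pg.pairwise_gameshowell(data=df, dv="ratio", between="group")
print(aov[["F", "p-unc"]])
print(posthoc[["A", "B", "diff", "pval"]])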


When Results Cannot Be Trusted

⚠️ Red flags for metrics and evaluation:

  1. High accuracy on imbalanced data without per-class checks (e.g., 90% accuracy from always predicting the 90% majority class)
     • Accuracy is misleading on imbalanced tasks; F1, precision, recall, or balanced accuracy are more informative.
     • Fix: report the confusion matrix and per-class precision/recall; use weighted/balanced metrics.

  2. Metrics reported without uncertainty (F1 = 0.85 with no confidence interval or SD)
     • Point estimates hide fold-to-fold variability and make model differences impossible to assess.
     • Fix: report mean ± SD across folds; compute bootstrap CIs; visualize per-fold metrics.

  3. Single metric used in isolation (only accuracy, ignoring confusion matrix or calibration)
     • Each metric highlights different aspects; a single metric can miss important failures.
     • Fix: report multiple metrics (accuracy + F1 + AUC + confusion matrix) and discuss trade-offs.

  4. Test-set metrics inflated by leakage (preprocessing computed on the full data before the train/test split)
     • Data leakage inflates metrics, making performance estimates unreliable; preprocessing belongs inside the CV loop.
     • Fix: use a Pipeline; compute statistics only on training data; audit for leakage sources (see the sketch after this list).

  5. Threshold-sensitive metrics reported without the threshold (AUC without the classification threshold used)
     • Binary classification metrics depend on the threshold choice, and the default (0.5) may not be optimal.
     • Fix: tune the threshold on a validation set; report the operating point; plot ROC/PR curves.

  6. Model comparison without statistical testing (Model A 0.90, Model B 0.91 → "B is better")
     • Small metric differences may not be significant; variability across folds is needed to judge.
     • Fix: perform a paired t-test or Wilcoxon test on fold-level metrics; report the p-value.

  7. Metrics computed on the same data used for hyperparameter tuning (test set used for tuning)
     • Tuning on the test set overfits and inflates metrics.
     • Fix: use nested CV (inner loop: tune, outer loop: evaluate); never tune on the test set.

  8. Regression metrics (R², RMSE) without residual plots (high R² but systematic residual bias)
     • R² does not detect heteroscedasticity or non-linearity; residual plots reveal such failures.
     • Fix: always plot residuals vs fitted values; check for trends; investigate heteroscedasticity.
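
A leakage-safe evaluation sketch with scikit-learn (the scaler and classifier are illustrative choices; X and y are the feature matrix and labels):

# Scaling is fit inside each CV training fold, never on the full dataset.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro")
print(f"F1_macro: {scores.mean():.2f} ± {scores.std():.2f}")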

See also