Classification and Regression

Purpose: Train supervised models (classifiers, regressors) on spectral features for authentication, QC, and calibration tasks.

Audience: Chemometricians, data scientists, researchers building food spectroscopy models.

Time: 20–30 minutes to read; reference during model selection.

Prerequisites: Familiarity with scikit-learn; understanding of cross-validation, metrics, and overfitting.


When to Use Each Model Type

For notation, see the Glossary; for bands and ratios, see Feature extraction; for metrics, see Metrics & Evaluation.

Decision Guide: Which Model to Use?

| Your goal | Use this model | Why |
| --- | --- | --- |
| Authenticate oils (binary: real vs. fake) | SVM (RBF) or Random Forest | Handles nonlinear boundaries; robust to noise |
| Classify multiple oil types (multiclass) | Random Forest or PLS-DA | RF scales well; PLS-DA is interpretable |
| Predict mixture composition (continuous) | PLS regression or SVR | Designed for correlated features |
| Quick baseline (simple data) | Logistic regression or kNN | Fast, interpretable |
| Imbalanced data (rare adulteration) | XGBoost + class weights | Class weights counter rare positives |

What?

Defines model families for classification (LogReg, SVM, RF, kNN, PLS-DA) and regression/calibration (PLS, SVR), with inputs (preprocessed features/ratios/PCs) and outputs (predictions, probabilities/scores, calibration curves, metrics).

Why?

Authentication/adulteration, QC/novelty, and calibration require models that handle correlated bands and small-to-medium n. Supervised models turn spectral variation into decisions or calibrated estimates, and must be paired with suitable metrics/plots to avoid overclaiming.

When?

Use:
- Linear/PLS-DA: modest n, interpretable boundaries, ratios/PCs roughly linear.
- RBF SVM/RF: nonlinear boundaries, richer feature sets.
- PLS/linear/SVR: continuous targets, mixtures, property prediction.
Limitations:
- Small n: risk of overfitting—use CV + CIs.
- Imbalance: accuracy can mislead—use macro F1/PR.
- Always standardize consistently; fitting scalers on the full dataset rather than within each fold causes leakage (see the sketch below).
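A minimal leakage-safe sketch with plain scikit-learn (synthetic stand-ins for X_feat/y_labels): the scaler sits inside a Pipeline, so cross-validation refits it on each training fold only.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(60, 10))      # synthetic stand-in for spectral features
y_labels = rng.integers(0, 2, size=60)  # synthetic binary labels

# Correct pattern: the scaler is refit on each training fold only,
# so no test-fold statistics leak into training.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma=0.1))
scores = cross_val_score(pipe, X_feat, y_labels, cv=5, scoring="f1_macro")
print(f"macro F1: {scores.mean():.2f} ± {scores.std():.2f}")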

Where? (pipeline)

Upstream: baseline/smoothing/normalization → optional derivatives → features/ratios/PCs.
Model: classifier/regressor.
Downstream: metrics (F1/AUC/RMSE/R² + CIs), plots (confusion, ROC/PR, calibration/residual), stats tests on key ratios (ANOVA/Games–Howell), reporting.

flowchart LR
  A[Preprocess + features] --> B[Classifier / Regressor]
  B --> C[Metrics + plots + stats]
  C --> D[Interpretation & reporting]
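As one concrete (assumed) arrangement of these stages in plain scikit-learn, with scaling and PCA standing in for the upstream preprocessing and feature steps:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.default_rng(1).normal(size=(40, 200))  # toy "spectra": 40 samples x 200 bands
y = np.r_[np.zeros(20, dtype=int), np.ones(20, dtype=int)]

model = Pipeline([
    ("scale", StandardScaler()),   # consistent standardization
    ("pca", PCA(n_components=5)),  # optional dimensionality reduction
    ("clf", SVC(kernel="rbf")),    # classifier stage
])
model.fit(X, y)  # downstream: metrics + plots + stats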

Model families (at a glance)

  • Linear / PLS-DA: interpretable coefficients/loadings; good for smaller, near-linear problems.
  • SVM (linear/RBF): max-margin; RBF handles curved boundaries (tune C, gamma).
  • Random Forest / Ensembles: nonlinear, feature importances; robust to noisy predictors.
  • Boosted trees (GradientBoosting / XGBoost / LightGBM): strong tabular performance; handle interactions and imbalance; require tuning (learning rate, number of trees).
  • kNN: simple baseline; sensitive to scaling/imbalance.
  • PLS regression / SVR: calibration and property prediction; pair with calibration plots + residuals.
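A quick baseline sweep over some of these families with plain scikit-learn (synthetic data; PLS-DA is omitted because scikit-learn has no direct implementation):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.default_rng(2).normal(size=(80, 20))
y = np.random.default_rng(3).integers(0, 3, size=80)  # three synthetic classes

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(kernel="rbf"),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, est in candidates.items():
    pipe = make_pipeline(StandardScaler(), est)  # scale inside CV to avoid leakage
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: {scores.mean():.2f} ± {scores.std():.2f}")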

Metrics and plots (pair visuals with numbers)

  • Classification: F1_macro, balanced accuracy, confusion matrix; ROC/PR for imbalance (see plot_confusion_matrix, plot_roc_curve).
  • Regression/calibration: RMSE/MAE/R²/Adjusted R², predicted vs true, residuals; plot_calibration_with_ci for confidence bands, plot_bland_altman for agreement.
  • Embeddings: silhouette, between/within F-like stats with permutation p_perm alongside PCA/t-SNE visuals.
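For illustration, the core numbers can be computed with scikit-learn's metric functions (toy arrays below, not real spectra):

import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, mean_squared_error, r2_score)

# Classification: macro F1, balanced accuracy, confusion matrix
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(f1_score(y_true, y_pred, average="macro"))
print(balanced_accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# Regression: RMSE and R²
y_cont = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
rmse = mean_squared_error(y_cont, y_hat) ** 0.5
print(rmse, r2_score(y_cont, y_hat))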

Examples

Classification (SVM)

from foodspec.chemometrics.models import make_classifier
from foodspec.chemometrics.validation import cross_validate_pipeline
from foodspec.viz import plot_confusion_matrix

# RBF-kernel SVM; C and gamma control margin softness and kernel width
clf = make_classifier("svm_rbf", C=10.0, gamma=0.1)
# 5-fold cross-validation scored by macro F1; results include a confusion matrix
cv_res = cross_validate_pipeline(clf, X_feat, y_labels, cv_splits=5, scoring="f1_macro")
plot_confusion_matrix(cv_res["confusion_matrix"], labels=class_labels)
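If continuous scores are available, ROC/PR summaries can be added. A sketch with plain scikit-learn, assuming the same binary X_feat/y_labels and an equivalent SVM pipeline:

from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Out-of-fold decision scores (continuous), not hard labels
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma=0.1))
scores = cross_val_predict(pipe, X_feat, y_labels, cv=5, method="decision_function")
print("ROC AUC:", roc_auc_score(y_labels, scores))
print("PR AUC:", average_precision_score(y_labels, scores))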

Regression / calibration (PLS)

from foodspec.chemometrics.models import make_pls_regression
from foodspec.metrics import compute_regression_metrics
from foodspec.viz import plot_regression_calibration, plot_residuals

pls = make_pls_regression(n_components=3)  # 3 latent variables
pls.fit(X_feat, y_cont)
# Note: predicting on the training data gives the apparent (optimistic) fit;
# use cross-validation for honest error estimates
y_pred = pls.predict(X_feat).ravel()
metrics = compute_regression_metrics(y_cont, y_pred)
plot_regression_calibration(y_cont, y_pred)  # add CI with plot_calibration_with_ci if desired
plot_residuals(y_cont, y_pred)
Figure: regression calibration plot (predicted vs. true values).
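Because the example above predicts on its own training data, the number of components is best chosen by cross-validation. A sketch with scikit-learn's PLSRegression (assumes X_feat/y_cont as above, with at least 7 features):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Sweep n_components and compare cross-validated RMSE
for k in range(1, 8):
    mse = -cross_val_score(PLSRegression(n_components=k), X_feat, y_cont,
                           cv=5, scoring="neg_mean_squared_error").mean()
    print(f"n_components={k}: CV RMSE = {np.sqrt(mse):.3f}")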

Practical notes for food spectra

  • Imbalance: use macro metrics, class weights, and PR curves for rare positives.
  • Scaling: apply the same scaling/derivative steps per fold; avoid leakage.
  • Interpretation: map coefficients/loadings/importances back to bands (unsaturation, carbonyl) and report ANOVA/Games–Howell on key ratios when relevant.
  • Validate: stratified CV; report class supports and CIs; inspect residuals for bias or structure (see the sketch below).
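A minimal sketch of the imbalance-aware pattern with plain scikit-learn (synthetic data; class_weight="balanced" reweights classes inversely to their frequency):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.default_rng(4).normal(size=(100, 12))
y = np.r_[np.ones(10, dtype=int), np.zeros(90, dtype=int)]  # ~10% rare positives

pipe = make_pipeline(StandardScaler(),
                     SVC(kernel="rbf", class_weight="balanced"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios per fold
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print("per-fold macro F1:", np.round(scores, 2))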

Typical plots (with metrics)

  • Confusion matrix + F1/accuracy/supports.
  • ROC/PR + AUC (especially for rare-event adulteration).
  • Predicted vs true + residuals + RMSE/R² for calibration (see the matplotlib sketch after this list).
  • Calibration curve with CI; Bland–Altman for agreement.
  • Feature importances/loadings to link decisions to wavenumbers.
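Where the foodspec plot helpers are not available, a generic matplotlib sketch of the calibration pair (predicted vs. true plus residuals, with toy numbers):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy reference values
y_pred = np.array([1.2, 1.8, 3.1, 4.3, 4.7])  # toy predictions
rmse = mean_squared_error(y_true, y_pred) ** 0.5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3.5))
ax1.scatter(y_true, y_pred)
ax1.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], "k--")  # identity line
ax1.set(xlabel="True", ylabel="Predicted",
        title=f"RMSE={rmse:.2f}, R²={r2_score(y_true, y_pred):.2f}")
ax2.scatter(y_pred, y_true - y_pred)  # residuals vs predicted
ax2.axhline(0, color="k", linestyle="--")
ax2.set(xlabel="Predicted", ylabel="Residual", title="Residuals")
plt.tight_layout()
plt.show()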

Reproducible figure generation

  • Classification: PCA + SVM on example oils, then plot_confusion_matrix (ROC/PR if scores available).
  • Regression: PLS on example mixtures; generate calibration/residual plots (plot_regression_calibration, plot_calibration_with_ci, plot_residuals).
  • Agreement: plot_bland_altman when comparing model vs lab measurements.

Summary

  • Match model complexity to data size/linearity; prefer interpretable models when performance is similar.
  • Combine visuals with metrics + uncertainty; avoid leakage; handle imbalance explicitly.
  • Tie outputs back to chemistry (bands/ratios) and support claims with stats tests on key features.

When Results Cannot Be Trusted

⚠️ Red flags for classification and regression model reliability:

  1. Perfect or near-perfect accuracy (>98%) without an independent test set
     • Too-good results suggest overfitting, data leakage, or class-separation artifacts.
     • Training-set accuracy is inflated; held-out performance is the ground truth.
     • Fix: always use a held-out test set or cross-validation; validate on truly independent data.

  2. Data leakage (preprocessing statistics computed on the entire dataset, or the same sample in train and test)
     • Information from the test set bleeds into training, producing inflated metrics.
     • Leakage can be subtle (e.g., baseline correction fit on all data before the train/test split).
     • Fix: include preprocessing inside the CV loop; use a Pipeline to prevent leakage; manually inspect train/test splits.

  3. Severe class imbalance (95% class A, 5% class B) with no special handling
     • Imbalanced data can produce misleading accuracy (a classifier that always predicts the majority class achieves 95% accuracy).
     • F1, precision, recall, and AUC are more informative than accuracy for imbalanced tasks.
     • Fix: stratify CV splits; use weighted loss or resampling; report the confusion matrix and per-class metrics.

  4. High feature dimensionality with a small sample size (1000 bands, 50 samples) and no dimensionality reduction
     • p >> n (more features than samples) causes overfitting and unstable models.
     • The model has enough parameters to memorize the training data.
     • Fix: use PCA or feature selection before modeling; keep p < n or use regularization (L1/L2).

  5. Regression with large residuals in certain regions left unaddressed
     • Heteroscedasticity (variance increasing with the predicted value) violates model assumptions.
     • It may indicate a nonlinear relationship or a missing important feature.
     • Fix: plot predicted vs. residuals; check for heteroscedasticity; consider a log-transform or a nonlinear model.

  6. Feature importance read off the same model that was fit to the training data
     • Importance from an overfit model is unreliable; top features may be noise specific to that model.
     • Such importances do not generalize to independent data.
     • Fix: use permutation importance on a test set, or report importances with uncertainty (e.g., across CV folds); see the sketch after this list.

  7. No attention to batch effects (training on one analyzer, deploying on another)
     • Analyzer drift, calibration differences, or instrument variation can dominate the learned patterns.
     • The model may not transfer across instruments without retraining.
     • Fix: use batch-aware CV; test on data from different batches/instruments; include batch as a feature.

  8. Unstable cross-validation metrics (fold 1 F1 = 0.95, fold 5 F1 = 0.50)
     • High fold-to-fold variability suggests the model is sensitive to data splits or to outliers.
     • The reported average metric masks this instability.
     • Fix: visualize per-fold metrics; check for outliers; increase sample size; use stratified CV.
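A sketch of the permutation-importance fix from point 6, using scikit-learn on synthetic data: importances are estimated on a held-out split rather than read off the fitted model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X = np.random.default_rng(6).normal(size=(120, 15))    # synthetic features
y = np.random.default_rng(7).integers(0, 2, size=120)  # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Importance measured on held-out data, with repeat-to-repeat uncertainty
res = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
for i in np.argsort(res.importances_mean)[::-1][:5]:
    print(f"feature {i}: {res.importances_mean[i]:.3f} ± {res.importances_std[i]:.3f}")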
See also

  • Model evaluation & validation
  • Metrics & evaluation
  • Hypothesis testing
  • Workflow design & reporting