Classification and Regression¶
Purpose: Train supervised models (classifiers, regressors) on spectral features for authentication, QC, and calibration tasks.
Audience: Chemometricians, data scientists, researchers building food spectroscopy models.
Time: 20–30 minutes to read; reference during model selection.
Prerequisites: Familiarity with scikit-learn; understanding of cross-validation, metrics, and overfitting.
When to Use Each Model Type¶
For notation see the Glossary. For bands/ratios, see Feature extraction. Metrics: Metrics & Evaluation.
Decision Tree: Which Model to Use?¶
| Your goal | Use this model | Why |
|---|---|---|
| Authenticate oils (binary: real vs. fake) | SVM (RBF) or Random Forest | Handles nonlinear boundaries; robust to noise |
| Classify multiple oil types (multiclass) | Random Forest or PLS-DA | RF scalable; PLS-DA interpretable |
| Predict mixture composition (continuous) | PLS regression or SVR | Designed for correlated features |
| Quick baseline (simple data) | Logistic regression or kNN | Fast, interpretable |
| Imbalanced data (rare adulteration) | XGBoost + class weights | Class weights and boosting counteract imbalance |
What?¶
Defines model families for classification (LogReg, SVM, RF, kNN, PLS-DA) and regression/calibration (PLS, SVR), with inputs (preprocessed features/ratios/PCs) and outputs (predictions, probabilities/scores, calibration curves, metrics).
Why?¶
Authentication/adulteration, QC/novelty, and calibration require models that handle correlated bands and small-to-medium n. Supervised models turn spectral variation into decisions or calibrated estimates, and must be paired with suitable metrics/plots to avoid overclaiming.
When?¶
Use:
- Linear/PLS-DA: modest n, interpretable boundaries, ratios/PCs roughly linear.
- RBF SVM/RF: nonlinear boundaries, richer feature sets.
- PLS/linear/SVR: continuous targets, mixtures, property prediction.
Limitations:
- Small n: high overfitting risk; use cross-validation and report confidence intervals.
- Imbalance: accuracy can mislead; use macro F1 and PR curves.
- Scaling: fit scaling/derivative steps inside each CV fold; fitting them on the full dataset before splitting leaks information (see the sketch below).
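A minimal sketch of leak-free scaling using a plain scikit-learn Pipeline on synthetic data (shapes and hyperparameters are illustrative): because the scaler lives inside the pipeline, it is re-fit on each training fold only, so no statistics from validation folds leak into preprocessing.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))    # e.g., 60 samples x 12 band ratios (synthetic)
y = rng.integers(0, 2, size=60)  # binary labels (authentic vs. adulterated)

pipe = Pipeline([
    ("scale", StandardScaler()),  # fit on each training fold, applied to that fold's validation split
    ("clf", SVC(kernel="rbf", C=10.0, gamma=0.1)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print(f"F1_macro: {scores.mean():.3f} ± {scores.std():.3f}")
```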
Where? (pipeline)¶
Upstream: baseline/smoothing/normalization → optional derivatives → features/ratios/PCs.
Model: classifier/regressor.
Downstream: metrics (F1/AUC/RMSE/R² + CIs), plots (confusion, ROC/PR, calibration/residual), stats tests on key ratios (ANOVA/Games–Howell), reporting.
```mermaid
flowchart LR
  A[Preprocess + features] --> B[Classifier / Regressor]
  B --> C[Metrics + plots + stats]
  C --> D[Interpretation & reporting]
```
Model families (at a glance)¶
- Linear / PLS-DA: interpretable coefficients/loadings; good for smaller, near-linear problems.
- SVM (linear/RBF): max-margin; RBF handles curved boundaries (tune C, gamma).
- Random Forest / Ensembles: nonlinear, feature importances; robust to noisy predictors.
- Boosted trees (GradientBoosting / XGBoost / LightGBM): strong tabular performance; handle interactions and imbalance; require tuning (learning rate, trees).
- kNN: simple baseline; sensitive to scaling/imbalance.
- PLS regression / SVR: calibration and property prediction; pair with calibration plots + residuals.
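For reference, a minimal sketch of how these families map onto plain scikit-learn constructors; the hyperparameter values are illustrative starting points, not recommendations.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_decomposition import PLSRegression

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(kernel="rbf", C=10.0, gamma="scale", probability=True),
    "rf": RandomForestClassifier(n_estimators=300, random_state=0),
    "gboost": GradientBoostingClassifier(learning_rate=0.05, n_estimators=200),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "pls": PLSRegression(n_components=3),  # calibration; PLS-DA = PLS on dummy-coded labels
    "svr": SVR(kernel="rbf", C=10.0),
}
```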
Metrics and plots (pair visuals with numbers)¶
- Classification: F1_macro, balanced accuracy, confusion matrix; ROC/PR for imbalance (see `plot_confusion_matrix`, `plot_roc_curve`).
- Regression/calibration: RMSE/MAE/R²/adjusted R², predicted vs. true, residuals; `plot_calibration_with_ci` for confidence bands, `plot_bland_altman` for agreement.
- Embeddings: silhouette, between/within F-like statistics with permutation `p_perm`, alongside PCA/t-SNE visuals.
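A minimal sketch pairing numbers with those plots using `sklearn.metrics`; the small arrays are synthetic stand-ins for model outputs.

```python
import numpy as np
from sklearn.metrics import (f1_score, balanced_accuracy_score, roc_auc_score,
                             confusion_matrix, mean_squared_error,
                             mean_absolute_error, r2_score)

# classification: predicted labels and class-1 scores (stand-ins)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.2, 0.6, 0.9, 0.8, 0.4, 0.1])

print(f1_score(y_true, y_pred, average="macro"))
print(balanced_accuracy_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))  # needs scores/probabilities, not labels
print(confusion_matrix(y_true, y_pred))

# regression / calibration (stand-ins)
y_cont = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(np.sqrt(mean_squared_error(y_cont, y_hat)))  # RMSE
print(mean_absolute_error(y_cont, y_hat))
print(r2_score(y_cont, y_hat))
```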
Examples¶
Classification (SVM)¶
```python
from foodspec.chemometrics.models import make_classifier
from foodspec.chemometrics.validation import cross_validate_pipeline
from foodspec.viz import plot_confusion_matrix

# RBF-kernel SVM; C and gamma are starting points, not tuned values
clf = make_classifier("svm_rbf", C=10.0, gamma=0.1)

# 5-fold cross-validation with macro-F1 scoring on precomputed features/labels
cv_res = cross_validate_pipeline(clf, X_feat, y_labels, cv_splits=5, scoring="f1_macro")
plot_confusion_matrix(cv_res["confusion_matrix"], labels=class_labels)
```
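For comparison, a rough plain-scikit-learn equivalent of the same workflow (a sketch: it assumes the same `X_feat`, `y_labels`, and `class_labels` as above, and builds the confusion matrix from out-of-fold predictions).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", SVC(kernel="rbf", C=10.0, gamma=0.1))])
y_hat = cross_val_predict(pipe, X_feat, y_labels, cv=5)  # out-of-fold predictions
cm = confusion_matrix(y_labels, y_hat)
ConfusionMatrixDisplay(cm, display_labels=class_labels).plot()
```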
Regression / calibration (PLS)¶
```python
from foodspec.chemometrics.models import make_pls_regression
from foodspec.metrics import compute_regression_metrics
from foodspec.viz import plot_regression_calibration, plot_residuals

# PLS with 3 latent variables; choose n_components by cross-validation in practice
pls = make_pls_regression(n_components=3)
pls.fit(X_feat, y_cont)
y_pred = pls.predict(X_feat).ravel()  # note: in-sample predictions (training data)

metrics = compute_regression_metrics(y_cont, y_pred)
plot_regression_calibration(y_cont, y_pred)  # add CI with plot_calibration_with_ci if desired
plot_residuals(y_cont, y_pred)
```
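The example above predicts on its own training data, so its metrics are optimistic. A minimal sketch of cross-validated predictions for an honest RMSECV, using plain scikit-learn with the same `X_feat` and `y_cont`:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

pls = PLSRegression(n_components=3)
y_cv = cross_val_predict(pls, X_feat, y_cont, cv=5).ravel()  # out-of-fold predictions
print("RMSECV:", np.sqrt(mean_squared_error(y_cont, y_cv)))
print("R2_CV:", r2_score(y_cont, y_cv))
```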
Practical notes for food spectra¶
- Imbalance: use macro metrics, class weights, and PR curves for rare positives.
- Scaling: apply the same scaling/derivative steps per fold; avoid leakage.
- Interpretation: map coefficients/loadings/importances back to bands (unsaturation, carbonyl) and report ANOVA/Games–Howell on key ratios when relevant (see the sketch after this list).
- Validate: stratified CV; report supports and CIs; inspect residuals for bias or structure.
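For the interpretation point above, a minimal sketch that ranks PLS regression coefficients by magnitude and maps them back to a wavenumber axis; `wavenumbers` is an assumed array aligned with the columns of `X_feat`, and `y_cont` is the calibration target from the example.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=3).fit(X_feat, y_cont)
coefs = pls.coef_.ravel()                   # one coefficient per input feature
top = np.argsort(np.abs(coefs))[::-1][:10]  # ten most influential features
for i in top:
    print(f"{wavenumbers[i]:.0f} cm^-1  coef={coefs[i]:+.3f}")
```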
Typical plots (with metrics)¶
- Confusion matrix + F1/accuracy/supports.
- ROC/PR + AUC (especially for rare-event adulteration).
- Predicted vs true + residuals + RMSE/R² (calibration).
- Calibration curve with CI; Bland–Altman for agreement.
- Feature importances/loadings to link decisions to wavenumbers.
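A minimal sketch for the ROC/PR items using scikit-learn's display helpers; the arrays are stand-ins, and in practice you would pass cross-validated scores.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.8, 0.7, 0.4, 0.9, 0.2])

RocCurveDisplay.from_predictions(y_true, y_score)        # plots ROC, reports AUC
PrecisionRecallDisplay.from_predictions(y_true, y_score) # better view for rare positives
plt.show()
```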
Reproducible figure generation¶
- Classification: PCA + SVM on example oils, then `plot_confusion_matrix` (ROC/PR if scores are available).
- Regression: PLS on example mixtures; generate calibration/residual plots (`plot_regression_calibration`, `plot_calibration_with_ci`, `plot_residuals`).
- Agreement: `plot_bland_altman` when comparing model predictions against lab measurements.
Summary¶
- Match model complexity to data size/linearity; prefer interpretable models when performance is similar.
- Combine visuals with metrics + uncertainty; avoid leakage; handle imbalance explicitly.
- Tie outputs back to chemistry (bands/ratios) and support claims with stats tests on key features.
When Results Cannot Be Trusted¶
⚠️ Red flags for classification and regression model reliability:
- **Perfect or near-perfect accuracy (>98%) without an independent test set**
    - Too-good results suggest overfitting, data leakage, or class-separation artifacts.
    - Accuracy on training data is inflated; held-out test data is the ground truth.
    - Fix: always use a held-out test set or cross-validation; validate on truly independent data.
- **Data leakage (preprocessing statistics computed from the entire dataset, or the same sample in train and test)**
    - Information from the test set bleeds into training, producing inflated metrics.
    - Leakage can be subtle (e.g., baseline correction fit on all data before the train/test split).
    - Fix: include preprocessing inside the CV loop; use a Pipeline to prevent leakage; manually inspect train/test splits.
- **Severe class imbalance (95% class A, 5% class B) with no special handling**
    - Imbalanced data can produce misleading accuracy (a classifier that always predicts the majority class achieves 95% accuracy).
    - F1, precision, recall, or AUC are more informative than accuracy for imbalanced tasks.
    - Fix: stratify CV splits; use weighted loss or resampling; report the confusion matrix and per-class metrics.
- **High feature dimensionality with small sample size (e.g., 1000 bands, 50 samples) and no dimensionality reduction**
    - p >> n (more features than samples) causes overfitting and unstable models.
    - The model has enough parameters to memorize the training data.
    - Fix: use PCA or feature selection before modeling; keep p < n or use regularization (L1/L2).
- **Regression with large residuals in certain regions left unaddressed**
    - Heteroscedasticity (variance increasing with the predicted value) violates model assumptions.
    - It may indicate a nonlinear relationship or a missing important feature.
    - Fix: plot predicted vs. residuals; check for heteroscedasticity; consider a log transform or a nonlinear model.
- **Feature importance taken from the same model used for training (an overfitted feature list)**
    - Importance from an overfit model is unreliable; its top features may be noise specific to that model.
    - Such importance rankings do not generalize to independent data.
    - Fix: use permutation importance on a test set, or report importance with uncertainty (e.g., from cross-validation); see the sketch after this list.
- **No attention to batch effects (training on one analyzer, deploying on another)**
    - Analyzer drift, calibration differences, or instrument variation can dominate the learned patterns.
    - The model may not transfer across instruments without retraining.
    - Fix: use batch-aware CV; test on data from different batches/instruments; include batch as a feature.
- **Unstable cross-validation metrics (fold 1 F1 = 0.95, fold 5 F1 = 0.50)**
    - High fold-to-fold variability suggests the model is sensitive to data splits or to outliers.
    - The reported average metric masks this instability.
    - Fix: visualize per-fold metrics; check for outliers; increase sample size; use stratified CV.
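For the permutation-importance fix above, a minimal sketch using plain scikit-learn on synthetic data: importance is measured on held-out data and reported with a spread across repeats, rather than as a single in-sample ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))    # synthetic features
y = rng.integers(0, 2, size=120)  # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# permute each feature on the held-out set and measure the drop in F1_macro
result = permutation_importance(model, X_te, y_te, n_repeats=20,
                                scoring="f1_macro", random_state=0)
for i in np.argsort(result.importances_mean)[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}")
```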
See also¶
- Model evaluation & validation
- Metrics & evaluation
- Hypothesis testing
- Workflow design & reporting