Theory – Chemometrics & ML Basics

Purpose: Understand core dimensionality reduction, classification, and validation methods used in FoodSpec.
Audience: Users building ML models or interpreting multivariate analysis results.
Time to read: 12–15 minutes.
Prerequisites: Familiarity with spreadsheets and basic statistics helpful.


This page summarizes core concepts underpinning FoodSpec analyses. For worked examples, see Tutorials (02) and Cookbook (03).

Core methods

  • PCA: unsupervised dimensionality reduction; reveals structure and clusters, and supports clustering metrics (silhouette, adjusted Rand index/ARI).
  • PLS/PLS-DA: regression/classification with latent variables; not always needed for simple ratio sets, but common in spectroscopy.
  • Classification: logistic regression (often with L1 regularization for minimal panels) and random forests for nonlinear feature importance; evaluate with balanced accuracy and confusion matrices.
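
The combination above can be sketched with scikit-learn on synthetic data (a minimal illustration, not FoodSpec's own API; the array shapes and class shift are made up for the example):

```python
# Sketch: PCA for unsupervised structure, then an L1-penalized logistic
# regression that favors a minimal feature panel. Synthetic "spectra" only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))          # 60 samples x 200 spectral features
y = np.repeat([0, 1], 30)               # two classes
X[y == 1] += 0.5                        # inject a class-dependent shift

# Unsupervised: project to a few components and inspect explained variance
pca = PCA(n_components=5).fit(X)
scores = pca.transform(X)
print(pca.explained_variance_ratio_)

# Supervised: the L1 penalty zeroes out most coefficients, leaving a panel
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_selected = int((clf.coef_ != 0).sum())
print("features retained:", n_selected)
```

The retained-feature count is what "minimal panel" refers to: the L1 penalty drives most coefficients to exactly zero.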

Why cross-validation matters

  • Prevents optimistic bias; estimates generalization performance.
  • Batch-aware or group-aware splits avoid leakage across instruments/batches.
  • Nested CV supports feature selection/hyperparameter tuning without reusing test folds.
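
A batch-aware split can be sketched with scikit-learn's `GroupKFold` (the batch labels here are invented for illustration):

```python
# Sketch: group-aware CV so that no measurement batch appears in both the
# train and test folds of any split (prevents batch leakage).
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 20))
y = rng.integers(0, 2, size=90)
batches = np.repeat(np.arange(6), 15)   # 6 measurement batches, 15 samples each

cv = GroupKFold(n_splits=3)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, groups=batches, scoring="balanced_accuracy")
print(scores)  # one balanced-accuracy score per fold

# Verify the guarantee: train and test batches never overlap
for train_idx, test_idx in cv.split(X, y, groups=batches):
    assert set(batches[train_idx]).isdisjoint(batches[test_idx])
```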

Scaling/normalization

  • Standardization and ratiometric features stabilize intensity variations; see preprocessing recipes and RQ theory for why specific ratios are used.
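
One common pitfall is fitting the scaler on all data before splitting. A pipeline keeps scaling leakage-free (a generic scikit-learn sketch, not FoodSpec's preprocessing code):

```python
# Sketch: scaling inside a Pipeline, so scaler statistics are fit on the
# training split only and applied identically to the test split.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(loc=1000.0, scale=50.0, size=(80, 10))  # raw intensities
y = rng.integers(0, 2, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)          # scaler mean/std come from X_tr only
acc = model.score(X_te, y_te)  # test data is scaled with X_tr statistics
print("test accuracy:", acc)
```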

See also: cookbook_validation.md for applied examples.

How FoodSpec uses these:

  • PCA/MDS visualizations and clustering metrics in RQ outputs.
  • Classification (LR/RF) for discrimination and minimal panels.
  • Cross-validation strategies (batch-aware/nested) for honest performance estimates.


When Results Cannot Be Trusted

⚠️ Red flags when applying chemometrics and ML methods:

  1. High-dimensional data (p >> n) without regularization or dimensionality reduction
     • Overfitting is almost guaranteed; the model memorizes noise.
     • Coefficients are unstable and generalization is poor.
     • Fix: use PCA before classification; use regularized models (ridge, lasso); enforce p < n or use cross-validation.

  2. Eigenvalues/variance explained not examined (using PCs without checking explained variance)
     • A PC explaining <1% of variance (e.g., PC5) is likely noise; including it overfits.
     • Cumulative variance plateaus; components beyond the plateau add noise.
     • Fix: plot a scree plot; use a cumulative-variance rule (e.g., 95%) to choose n_components.

  3. Collinearity among features not detected (using all spectral bands as predictors without checking VIF)
     • Correlated features inflate coefficients and reduce stability.
     • The model is unstable to small perturbations of the data.
     • Fix: compute VIF; remove or group collinear features; use PCA or regularization.

  4. Linear model assumptions not checked before using linear methods (e.g., using PLS on non-linear data)
     • Linear methods assume linear relationships; non-linear data requires non-linear models.
     • Predictions are biased; confidence intervals are unreliable.
     • Fix: visualize feature–target relationships; use non-linear models if the relationships are non-linear.

  5. Class imbalance ignored in classification (95% class A, 5% class B, standard classifier applied)
     • The classifier is biased toward the majority class; the minority class is ignored.
     • Standard metrics (accuracy) are misleading; minority-class F1 can be near zero.
     • Fix: use stratified CV; apply class weights; report per-class metrics.

  6. Distance metrics not matched to data type (e.g., Euclidean distance on compositional data)
     • Euclidean distance assumes continuous, non-compositional data.
     • The result can be misleading clustering.
     • Fix: use compositional distances (Aitchison, log-ratio) or metrics appropriate to the data type.

  7. Scaling applied inconsistently (training set scaled, test set not, or vice versa)
     • Feature magnitudes mismatch between train and test.
     • The model produces wrong predictions on unscaled test data.
     • Fix: always apply the same scaling to train and test; include scaling in the pipeline.

  8. Latent factors interpreted causally ("PC1 is oxidation" without validation)
     • PCs are mathematical combinations; they may not correspond to chemistry.
     • Interpretation requires independent validation.
     • Fix: cross-check PC loadings against known chemistry; validate interpreted factors with targeted measurements.
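
Two of the fixes above (stratified CV and class weights with per-class metrics) can be sketched on a deliberately imbalanced synthetic set (the 95/5 split and class shift are invented for illustration):

```python
# Sketch: stratified CV preserves the 95/5 class ratio in every fold;
# class_weight="balanced" counteracts the majority-class bias.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, balanced_accuracy_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = np.array([0] * 190 + [1] * 10)      # 95% / 5% imbalance
X[y == 1] += 1.5                        # make the minority class separable

cv = StratifiedKFold(n_splits=5)        # every fold keeps the 95/5 ratio
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
y_pred = cross_val_predict(clf, X, y, cv=cv)

bal_acc = balanced_accuracy_score(y, y_pred)
print(classification_report(y, y_pred))  # per-class precision/recall/F1
print("balanced accuracy:", bal_acc)
```

Reporting the per-class breakdown, rather than plain accuracy, is what exposes a near-zero minority-class F1.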

Next Steps