ML & Chemometrics: Mixture Models and Fingerprinting¶
Compositional analysis decomposes mixtures into fractions of known or unknown components, while fingerprinting compares spectra for QC or search. This page follows the WHAT/WHY/WHEN/WHERE template.
For notation see the Glossary. For plots and metrics see Metrics & Evaluation and Visualization.
What?¶
Defines NNLS (non-negative least squares) for single mixtures with known references, MCR-ALS for unsupervised mixtures, and fingerprint similarity (cosine/correlation). Inputs: preprocessed spectra, reference spectra, or libraries. Outputs: fractions/coefficients, reconstructed spectra, similarity scores, and metrics (RMSE/R²).
Why?¶
Linear mixtures of food components (oils, adulterants, moisture) can be estimated physically when non-negativity is enforced. Fingerprinting supports QC/search by comparing against libraries.
When?¶
Use:
- NNLS: known pure/reference spectra, want non-negative fractions per sample.
- MCR-ALS: multiple mixtures, components unknown/partially known.
- Fingerprinting: QC/search against libraries.
Limitations: assumes linear mixing and aligned preprocessing; scatter/scale issues must be minimized; MCR-ALS can be sensitive to initialization.
Where? (pipeline)¶
Upstream: consistent preprocessing/cropping/normalization for mixtures and references.
Model: NNLS or MCR-ALS; fingerprint similarity optional for QC.
Downstream: reconstruction plots, residual analysis, RMSE/R², stats on ratios/coefficients.
flowchart LR
A[Preprocess refs + mixtures] --> B[NNLS / MCR-ALS]
B --> C[Fractions + reconstruction]
C --> D[Metrics (RMSE/R²) + plots]
D --> E[Reporting / stats]
NNLS math & interpretation¶
Given reference spectra matrix (A \in \mathbb{R}^{m\times n}) (columns = pure components, rows = wavenumbers) and mixture (y \in \mathbb{R}^m), solve
[
\min_{x} |A x - y|_2^2 \quad \text{s.t. } x \ge 0.
]
- (A): pure/reference spectra (e.g., EVOO, sunflower).
- (x): non-negative fractions/coefficients.
- (y): observed mixture.
Non-negativity enforces physical interpretability. Assumes linear mixing and matched preprocessing.
Minimal code example (NNLS)¶
import numpy as np
from foodspec.chemometrics.mixture import nnls_mixture
coeffs, resid = nnls_mixture(y, A) # y: (n_points,), A: (n_points, n_components)
fractions = coeffs / coeffs.sum()
Visuals + metrics¶
- Plot observed mixture y, reconstructed A @ x̂, and residual y - A @ x̂.
- Good fit: close overlay, residual without structure; quantify with RMSE/R² (see metrics chapter).
- Reproducible figure: run
to savepython docs/examples/visualization/generate_mixture_nnls_figures.pydocs/assets/nnls_overlay.pnganddocs/assets/nnls_residual.pngusing synthetic references. Use example oils if desired by swapping in real references.
MCR-ALS (outline)¶
- Factorize mixtures matrix (\mathbf{X} \approx \mathbf{C}\mathbf{S}^\top) iteratively with non-negativity.
- Returns concentrations (\mathbf{C}) and estimated pure-like spectra (\mathbf{S}).
- Monitor convergence, enforce non-negativity, and compare reconstructed X to data (RMSE, residual structure).
Fingerprinting¶
- Cosine/correlation similarities for QC/search against libraries.
- Plot heatmaps or top-k matches; thresholds should be validated per application.
Typical plots (with metrics)¶
- Mixture overlay + residual (report RMSE/R²).
- Coefficient/fraction bar plots.
- Similarity heatmaps for fingerprint search.
- Optional: residual distribution to spot systematic misfit.
Practical guidance¶
- Align wavenumbers and preprocessing between mixtures and references.
- Normalize or scatter-correct before NNLS to reduce scale effects.
- Start MCR-ALS with sensible initial guesses; check for rotations/scale indeterminacy.
- Pair visuals with metrics (RMSE/R²) and, if comparing groups, use stats tests on coefficients/ratios (ANOVA/Games–Howell).
When Results Cannot Be Trusted¶
⚠️ Red flags for mixture decomposition validity:
- Reference spectra mismatched to real mixtures (using pure oil standards to decompose adulterated oils with unexpected components)
- NNLS/MCR-ALS estimates are only valid if reference set spans true composition space
- Missing components force solution to fit residuals with existing references, producing wrong fractions
-
Fix: Ensure reference library includes all expected components; validate against known mixtures
-
Degenerate solutions (fractions sum to 1.0 with near-zero negative values, suggesting numerical instability)
- Ill-conditioning (nearly collinear references) allows multiple valid solutions
- Small perturbations in data yield different solutions
-
Fix: Check condition number of reference matrix; validate solution stability with bootstrap or cross-validation
-
Estimated fractions outside [0, 1] (negative fraction or >100% total) reported without warning
- NNLS constrains to non-negative, but if total doesn't equal 1.0, assumes unaccounted component
- Unconstrained solving (OLS) may produce negative fractions, indicating poor fit or leakage
-
Fix: Use constrained solver (NNLS); check sum-to-1 constraint; validate on known mixtures
-
Rotational ambiguity in MCR-ALS unresolved (multiple equivalent solutions with same fit, different spectra)
- MCR can have rank deficiency → multiple (A, S) pairs fit data equally well
- Reported spectra may not be true pure component spectra
-
Fix: Test for rotational ambiguity; use constraints (non-negativity, bounds) to enforce unique solution; validate spectra chemically
-
Preprocessing choices not disclosed or not matched to references
- If samples preprocessed differently from references, NNLS produces biased fractions
- Example: baseline correction on samples but not references → fractions shift
-
Fix: Preprocess samples and references identically; freeze preprocessing before decomposition
-
No validation on known mixtures (deploying model without testing on mixtures of known composition)
- NNLS/MCR can produce chemically plausible fractions even on synthetic data
- Only validation on truly known mixtures confirms model works
-
Fix: Test on lab-prepared mixtures of known composition; report agreement (RMSE, R²) against true fractions
-
Small number of references (2 pure components) used to estimate many mixture fractions
- With only 2 references and many wavenumbers, system is typically underdetermined; many solutions fit
- Tiny errors in spectra or preprocessing cause large fraction changes
-
Fix: Increase reference library; use more wavenumber regions; apply regularization
-
Visualization doesn't match quantitative metrics (visual inspection suggests good fit, but RMSE is high)
- Residual plots can be misleading; always pair with numeric metrics
- High RMSE despite visually acceptable fit may indicate systematic bias (e.g., baseline residual)
- Fix: Report both visual residuals and RMSE/R²; visualize all mixtures, not just representative ones
- Document reference provenance; mismatched references yield biased fractions.