Workflow: Mixture Analysis¶

📋 Standard Header¶

Purpose: Estimate component fractions in mixtures (e.g., olive oil adulterated with sunflower oil) using spectral unmixing.

When to Use: - Quantify adulteration level in food products - Determine blending ratios in multi-component samples - Monitor process streams for composition drift - Validate supplier blend specifications - Estimate impurity levels when pure references available

Inputs: - Format: HDF5 or CSV with mixture spectra + (optional) pure component spectra - Required metadata: mixture_id; optional: true fractions (for validation) - Pure spectra: One spectrum per component (if using NNLS) - Wavenumber range: Must align across mixtures and pure spectra (typically 600–1800 cm⁻¹) - Min samples: 10+ mixtures; 1 pure spectrum per component (if using NNLS)

Outputs: - fractions.csv — Estimated component fractions for each mixture - pred_vs_true.png — Scatter plot (if ground truth available) - residual_plot.png — Reconstruction error per mixture - spectral_reconstruction.png — Observed vs reconstructed spectra - report.md — RMSE, R², correlation for each component

Assumptions: - Spectra are linear combinations of pure components (Beer-Lambert holds) - Pure spectra representative of components in mixtures (no matrix effects) - Wavenumbers aligned (no shifts between mixtures and pure spectra) - No interactions between components (no chemical reactions in mixtures)

🔬 Minimal Reproducible Example (MRE)¶

Option A: NNLS with Pure References¶

import numpy as np
import matplotlib.pyplot as plt
from foodspec.chemometrics.mixture import run_mixture_analysis_workflow
from foodspec.viz.mixture import plot_pred_vs_true
from examples.mixture_analysis_quickstart import _synthetic_mixtures

# Generate synthetic mixtures with known fractions
mixtures, pure_spectra, true_coeffs = _synthetic_mixtures(n_mixtures=30)
print(f"Mixtures: {mixtures.x.shape[0]} spectra")
print(f"Pure components: {pure_spectra.x.shape[0]}")
print(f"True fractions shape: {true_coeffs.shape}")

# Run NNLS workflow
result = run_mixture_analysis_workflow(
    mixtures=mixtures.x,
    pure_spectra=pure_spectra.x,
    mode="nnls"  # Non-negative least squares
)

# Extract results
pred_fractions = result["coefficients"]
residual_norms = result["residual_norms"]

# Evaluate if ground truth available
from sklearn.metrics import r2_score, mean_squared_error
for i, comp in enumerate(['Component A', 'Component B']):
    r2 = r2_score(true_coeffs[:, i], pred_fractions[:, i])
    rmse = np.sqrt(mean_squared_error(true_coeffs[:, i], pred_fractions[:, i]))
    print(f"{comp}: R²={r2:.3f}, RMSE={rmse:.3f}")

# Plot predicted vs true
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for i, (ax, comp) in enumerate(zip(axes, ['Component A', 'Component B'])):
    plot_pred_vs_true(
        true_coeffs[:, i],
        pred_fractions[:, i],
        ax=ax
    )
    ax.set_title(comp)
    ax.set_xlabel("True Fraction")
    ax.set_ylabel("Predicted Fraction")
plt.tight_layout()
plt.savefig("mixture_pred_vs_true.png", dpi=150, bbox_inches='tight')
print("Saved: mixture_pred_vs_true.png")

Expected Output:

Mixtures: 30 spectra
Pure components: 2
True fractions shape: (30, 2)

Component A: R²=0.952, RMSE=0.042
Component B: R²=0.948, RMSE=0.045

Saved: mixture_pred_vs_true.png

Option B: MCR-ALS without Pure References¶

from foodspec.chemometrics.mcr import run_mcr_als

# Same mixture data, but don't use pure spectra
mixtures, _, true_coeffs = _synthetic_mixtures(n_mixtures=30)

# Run MCR-ALS (alternating least squares)
result_mcr = run_mcr_als(
    mixtures.x,
    n_components=2,  # Must specify number of components
    max_iter=100,
    tol=1e-4
)

pred_fractions_mcr = result_mcr["coefficients"]
recovered_spectra = result_mcr["components"]

print(f"MCR-ALS converged in {result_mcr['n_iter']} iterations")
print(f"Final residual: {result_mcr['residual']:.4f}")

# Note: MCR-ALS may recover components in different order or with scaling
# Alignment to true fractions requires post-processing

✅ Validation & Sanity Checks¶

Success Indicators¶

Predicted vs True (if ground truth available): - ✅ R² > 0.90 for each component - ✅ RMSE < 0.05 (5% error in fraction estimation) - ✅ Points cluster tightly around diagonal (y = x line)

Residual Analysis: - ✅ Residual norms small and uniform across mixtures - ✅ No systematic patterns in residual spectra (random noise only) - ✅ Reconstructed spectra visually match observed spectra

Fractions Sum to 1: - ✅ Sum of predicted fractions ≈ 1.0 for each mixture (within 0.05) - ✅ All fractions ≥ 0 (non-negativity constraint satisfied) - ✅ No fractions > 1.0 (physically impossible)

Failure Indicators¶

⚠️ Warning Signs:

R² < 0.70 or RMSE > 0.15
Problem: Poor prediction; model not capturing mixture composition
Fix: Check wavenumber alignment; verify pure spectra quality; increase spectral resolution
Predicted fractions sum >> 1.0 or << 1.0
Problem: Scaling issue; normalization mismatch; missing component
Fix: Renormalize pure spectra; check if mixture contains unlabeled component
Negative fractions (NNLS should prevent this)
Problem: Algorithm not converging; numerical instability
Fix: Increase iteration limit; check for collinear pure spectra; try regularization
High residuals for specific mixtures (outliers)
Problem: Mixture contains impurity; pure spectra not representative; spectral shift
Fix: Remove outlier mixtures; check for temperature/pH effects; verify sample quality
MCR-ALS components don't match pure spectra
Problem: Component order ambiguous; scaling arbitrary; local minimum
Fix: Use better initialization; increase iterations; compare spectral features manually

Quality Thresholds¶

Metric	Minimum	Good	Excellent
R² (per component)	0.80	0.92	0.98
RMSE (fraction)	< 0.10	< 0.05	< 0.02
Residual Norm	< 5%	< 2%	< 0.5%
Fraction Sum Deviation	< 0.10	< 0.05	< 0.02

⚙️ Parameters You Must Justify¶

Critical Parameters¶

1. Unmixing Method - Parameter: mode ("nnls" or "mcr_als") - Default: "nnls" if pure spectra available - When to adjust: Use MCR-ALS if pure spectra unknown or unavailable - Justification: "Non-negative least squares (NNLS) was used to estimate component fractions, as pure spectra were available and Beer-Lambert linearity assumed."

2. Number of Components (MCR-ALS only) - Parameter: n_components - No default: Must specify - When to adjust: Based on prior knowledge or exploratory PCA (scree plot) - Justification: "Two components were specified based on known binary mixture composition (olive + sunflower oils)."

3. Spectral Alignment - Parameter: Wavenumber range, interpolation - Critical: Must align mixtures and pure spectra - Justification: "All spectra were interpolated to a common wavenumber grid (600–1800 cm⁻¹, 1 cm⁻¹ resolution) to ensure Beer-Lambert additivity."

4. Normalization - Parameter: Method (area, max, L2) applied to pure spectra - Default: Area normalization (unit area under curve) - When to adjust: Use L2 if intensity scaling consistent; use max if peak heights comparable - Justification: "Pure spectra were area-normalized to unit integral, ensuring fractions represent volume/mass ratios."

5. Convergence Criteria (MCR-ALS) - Parameter: tol (tolerance), max_iter - Default: tol=1e-4, max_iter=100 - When to adjust: Increase max_iter if not converging; tighten tol if precision needed - Justification: "MCR-ALS was terminated when relative change in residual < 1e-4 or 100 iterations reached."

flowchart LR
  subgraph Data
    A[Raw mixtures] --> A2[Optional pure refs]
  end
  subgraph Preprocess
    B[Baseline + smoothing + norm + align]
  end
  subgraph Features
    C[Stay in spectra or optional PCA/ratios]
  end
  subgraph Model/Stats
    D[NNLS (with refs) or MCR-ALS]
    E[Metrics: RMSE, R²; residuals]
  end
  subgraph Report
    F[Pred vs true + residual overlays + report.md]
  end
  A --> B --> C --> D --> E --> F
  A2 --> D

1. Problem and dataset¶

Why labs care: Quantify adulteration level; determine blending ratios; monitor process streams.
Inputs: Mixture spectra; optional pure/reference spectra. Ground truth fractions if available for evaluation.
Typical size: Dozens of mixtures; references for each component if using NNLS.

2. Pipeline (default)¶

Preprocessing: Baseline → smoothing → normalization; ensure wavenumbers align across mixtures/pure spectra.
Methods:
NNLS: If pure spectra known. Solve ( \min | x - S c |_2 ) s.t. ( c \ge 0 ).
MCR-ALS: If pure spectra unknown; alternating least squares with non-negativity; requires n_components.
Outputs: Coefficients/fractions, residual norms, relative reconstruction error.

3. Python example (synthetic)¶

from foodspec.chemometrics.mixture import run_mixture_analysis_workflow
from examples.mixture_analysis_quickstart import _synthetic_mixtures

mix, pure, true_coeffs = _synthetic_mixtures()
res = run_mixture_analysis_workflow(mixtures=mix.x, pure_spectra=pure.x, mode="nnls")
print("Coefficients:\\n", res["coefficients"])
print("Residual norms:", res["residual_norms"])

4. CLI example (with config)¶

Create examples/configs/mixture_quickstart.yml:

mixture_hdf5: libraries/mixtures.h5
pure_hdf5: libraries/pure_refs.h5
mode: nnls

Run:

foodspec mixture --config examples/configs/mixture_quickstart.yml --output-dir runs/mixture_demo

Outputs: coefficients CSV, residuals, optional reconstruction plots.

5. Interpretation¶

Compare predicted vs true fractions (if known); report RMSE/MAE and R².
Inspect residuals; large or structured residuals may indicate missing components or misalignment.
Main figure: predicted vs true scatter; Supplement: residual plots, spectra overlays.

Qualitative & quantitative interpretation¶

Qualitative: Overlay observed mixture vs NNLS reconstruction; residual should look like noise, not structured bands. Predicted vs true fractions plot should follow 1:1 line.
Quantitative: Report reconstruction RMSE/R²; residual norm from NNLS; optional permutation p_perm on between/within separation if visualizing embeddings of fractions/components. Link to Metrics and Stats for regression diagnostics.
Reviewer phrasing: “NNLS reconstruction overlays the observed spectrum with RMSE = …; predicted fractions track true values (R² = …); residuals show no systematic misfit.”

Summary¶

Use NNLS when pure references are available; otherwise MCR-ALS with chosen n_components.
Align wavenumbers and preprocess consistently before solving.
Report fractions, residuals, and assumptions clearly.

Statistical analysis¶

Why: Assess how well predicted fractions align with truth; test differences across mixtures if grouped.

Example (correlation/regression on predicted vs true):

from foodspec.chemometrics.mixture import run_mixture_analysis_workflow
from examples.mixture_analysis_quickstart import _synthetic_mixtures
import numpy as np
import pandas as pd
from foodspec.stats import compute_correlations

mix, pure, true_coeffs = _synthetic_mixtures()
res = run_mixture_analysis_workflow(mixtures=mix.x, pure_spectra=pure.x, mode="nnls")
pred = res["coefficients"][:, 0]  # fraction of first component
df = pd.DataFrame({"pred": pred, "true": true_coeffs[:, 0]})
corr = compute_correlations(df, ("pred", "true"), method="pearson")
print(corr)

Interpretation: High correlation (and low residual norms) indicates accurate mixture estimation. Report RMSE/MAE and consider ANOVA if comparing multiple pipelines.

When Results Cannot Be Trusted¶

⚠️ Red flags for mixture analysis workflow:

Pure component references not validated (assuming reference spectra are true pure components)
Impure references (contamination, oxidation, partial adulteration) bias fraction estimates
NNLS/MCR produces wrong fractions if references don't span true composition
Fix: Validate references independently (HPLC, mass spec); document purity
Mixture fractions estimated without checking constraints (fractions sum to >100% or include negative values)
Unconstrained solving produces chemically implausible solutions
Indicates ill-conditioning, missing components, or data errors
Fix: Use constrained NNLS; enforce sum-to-1 constraint; investigate violations
Model tested only on synthetic mixtures or known reference blends (not real food samples)
Real samples have components not in reference library
Forcing solution into limited reference set produces biased fractions
Fix: Test on real food samples; compare estimates to orthogonal method (HPLC, GC-MS); report agreement
Number of components not determined independently (assuming n_components = n_references without validation)
Rank deficiency can hide true component number
Assuming fixed components may miss unexpected contaminants
Fix: Estimate rank from data (SVD, scree plot); validate with independent chemical analysis
Preprocessing applied to mixture differently than references (sample has baseline correction, references don't)
Spectral mismatch produces biased fractions
NNLS cannot compensate for preprocessing inconsistency
Fix: Preprocess samples and references identically; freeze parameters before mixture analysis
Mixture proportions highly variable across replicates without investigation
High fold-to-fold variability (e.g., 30% ± 20%) indicates method instability
May reflect preprocessing sensitivity or noise
Fix: Investigate variability sources; test preprocessing robustness; increase replication/averaging
No detection limit study (claiming method can measure 1% admixture without validation)
Method uncertainty depends on noise, reference quality, and mixture complexity
Claimed precision may exceed actual detectability
Fix: Test on mixtures near suspected limit; report limit of detection/quantitation; validate with orthogonal methods
Residuals not visualized or checked (only RMSE reported, no residual spectrum inspection)
Systematic residuals indicate model failure or missing components
Residual spectra reveal what the mixture model can't explain
Fix: Plot residuals vs wavenumber; inspect for systematic patterns; investigate large residuals