Decision Guide: Choosing the Right Approach

Purpose: Navigate FoodSpec's methods, workflows, and APIs based on your research goals.

This guide helps you choose the appropriate analysis path by asking: "What am I trying to do?" Each decision leads to specific methods, working examples, and API references.


🎯 Quick Decision Tree

flowchart TD
    Start[What is your goal?] --> Goal{Goal Type}

    Goal -->|Identify/Classify| Class[Classification]
    Goal -->|Quantify| Quant[Quantification]
    Goal -->|Monitor Change| Monitor[Temporal Analysis]
    Goal -->|Compare Instruments| Harm[Harmonization]
    Goal -->|Clean Data| Prep[Preprocessing]

    Class --> ClassType{Known Classes?}
    ClassType -->|Yes, 2+ groups| Auth[Authentication/Discrimination]
    ClassType -->|Unknown patterns| Explore[Exploratory Analysis]

    Quant --> QuantType{What to measure?}
    QuantType -->|Component %| Mix[Mixture Analysis]
    QuantType -->|Continuous property| Reg[Regression]

    Monitor --> MonType{What changes?}
    MonType -->|Quality degradation| Heat[Heating/Aging]
    MonType -->|Batch consistency| QC[Quality Control]

    Harm --> HarmType{Data Type}
    HarmType -->|Different instruments| Calib[Calibration Transfer]
    HarmType -->|Different matrices| Matrix[Matrix Correction]

๐Ÿ” Goal-Based Navigation

1. Classification & Discrimination

1.1 Authenticate or Detect Adulteration

When: You need to distinguish genuine samples from adulterated ones or identify product origin.

Decision factors:

- Small dataset (<100 samples): Use PLS-DA with cross-validation
- Large dataset (>1000 samples): Consider deep learning or ensemble methods
- Interpretability required: Use ratio-based features or VIP scores
- Black-box acceptable: Neural networks or random forests

Matrix considerations:

- Pure oils: Standard preprocessing → classification
- Complex matrices (chips, meat): Add scatter correction (e.g., MSC) before classification

→ Method: Classification & Regression
→ Example: Oil Authentication
→ API: ML & Validation

Typical workflow:

from foodspec import FoodSpec

# Load and preprocess
fs = FoodSpec.from_csv("oils.csv", modality="raman")
fs = fs.baseline_als().normalize_snv()

# Classify
result = fs.classify(
    label_column="oil_type",
    model="pls-da",
    cv_folds=5
)
print(f"Accuracy: {result.accuracy:.1%}")


1.2 Exploratory Analysis (Unknown Groupings)

When: You suspect patterns but don't have labels, or want to discover subgroups.

Decision factors:

- Dimensionality reduction first: Always start with PCA
- Cluster hypothesis testing: Use PERMANOVA or ANOSIM
- Outlier detection: Check before clustering

→ Method: PCA & Dimensionality Reduction
→ Example: Exploratory PCA in Examples
→ API: Chemometrics

Typical workflow:

from foodspec.chemometrics import run_pca

# Run PCA
pca_result = run_pca(
    X=fs.x,
    n_components=5,
    scale=True
)

# Visualize
pca_result.plot_scores(
    labels=fs.metadata["batch"],
    title="Batch Clustering"
)


2. Quantification

2.1 Mixture Analysis (Component Percentages)

When: You need to estimate the percentage composition of known components in mixtures.

Decision factors:

- Known pure references available: Use MCR-ALS or NNLS (an NNLS sketch follows the workflow below)
- No pure references: Use PLS regression with a calibration set
- 2-3 components: Direct peak ratios may suffice
- 4+ components: Multivariate methods required

→ Method: Mixture Models
→ Example: Mixture Analysis
→ API: Chemometrics - Mixture Analysis

Typical workflow:

from foodspec.chemometrics import mcr_als

# MCR-ALS for 3-component mixture
result = mcr_als(
    X=mixture_spectra,
    n_components=3,
    initial_guess=pure_spectra,
    max_iter=100
)

# Get concentrations
concentrations = result.C  # Sample × component
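
When pure reference spectra are available and you only need a quick estimate for a single mixture, non-negative least squares is a lightweight alternative to MCR-ALS. A minimal sketch using SciPy, with synthetic spectra purely for illustration:

import numpy as np
from scipy.optimize import nnls

# Synthetic pure-component spectra (3 components x 500 wavenumbers) and one mixture
rng = np.random.default_rng(0)
pure = rng.random((3, 500))
true_fractions = np.array([0.5, 0.3, 0.2])
mixture = true_fractions @ pure

# NNLS solves mixture ~= pure.T @ c subject to c >= 0
c, residual = nnls(pure.T, mixture)
print("Estimated fractions:", np.round(c / c.sum(), 3))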


2.2 Regression (Continuous Properties)

When: Predict continuous values (moisture %, protein content, peroxide value).

Decision factors:

- Linear relationship expected: PLS regression
- Nonlinear relationships: Random forest, neural networks
- Small calibration set (<50): PLS with careful validation
- Large calibration set (>200): More complex models feasible

→ Method: Classification & Regression
→ Example: Calibration Example
→ API: ML & Validation

Typical workflow:

from foodspec.ml import nested_cross_validate

# PLS regression with nested CV
results = nested_cross_validate(
    X=fs.x,
    y=fs.metadata["moisture_percent"],
    model="pls",
    cv_folds=5,
    n_components_range=[1, 2, 3, 5, 10]
)
print(f"R² = {results['r2']:.3f}, RMSE = {results['rmse']:.2f}")


3. Temporal Analysis & Monitoring

3.1 Heating & Degradation Monitoring

When: Track quality changes over time (oxidation, thermal degradation, shelf life).

Decision factors:

- Known degradation markers: Track specific peak ratios over time
- Unknown mechanisms: Use multivariate time-series analysis
- Predict shelf life: Fit degradation models to ratio trajectories (see the curve-fit sketch after the workflow below)

→ Method: Statistical Analysis
→ Example: Heating Quality Monitoring
→ API: Workflows

Typical workflow:

from foodspec.workflows import analyze_heating_trajectory

# Analyze time series
result = analyze_heating_trajectory(
    spectra=fs,
    time_column="heating_time_min",
    ratio_numerator=1655,  # C=C unsaturation
    ratio_denominator=1440  # CH2 reference
)

# Get shelf life estimate
shelf_life = result.estimate_shelf_life(threshold=0.8)
print(f"Estimated shelf life: {shelf_life} min")
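
If you want to fit the degradation model yourself, the underlying idea can be sketched with SciPy: fit a first-order decay to the ratio trajectory and solve for the time at which it crosses a quality threshold. The numbers below are made up for illustration; the workflow above is the in-library route.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical I(1655)/I(1440) ratio measured over heating time (minutes)
t = np.array([0, 30, 60, 120, 240, 480])
ratio = np.array([1.00, 0.93, 0.87, 0.76, 0.60, 0.42])

# First-order decay of the unsaturation marker
def decay(t, r0, k):
    return r0 * np.exp(-k * t)

(r0, k), _ = curve_fit(decay, t, ratio, p0=(1.0, 0.001))

# Time at which the ratio drops below a quality threshold of 0.8
t_limit = np.log(r0 / 0.8) / k
print(f"Ratio falls below 0.8 after ~{t_limit:.0f} min")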


3.2 Batch Quality Control

When: Monitor production batches for consistency and drift detection.

Decision factors:

- Continuous monitoring: Control charts with Hotelling's T² (see the sketch after the workflow below)
- Batch comparison: ANOVA or Kruskal-Wallis tests
- Outlier detection: Mahalanobis distance or PCA residuals
- Small batch sizes (<10): Use robust statistics

→ Method: Statistical Study Design
→ Example: Batch QC Workflow
→ API: Statistics

Typical workflow:

from foodspec.qc import check_class_balance, detect_outliers

# Check batch consistency
balance = check_class_balance(fs.metadata, "batch_id")
outliers = detect_outliers(
    fs.x,
    method="mahalanobis",
    threshold=3.0
)

# Statistical comparison at a single peak position
from foodspec.stats import run_anova

peak_idx = 200  # hypothetical column index of the peak of interest
anova_result = run_anova(
    fs.x[:, peak_idx],
    groups=fs.metadata["batch_id"]
)
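
The control-chart option mentioned above (Hotelling's T²) can be sketched from PCA scores: fit PCA on in-control reference batches, then score new batches against it. Synthetic data is used purely for illustration; check whether your FoodSpec version ships a dedicated helper before rolling your own.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic in-control reference spectra and a new production batch
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(60, 500))
X_new = rng.normal(size=(10, 500))

# Fit PCA on the in-control data, then project the new batch
pca = PCA(n_components=3).fit(X_ref)
scores_new = pca.transform(X_new)

# Hotelling's T²: squared scores weighted by the variance of each component
t2 = np.sum(scores_new**2 / pca.explained_variance_, axis=1)
print("T² per new sample:", np.round(t2, 2))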


4. Harmonization & Instrument Comparability

4.1 Different Instruments (Same Sample Type)

When: Combine data from multiple Raman or FTIR instruments measuring the same samples.

Decision factors:

- Standards available: Piecewise Direct Standardization (PDS)
- No standards, but overlapping samples: Direct Standardization (DS; see the sketch after the workflow below)
- Completely different wavelength ranges: May not be harmonizable

→ Method: Harmonization Theory
→ Example: Multi-Instrument Workflow
→ API: Calibration Transfer

Typical workflow:

from foodspec.calibration_transfer import piecewise_direct_standardization

# Transfer from instrument A to B
transfer = piecewise_direct_standardization(
    X_source=spectra_instrument_A,
    X_target=spectra_instrument_B,
    window_size=11
)

# Apply to new measurements
X_harmonized = transfer.transform(X_new_from_A)
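
For the Direct Standardization (DS) option listed above, the core idea fits in a few lines of NumPy: estimate one transfer matrix from samples measured on both instruments. This is an illustration with synthetic data only; foodspec.calibration_transfer is the supported route shown above.

import numpy as np

# Synthetic paired measurements: the same 25 samples on instruments A and B
rng = np.random.default_rng(0)
X_A = rng.normal(size=(25, 400))
X_B = X_A + 0.05 * rng.normal(size=(25, 400))

# Direct Standardization: one least-squares transfer matrix F with X_B ~= X_A @ F
F = np.linalg.pinv(X_A) @ X_B
X_A_on_B_scale = X_A @ F  # instrument-A spectra expressed on instrument B's scale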


4.2 Different Matrices (Same Measurement Goal)

When: Compare oils in pure form vs. oils extracted from fried chips, or milk vs. cheese.

Decision factors:

- Known matrix effects: Apply matrix-specific corrections first
- Unknown effects: Domain adaptation or transfer learning
- Small target-matrix dataset: Use the source-matrix model with caution

→ Method: Matrix Effects
→ Example: Matrix Correction
→ API: Workflows - Matrix Correction

Typical workflow:

from foodspec.matrix_correction import apply_matrix_correction

# Correct for matrix effects
corrected = apply_matrix_correction(
    X_target=chips_spectra,
    X_reference=oil_spectra,
    method="msc"
)


5. Preprocessing & Data Cleaning

5.1 Which Preprocessing Steps Do I Need?

Decision factors by symptom:

| Symptom | Solution | Method | API |
| --- | --- | --- | --- |
| Curved baselines, fluorescence | Baseline correction | Baseline Correction | baseline_als |
| Different intensities, scaling issues | Normalization | Normalization | normalize_snv |
| Noisy spectra, hard to see peaks | Smoothing | Smoothing | savgol_smooth |
| Cosmic ray spikes (Raman) | Spike removal | Cosmic Rays | CosmicRayRemover |
| Overlapping peaks | Derivatives (1st/2nd) | Derivatives | savgol_smooth |
| Scatter effects, particle size | MSC/SNV | Scatter Correction | MSCNormalizer |

Recommended preprocessing order:

1. Cosmic ray removal (if Raman)
2. Baseline correction (if curved backgrounds)
3. Smoothing (if noisy)
4. Normalization (SNV or MSC)
5. Derivatives (optional, for overlapping peaks)
6. Feature extraction or full-spectrum modeling

→ Full Guide: Preprocessing Methods Overview
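
As a rough sketch of that order using the fluent API shown earlier in this guide: only baseline_als() and normalize_snv() appear in the examples above, so the remaining steps are placeholders pointing at the table entries, whose exact method names and signatures may differ in your FoodSpec version. The file name is hypothetical.

from foodspec import FoodSpec

fs = FoodSpec.from_csv("samples.csv", modality="raman")  # hypothetical file

# 1. Cosmic ray removal (Raman only)        -> CosmicRayRemover in the table above
# 2. Baseline correction (curved backgrounds)
fs = fs.baseline_als()
# 3. Smoothing (noisy spectra)              -> savgol_smooth
# 4. Normalization
fs = fs.normalize_snv()
# 5. Derivatives (optional, overlapping peaks)
# 6. Continue to feature extraction or full-spectrum modeling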


📊 Dataset Size & Complexity Guide

Small Datasets (<100 samples)

Challenges: Limited statistical power, risk of overfitting.

Recommended approaches:

- Preprocessing: Conservative (avoid over-smoothing)
- Feature selection: Use a priori knowledge (literature-based peaks)
- Validation: Leave-one-out CV or stratified 5-fold CV
- Models: Simple models (PLS-DA, linear regression)
- Avoid: Deep learning, complex ensemble methods

→ Guide: Study Design - Sample Size
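
A leave-one-out sketch with scikit-learn, using a linear discriminant as a simple stand-in for PLS-DA and synthetic data for illustration; foodspec's classify() shown in section 1.1 is the in-library route for stratified k-fold.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic tiny dataset: 40 spectra, 50 features, 2 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))
y = np.repeat([0, 1], 20)

# Leave-one-out keeps every sample available for training
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {scores.mean():.1%}")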


Medium Datasets (100-1000 samples)

Opportunities: Moderate statistical power, can test multiple methods.

Recommended approaches:

- Preprocessing: Standard pipelines
- Feature selection: Data-driven + domain knowledge hybrid
- Validation: Nested cross-validation with a holdout test set
- Models: PLS, random forests, gradient boosting
- Hyperparameter tuning: Grid search feasible

→ Guide: Cross-Validation Best Practices
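
The holdout-plus-tuning pattern can be sketched with scikit-learn: hold out a test set, tune the model by cross-validation on the training portion, then report performance on the untouched holdout. Synthetic data for illustration; foodspec.ml.nested_cross_validate in section 2.2 covers the fully nested variant.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic medium-sized calibration set
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 600))
y = 2.0 * X[:, 100] + rng.normal(scale=0.1, size=300)

# Hold out a test set, tune n_components by 5-fold CV on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
search = GridSearchCV(
    PLSRegression(),
    param_grid={"n_components": [2, 4, 6, 8, 10]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print("Best n_components:", search.best_params_["n_components"])
print(f"Held-out R²: {search.best_estimator_.score(X_test, y_test):.3f}")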


Large Datasets (>1000 samples)

Opportunities: High statistical power, can use complex models.

Recommended approaches:

- Preprocessing: Automated pipelines acceptable
- Feature selection: Automatic feature-importance ranking
- Validation: Train/validation/test splits
- Models: Neural networks, deep learning, ensembles
- Advanced techniques: Transfer learning, multi-task learning

→ Guide: Advanced Deep Learning


🧪 Sample Matrix Guide

Pure Liquids (Oils, Solvents)

Characteristics: Minimal scatter, good optical contact.

Preprocessing:

- Baseline: Mild (ALS with conservative parameters)
- Normalization: Vector or area normalization
- Scatter correction: Usually not needed

→ Example: Oil Authentication


Powders & Solids (Flour, Spices)

Characteristics: High scatter from particle size variations.

Preprocessing:

- Baseline: Aggressive (ALS or rubberband)
- Normalization: SNV or MSC (critical)
- Scatter correction: Essential

→ Method: Scatter Correction
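
What SNV and MSC actually do can be written in a few lines of NumPy (illustration only; use the library's normalize_snv / MSCNormalizer in practice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 800))  # synthetic powder spectra, one row per sample

# SNV: centre and scale each spectrum by its own mean and standard deviation
X_snv = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# MSC: regress each spectrum against the mean spectrum, remove slope and offset
reference = X.mean(axis=0)
X_msc = np.empty_like(X)
for i, spectrum in enumerate(X):
    slope, offset = np.polyfit(reference, spectrum, 1)
    X_msc[i] = (spectrum - offset) / slope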


Emulsions & Suspensions (Milk, Juices)

Characteristics: Complex scatter, heterogeneous.

Preprocessing:

- Baseline: Moderate
- Normalization: MSC with robust mean
- Homogenization: May need sample prep guidance


Tissue & Meat Products

Characteristics: Variable water content, complex matrix.

Preprocessing:

- Baseline: Essential
- Normalization: SNV recommended
- Water bands: May need masking (1640 cm⁻¹, 3200-3600 cm⁻¹)
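
Masking the water bands amounts to a boolean index over the wavenumber axis. A minimal sketch with hypothetical axis and spectra arrays:

import numpy as np

# Hypothetical wavenumber axis (cm⁻¹) and spectra (rows = samples)
wavenumbers = np.linspace(400, 4000, 1800)
spectra = np.random.default_rng(0).normal(size=(20, wavenumbers.size))

# Drop the regions around ~1640 cm⁻¹ and 3200-3600 cm⁻¹ before modelling
water = ((wavenumbers > 1600) & (wavenumbers < 1700)) | \
        ((wavenumbers > 3200) & (wavenumbers < 3600))
spectra_masked = spectra[:, ~water]
wavenumbers_masked = wavenumbers[~water]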


🔗 Cross-Reference Table

| Goal | Method Page | Example | API |
| --- | --- | --- | --- |
| Oil authentication | Classification | Oil Example | ML API |
| Heating monitoring | Statistics | Heating Example | Workflows API |
| Mixture quantification | Mixture Models | Mixture Example | Chemometrics API |
| Hyperspectral mapping | Spatial Analysis | HSI Example | Datasets API |
| Baseline correction | Baseline Methods | Recipe Card #2 | Preprocessing API |
| PCA exploration | PCA Guide | PCA Examples | Chemometrics API |
| Batch QC | Study Design | QC Workflow | Statistics API |
| Multi-instrument | Harmonization | Harmonization Workflow | Workflows API |

🧭 Still Not Sure?

If you're uncertain which approach to use:

  1. Start simple: Run PCA on preprocessed data to visualize structure
  2. Check assumptions: Read Study Design for sample size guidance
  3. Try examples: Run the closest teaching example with your data
  4. Ask for help: See FAQ or community discussions

Common pitfalls to avoid:

- ❌ Applying complex models to small datasets
- ❌ Skipping preprocessing for raw spectra
- ❌ Not validating results properly (train/test leakage)
- ❌ Ignoring matrix effects in heterogeneous samples

Best practices:

- ✅ Start with visualization (PCA, score plots)
- ✅ Use domain knowledge for feature selection
- ✅ Validate rigorously (nested CV or holdout test)
- ✅ Document preprocessing decisions
- ✅ Report uncertainty (confidence intervals, p-values)


📚 Further Reading