Glossary¶
Purpose: Define key terminology for FoodSpec documentation at multiple levels (plain English, technical, mathematical).
Audience: Anyone encountering unfamiliar terms—from absolute beginners to domain experts.
Time: 2–5 minutes per term lookup.
Prerequisites: None (designed for all audiences).
Starter Terms (Plain English / Layer 1)¶
Spectrum (plural: Spectra)¶
Layman: A "fingerprint" showing how a sample reflects or absorbs different colors of light. Each food has a unique pattern.
Food Scientist: An array of intensity measurements across wavelengths/wavenumbers, capturing molecular vibrations (Raman/FTIR) or electronic transitions (UV-Vis).
Physicist: Intensity $I(\tilde{\nu})$ as a function of wavenumber $\tilde{\nu}$ (cm⁻¹), encoding vibrational transitions via $I \propto \left(\frac{\partial\alpha}{\partial Q}\right)^2$ for Raman scattering.
Peak / Band¶
Layman: A bump in the fingerprint graph. Each bump represents a specific type of molecule (like a signature ingredient).
Food Scientist: A local maximum in a spectrum corresponding to a specific molecular vibration (e.g., carbonyl C=O at 1740 cm⁻¹).
Physicist: Spectral feature arising from vibrational mode $\nu_i$ with frequency $\omega_i = \sqrt{k/\mu}$ (spring constant $k$, reduced mass $\mu$).
Ratio¶
Layman: Comparing two bumps in the fingerprint. Example: If bump A is twice as tall as bump B, the ratio is 2:1. This cancels out lighting differences.
Food Scientist: Intensity or area of one peak divided by another (e.g., $I_{1742} / I_{2720}$) to normalize illumination and focus on chemical composition changes.
Physicist: $R = \frac{I(\tilde{\nu}_1)}{I(\tilde{\nu}_2)}$ where $I(\tilde{\nu}_i) = I_0 \cdot \sigma(\tilde{\nu}_i) \cdot N_i$. Ratio cancels common factors ($I_0$, path length).
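A minimal numpy sketch of a peak ratio (illustrative only, not the FoodSpec API; the band positions and the `window` half-width are example values):

```python
import numpy as np

def peak_ratio(wavenumbers, intensities, band_a=1742.0, band_b=2720.0, window=5.0):
    """Ratio of the maximum intensities within +/- `window` cm^-1 of two bands."""
    def band_max(center):
        mask = np.abs(wavenumbers - center) <= window
        return intensities[mask].max()
    return band_max(band_a) / band_max(band_b)
```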
Authentication¶
Layman: Proving food is what it claims to be (e.g., verifying olive oil isn't mixed with cheaper sunflower oil).
Food Scientist: Classification workflow distinguishing genuine samples from adulterants using spectral fingerprints and statistical models.
Physicist: Pattern recognition via supervised learning ($\mathbf{y} = f(\mathbf{X})$) where $\mathbf{X}$ are spectral features and $\mathbf{y}$ are class labels, validated with cross-validation.
Preprocessing¶
Layman: Cleaning the data to remove noise and background interference—like adjusting a blurry photo before analyzing it.
Food Scientist: Baseline correction (removing background fluorescence/drift), smoothing (noise reduction), and normalization (standardizing intensity scales).
Physicist: Operations including ALS baseline estimation ($\min_{\mathbf{b}} \sum_i w_i (y_i - b_i)^2 + \lambda \sum_i (\Delta^2 b_i)^2$ with asymmetric weights $w_i$), Savitzky-Golay filtering, and reference peak normalization.
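An illustrative preprocessing sketch with numpy/scipy (not the FoodSpec pipeline; the filter settings, the crude polynomial baseline, and the 2720 cm⁻¹ reference band are example choices):

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(wavenumbers, spectrum, ref_band=2720.0):
    # 1. Smoothing: Savitzky-Golay filter (window and order are examples)
    smoothed = savgol_filter(spectrum, window_length=11, polyorder=3)
    # 2. Baseline: crude low-order polynomial fit (ALS is sketched further below)
    coeffs = np.polyfit(wavenumbers, smoothed, deg=3)
    corrected = smoothed - np.polyval(coeffs, wavenumbers)
    # 3. Normalization: divide by the intensity at the reference band
    ref_idx = np.argmin(np.abs(wavenumbers - ref_band))
    return corrected / corrected[ref_idx]
```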
Wavenumber Axis¶
Layman: The x-axis of your spectrum graph, showing different "colors" of infrared light from low to high energy.
Food Scientist: The spectral dimension measured in wavenumbers (cm⁻¹), inversely proportional to wavelength. Higher wavenumbers = higher energy vibrations.
Physicist: $\tilde{\nu} = \frac{1}{\lambda} = \frac{\omega}{2\pi c}$ where $\lambda$ is wavelength, $\omega$ is angular frequency, and $c$ is speed of light. For FTIR: typically 4000-400 cm⁻¹; Raman: varies by laser.
Conventions: Always ascending order (low → high wavenumbers). Resolution typically 2-8 cm⁻¹ for FTIR, 1-4 cm⁻¹ for Raman.
Resolution (Spectral)¶
Layman: How detailed your measurement is—like the difference between regular and high-definition TV. Higher resolution shows finer details.
Food Scientist: The smallest wavenumber interval that can be distinguished. Lower values (e.g., 2 cm⁻¹) = better resolution = sharper peaks.
Physicist: $\Delta\tilde{\nu} = \frac{1}{2L}$ where $L$ is the optical path difference in Fourier-transform instruments. Determines peak width via convolution with instrumental line shape.
Typical ranges: FTIR: 2-8 cm⁻¹, Raman: 1-4 cm⁻¹, NIR: 4-16 cm⁻¹.
Interpolation¶
Layman: Filling in the gaps when your data points don't line up—like drawing a smooth curve through dots on graph paper.
Food Scientist: Resampling spectra to a common wavenumber grid before analysis. Required when combining data from different instruments with different sampling points.
Physicist: Computing $I(\tilde{\nu}_{\text{target}})$ from measured $I(\tilde{\nu})$ via linear, cubic spline, or sinc interpolation. Preserves peak positions but may introduce artifacts if over-sampled.
Best practice: Use cubic spline for smooth spectra, linear for noisy data. Never interpolate to resolution finer than instrument capability.
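A hedged sketch of resampling onto a common grid with numpy/scipy (the function name and grid limits are illustrative):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample(wn_measured, intensity, wn_target, method="linear"):
    """Resample a spectrum onto `wn_target`; wavenumber axes must be ascending."""
    if method == "linear":          # robust choice for noisy data
        return np.interp(wn_target, wn_measured, intensity)
    return CubicSpline(wn_measured, intensity)(wn_target)  # smooth spectra

# Example: a shared 4 cm^-1 grid for combining two instruments
common_grid = np.arange(400.0, 4000.0, 4.0)
```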
Baseline¶
Layman: The "background hum" in your spectrum—like removing static noise from a radio signal before listening to music.
Food Scientist: Slowly varying background signal from fluorescence, instrument drift, or sample matrix. Must be removed to isolate true peaks.
Physicist: Additive offset $B(\tilde{\nu})$ where observed $I_{\text{obs}}(\tilde{\nu}) = I_{\text{true}}(\tilde{\nu}) + B(\tilde{\nu})$. Estimated via polynomial fitting, rubber-band methods, or ALS algorithm.
Warning: Over-correction removes real broad peaks; under-correction biases peak heights.
Normalization¶
Layman: Adjusting all spectra to the same "volume level" so you can fairly compare them—like setting all songs to the same loudness.
Food Scientist: Scaling intensity values to account for variations in sample thickness, concentration, or illumination. Enables quantitative comparison across samples.
Physicist: Transformation $I'(\tilde{\nu}) = f(I(\tilde{\nu}))$ where $f$ can be:
- SNV (Standard Normal Variate): $(I - \mu_I) / \sigma_I$
- Vector norm: $I / ||I||_2$ or $I / ||I||_1$
- Reference peak: $I / I(\tilde{\nu}_{\text{ref}})$
- MSC: Multiplicative Scatter Correction
Best practice: Choose based on physics (reference peak for internal standards, SNV for scatter correction, vector norm for concentration independence).
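Minimal numpy implementations of the listed normalizations (illustrative function names, not the FoodSpec API):

```python
import numpy as np

def snv(spectrum):
    """Standard Normal Variate: center and scale each spectrum."""
    return (spectrum - spectrum.mean()) / spectrum.std(ddof=1)

def vector_norm(spectrum, order=2):
    """Divide by the L2 (or L1) norm of the spectrum."""
    return spectrum / np.linalg.norm(spectrum, ord=order)

def reference_peak(spectrum, wavenumbers, ref_band=2720.0):
    """Divide by the intensity at a reference band (example: 2720 cm^-1)."""
    return spectrum / spectrum[np.argmin(np.abs(wavenumbers - ref_band))]
```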
Label Encoding¶
Layman: Converting category names into numbers the computer can understand—like assigning "olive oil" = 1, "sunflower oil" = 2.
Food Scientist: Mapping categorical labels (variety, batch, treatment) to integers for machine learning models.
Physicist: Bijection $\phi: \{\text{class}_1, \ldots, \text{class}_C\} \to \{0, 1, \ldots, C-1\}$ or one-hot encoding $y_i \to [0, 0, 1, 0, \ldots]$ for $C$ classes.
Warning: Use one-hot encoding for linear and distance-based models; tree-based models handle integer codes directly, and neural networks with embedding layers expect integer indices. Never impose an arbitrary integer ordering (e.g., 1=low, 2=medium, 3=high) unless a true ordinal relationship exists.
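A scikit-learn sketch of both encodings (the oil names are placeholder labels):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(["olive_oil", "sunflower_oil", "olive_oil", "palm_oil"])

# Integer encoding: classes mapped to 0..C-1 (alphabetical order, not ordinal)
int_codes = LabelEncoder().fit_transform(labels)

# One-hot encoding: one binary column per class
one_hot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
```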
Sample vs. Replicate¶
Layman: A sample is one bottle of oil; a replicate is measuring that same bottle multiple times. Samples vary naturally (biology); replicates vary from measurement error.
Food Scientist: Sample = independent biological/physical unit (different batches, sources). Replicate = repeated measurement of same sample (technical variation). Important for proper CV strategy.
Physicist: Variance decomposition: $\sigma_{\text{total}}^2 = \sigma_{\text{biological}}^2 + \sigma_{\text{technical}}^2$. Replicates estimate $\sigma_{\text{technical}}$; samples estimate $\sigma_{\text{biological}}$.
Best practice: Keep replicates together in same fold during CV to avoid leakage (see below). Report both technical and biological variability separately.
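A sketch of the variance decomposition on synthetic data (10 samples × 3 replicates of a single peak ratio; all numbers are dummy values):

```python
import numpy as np

rng = np.random.default_rng(0)
sample_means = rng.normal(loc=2.0, scale=0.3, size=10)                    # biological spread
ratios = sample_means[:, None] + rng.normal(scale=0.05, size=(10, 3))     # + technical noise

var_technical = ratios.var(axis=1, ddof=1).mean()                     # within-sample variance
var_biological = ratios.mean(axis=1).var(ddof=1) - var_technical / 3  # between-sample estimate
```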
Matrix Effect¶
Layman: When the "packaging" affects the measurement—like how coffee tastes different in a plastic cup vs. ceramic mug, even though it's the same coffee.
Food Scientist: The food substrate (e.g., potato chips, meat) altering the spectral signature of the analyte of interest (e.g., frying oil). Complicates quantification.
Physicist: Interaction between analyte and matrix modifying absorption coefficients, scattering properties, or chemical environment. Described by Beer-Lambert deviations: $A \neq \epsilon c l$.
Mitigation strategies: Matrix-matched calibration, standard addition methods, or chemometric models trained on diverse matrices.
Leakage (Data Leakage)¶
Layman: Accidentally giving the computer the answers during training—like studying for a test using the actual test questions. Makes results look better than they really are.
Food Scientist: Occurs when test data information "leaks" into training. Common sources: (1) replicates split across folds, (2) preprocessing on full dataset before CV, (3) feature selection using test labels.
Physicist: Correlation between training set $\mathcal{D}_{\text{train}}$ and test set $\mathcal{D}_{\text{test}}$ such that the mutual information $I(\mathcal{D}_{\text{train}}; \mathcal{D}_{\text{test}}) > 0$.
How to prevent:
- Keep replicates together in the same fold
- Perform preprocessing within each CV fold separately
- Use nested CV for hyperparameter tuning
- Never look at test labels before final evaluation
Warning: Leakage inflates performance metrics (accuracy, R²) by 10-50%, leading to models that fail in production.
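A hedged sketch of leakage-safe evaluation with scikit-learn: preprocessing lives inside a Pipeline (so it is re-fit on each training fold only) and GroupKFold keeps all replicates of a sample together. `X`, `y`, and `sample_ids` are placeholders; the dummy data exist only to make the snippet runnable:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))              # 60 spectra x 50 features (dummy)
y = rng.integers(0, 2, size=60)            # binary labels (dummy)
sample_ids = np.repeat(np.arange(20), 3)   # 20 samples x 3 replicates

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=sample_ids)
print(scores.mean())
```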
CV Strategy (Cross-Validation Strategy)¶
Layman: The rules for how you split your data into training and testing groups—like deciding how to divide teams for a practice game.
Food Scientist: The splitting scheme for estimating model generalization. Must match deployment scenario (random = similar samples, batch-aware = new instruments, temporal = future predictions).
Physicist: Partitioning function $\pi: \mathcal{D} \to \{\mathcal{D}_{\text{train}}^{(k)}, \mathcal{D}_{\text{test}}^{(k)}\}_{k=1}^{K}$ with constraints based on metadata (batch, time, replicate groups).
Common strategies:
- Random K-fold: shuffle and split randomly (valid only if samples are truly independent)
- Stratified K-fold: preserves class proportions in each fold
- Group K-fold: keeps sample groups (replicates, batches) together (prevents leakage)
- Time-series split: train on past, test on future (for temporal data)
- Leave-one-batch-out: train on N−1 batches, test on the held-out batch (harshest test)
Best practice for FoodSpec: Use Group K-fold with sample_id groups to keep replicates together. For multi-instrument studies, use Leave-one-batch-out.
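The strategies above map roughly onto standard scikit-learn splitters as follows (a sketch; `sample_ids` and `batch_ids` are assumed metadata arrays passed as `groups=` when splitting):

```python
from sklearn.model_selection import (KFold, StratifiedKFold, GroupKFold,
                                     LeaveOneGroupOut, TimeSeriesSplit)

random_cv  = KFold(n_splits=5, shuffle=True, random_state=0)    # random K-fold
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
group_cv   = GroupKFold(n_splits=5)       # groups=sample_ids keeps replicates together
batch_out  = LeaveOneGroupOut()           # groups=batch_ids -> leave-one-batch-out
temporal   = TimeSeriesSplit(n_splits=5)  # train on past, test on future
```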
Cross-Validation (CV)¶
Layman: Testing the computer's learning by hiding some data, training on the rest, then checking if it predicts the hidden data correctly. Prevents cheating.
Food Scientist: Splitting data into training/test folds multiple times to estimate model generalization. Batch-aware CV keeps batches intact to avoid leakage.
Physicist: Partitioning $\mathcal{D}$ into $K$ folds $\{\mathcal{D}_k\}_{k=1}^{K}$, computing $\text{Acc} = \frac{1}{K} \sum_{k=1}^{K} \text{Acc}(\hat{f}_{-k}, \mathcal{D}_k)$ where $\hat{f}_{-k}$ is trained without fold $k$.
Batch / Instrument Drift¶
Layman: Different machines or measurement days produce slightly different readings for the same sample—like two bathroom scales giving different weights.
Food Scientist: Systematic variation across instruments, operators, or time periods that must be accounted for in validation to ensure models generalize.
Physicist: Additive/multiplicative offsets $\mathbf{y}_{\text{batch}} = a \cdot \mathbf{y} + \mathbf{b}$ requiring harmonization (calibration transfer, multiplicative scatter correction).
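A minimal sketch of the affine model, assuming the same reference sample was measured on both instruments (all arrays below are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
spec_a = rng.normal(size=200)                                    # reference on instrument A
spec_b = 1.3 * spec_a + 0.2 + rng.normal(scale=0.01, size=200)   # same sample on instrument B

a, b = np.polyfit(spec_b, spec_a, deg=1)   # least squares: spec_a ≈ a * spec_b + b
spec_b_corrected = a * spec_b + b          # map instrument B onto instrument A's scale
```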
Balanced Accuracy¶
Layman: Accuracy that treats all categories fairly, even if you have way more samples of one type. Example: If you test 100 olive oils and only 10 palm oils, this metric doesn't overemphasize olive oil.
Food Scientist: Average of per-class recall scores: $\text{BA} = \frac{1}{C} \sum_{c=1}^C \frac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}$. Corrects class imbalance.
Physicist: Macro-averaged sensitivity across $C$ classes, invariant to class priors unlike raw accuracy.
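A toy illustration of the difference on an imbalanced set (90 olive vs. 10 palm samples, with a degenerate model that always predicts the majority class):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["olive"] * 90 + ["palm"] * 10
y_pred = ["olive"] * 100                        # always predicts the majority class

print(accuracy_score(y_true, y_pred))           # 0.90 -- looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- chance level
```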
FDR (False Discovery Rate)¶
Layman: When testing many things at once (e.g., 1000 peaks), some will look important by chance. FDR controls how many false alarms you accept.
Food Scientist: Multiple-testing correction ensuring expected proportion of false positives $\leq \alpha$ (e.g., 5%). Use Benjamini-Hochberg procedure.
Physicist: $\text{FDR} = \mathbb{E}[\frac{V}{R}]$ where $V$ = false positives, $R$ = total rejections. BH controls FDR at level $\alpha$.
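Benjamini-Hochberg correction in one call via statsmodels (the p-values are placeholders):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.004, 0.03, 0.20, 0.45])
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```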
Protocol / Workflow¶
Layman: A recipe for analyzing food samples—step-by-step instructions the computer follows automatically.
Food Scientist: YAML file specifying preprocessing, harmonization, QC checks, RQ analysis, and output generation with validation strategy.
Physicist: Declarative pipeline $\mathcal{P} = \{s_1, s_2, \ldots, s_n\}$ where each step $s_i$ applies transformation $T_i: \mathcal{X}_{i-1} \to \mathcal{X}_i$ with versioned parameters.
Technical Terms (Layer 2-3)¶
RQ (Ratio-Quality) Engine¶
Definition: FoodSpec module computing stability (CV/MAD), discriminative power (ANOVA F, effect sizes), trends (regression slopes), divergence (two-group comparisons), minimal panels (greedy feature selection), and clustering metrics on peak areas and ratios.
When to use: Quality control workflows requiring identification of reproducible, discriminative markers.
Assumptions: Approximately normal distributions for parametric tests; sufficient sample size ($n \geq 20$ per group).
ALS (Asymmetric Least Squares) Baseline¶
Definition: Baseline correction estimating $\mathbf{b}$ by minimizing $\sum_i w_i (y_i - b_i)^2 + \lambda \sum_i (\Delta^2 b_i)^2$ with asymmetric weights $w_i = p$ for $y_i > b_i$ and $w_i = 1 - p$ otherwise (small $p$), so the baseline ignores peaks and follows the lower envelope of the spectrum.
When to use: Removing background fluorescence/drift in Raman/FTIR spectra.
Failure modes: Over-smoothing ($\lambda$ too large) removes real peaks; under-smoothing ($\lambda$ too small) retains baseline.
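A sketch of Eilers-style ALS baseline estimation (the default `lam`, `p`, and iteration count are illustrative, not FoodSpec defaults):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def als_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a baseline that hugs the lower envelope of spectrum `y`."""
    n = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(n, n - 2))  # 2nd-difference operator
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        b = spsolve((W + lam * D @ D.T).tocsc(), w * y)
        w = np.where(y > b, p, 1 - p)   # down-weight points above the baseline (peaks)
    return b
```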
HSI (Hyperspectral Imaging)¶
Definition: 3D datacube $(x, y, \tilde{\nu})$ capturing spatially resolved spectra at each pixel. Enables mapping chemical composition across surfaces.
When to use: Surface contamination detection, heterogeneity analysis, ROI extraction.
Assumptions: Stable illumination across field of view; negligible spatial-spectral coupling.
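A numpy sketch of the datacube layout and a region-of-interest mean spectrum (the cube and ROI indices are placeholders):

```python
import numpy as np

cube = np.zeros((128, 128, 500))     # (rows, cols, wavenumbers) dummy datacube
roi = cube[20:40, 50:80, :]          # rectangular spatial region of interest
roi_mean_spectrum = roi.reshape(-1, cube.shape[-1]).mean(axis=0)   # one spectrum for the ROI
```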
MOATS (Model Optimized by Accumulated Threshold Selection)¶
Definition: Feature selection algorithm maximizing classification accuracy while minimizing feature count. Iteratively adds features based on cumulative importance.
When to use: Building minimal marker panels for cost-effective QA/QC (fewer measurements = faster/cheaper).
Failure modes: Greedy selection may miss optimal combinations; requires validation on independent test set.
Harmonization¶
Definition: Aligning spectra from different instruments/batches via wavenumber calibration, power normalization, or calibration transfer (e.g., Piecewise Direct Standardization).
When to use: Multi-instrument studies, longitudinal monitoring, pooling data across labs.
Assumptions: Linear/affine relationship between instruments; consistent sample preparation.
Mathematical Notation¶
Spectroscopy Symbols¶
- $I(\tilde{\nu})$: Intensity at wavenumber $\tilde{\nu}$ (cm⁻¹)
- $\lambda$: Wavelength (nm); $\tilde{\nu} = 10^7 / \lambda$
- $\frac{\partial\alpha}{\partial Q}$: Polarizability derivative (determines Raman activity)
- $\sigma$: Scattering cross-section
Linear Algebra¶
- $\mathbf{X}$: Data matrix (rows = samples, columns = features)
- $\mathbf{y}$: Response vector (observed spectrum or labels)
- $\Sigma$: Covariance matrix ($\Sigma = \frac{1}{n-1} \mathbf{X}^\top \mathbf{X}$ for mean-centered $\mathbf{X}$)
- $V, \Lambda$: Eigenvector and eigenvalue matrices in PCA ($\Sigma = V \Lambda V^\top$)
- $T, P$: PCA scores and loadings ($\mathbf{X} = TP^\top + E$)
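The PCA notation above maps onto scikit-learn roughly as follows (a sketch on dummy data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))        # 30 spectra x 100 wavenumbers (dummy)

pca = PCA(n_components=5).fit(X)
T = pca.transform(X)                  # scores   (30 x 5)
P = pca.components_.T                 # loadings (100 x 5)
X_hat = T @ P.T + X.mean(axis=0)      # reconstruction: X ≈ T P^T + mean (+ residual E)
```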
Statistics¶
- $\alpha$: Significance level (typically 0.05)
- $d$: Cohen's d effect size ($d = \frac{\mu_1 - \mu_2}{\sigma_{\text{pooled}}}$); see the sketch after this list
- $s^2$: Variance
- $f$: F-statistic (ANOVA, ratio of between-group to within-group variance)
- $p_{\text{perm}}$: Permutation p-value (fraction of permuted test statistics ≥ observed)
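A minimal sketch of Cohen's d with the pooled standard deviation (the group arrays `a` and `b` are placeholders):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two groups using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
```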
Abbreviations (Alphabetical)¶
- ALS: Asymmetric Least Squares baseline correction
- ANOVA: Analysis of Variance (tests group mean differences)
- AUC: Area Under the Curve (ROC or PR curve)
- CI: Confidence Interval
- CV: Coefficient of Variation ($\text{CV} = \sigma/\mu \times 100\%$) OR Cross-Validation
- DL: Deep Learning
- FDR: False Discovery Rate (multiple testing correction)
- F1: Harmonic mean of precision and recall: $F1 = \frac{2 \cdot \text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}}$
- FTIR: Fourier Transform Infrared Spectroscopy
- HSI: Hyperspectral Imaging
- IoU: Intersection over Union (segmentation accuracy metric)
- LOA: Limits of Agreement (Bland–Altman analysis)
- MAPE: Mean Absolute Percentage Error
- MAE: Mean Absolute Error
- MAD: Median Absolute Deviation (robust dispersion measure)
- MCC: Matthews Correlation Coefficient
- MCR-ALS: Multivariate Curve Resolution – Alternating Least Squares
- MOATS: Model Optimized by Accumulated Threshold Selection
- NIR: Near-Infrared Spectroscopy
- NNLS: Non-Negative Least Squares (constrained regression for mixture fractions)
- OC-SVM: One-Class Support Vector Machine (novelty detection)
- PCA: Principal Component Analysis
- PLS / PLS-DA: Partial Least Squares (Regression / Discriminant Analysis)
- PR curve: Precision–Recall curve
- QC: Quality Control
- RMSE: Root Mean Square Error
- ROC: Receiver Operating Characteristic curve
- RQ: Ratio-Quality (FoodSpec analysis engine)
- SNR: Signal-to-Noise Ratio
- t-SNE: t-distributed Stochastic Neighbor Embedding (visualization only, not for inference)
Domain-Specific Terms¶
Edible Oils¶
- OO: Olive Oil
- PO: Palm Oil
- VO: Vegetable Oil (often sunflower or canola)
- CO: Coconut Oil
- Carbonyl band: ~1740 cm⁻¹ (C=O stretch in oxidized oils)
- Unsaturation band: ~1650 cm⁻¹ (C=C stretch in double bonds)
- Reference peak: ~2720 cm⁻¹ (used for normalization)
Food Science¶
- Adulteration: Mixing with cheaper/undeclared ingredients
- Thermal degradation: Chemical changes during heating/frying (oxidation, polymerization)
- Matrix effect: Food substrate (e.g., chips) altering oil spectral signature
- Shelf life: Time until quality degrades below acceptable threshold
When Terms Are Used Incorrectly¶
Common Mistake: Using "accuracy" for imbalanced datasets.
Fix: Use balanced accuracy or F1 score.
Common Mistake: Calling t-SNE a "model."
Fix: t-SNE is visualization only; use PCA for dimensionality reduction in modeling.
Common Mistake: "Cross-validation" without specifying batch-aware.
Fix: Always state validation strategy (random, stratified, batch-aware, nested).
What's Next?¶
- See notation in context: RQ Engine Theory
- Understand validation terms: Validation Strategies
- Learn preprocessing methods: Preprocessing Recipes
Can't find a term? Open an issue: https://github.com/chandrasekarnarayana/foodspec/issues