Common Problems & Solutions

Purpose: Systematically diagnose and fix issues across all FoodSpec workflow stages.
Audience: Users troubleshooting preprocessing, ML, stats, or reporting steps.
Time to read: 20–30 minutes (reference guide; read sections as needed).
Prerequisites: Basic knowledge of your FoodSpec workflow stage.


Quick Problem Index

Stage | Problem | Symptoms | Quick Fix
--- | --- | --- | ---
Acquisition | Baseline drift | Sloping/curved baseline | ALS baseline correction with lambda ~1e5
Acquisition | Saturation | Flat-topped peaks | Lower laser power; re-acquire
Acquisition | Wavenumber drift | Peak shifts vs reference | Recalibrate instrument; check validate_spectrum_set
Acquisition | Low SNR | Noisy spectra | Longer integration; better optics; smoothing
Metadata | Missing labels | Unknown class IDs | Use check_missing_metadata; repair metadata
Metadata | Class imbalance | Poor minority recall | Use F1/PR metrics; resample or weight classes
Metadata | Mislabeled samples | Outlier confusions | Audit via PCA; verify and relabel
Preprocessing | Over-smoothing | Peak loss | Reduce Savitzky–Golay window/order
Preprocessing | Poor baseline removal | Residual slope | Tune ALS lambda; try rubberband baseline
Preprocessing | Scatter not removed | Intensity drift persists | Apply SNV or MSC normalization
ML | Overfitting | High train, low test accuracy | Regularize; simplify; use stratified CV
ML | Data leakage | Unrealistic CV scores | Ensure preprocessing inside Pipeline
ML | Imbalanced predictions | Minority class ignored | Use class_weight or SMOTE; report F1_macro
DL | Diverging loss | NaNs during training | Lower learning rate; add normalization
Stats | Non-normal residuals | Failed assumptions | Use nonparametric tests (Kruskal–Wallis)
Stats | Multiple comparisons | Many marginal p-values | Apply FDR/Tukey correction
Visualization | Unlabeled axes | Ambiguous plots | Label wavenumber (cm⁻¹), intensity (a.u.)
Reporting | Missing configs | Cannot reproduce | Export run_metadata.json; save configs
Workflow | Wrong task→metrics | Irrelevant metrics | Consult workflow design guide; clarify goal

A. Instrument & Acquisition Problems

Baseline drift / fluorescence

  • Why: sample fluorescence, laser instability, optics heating
  • Symptoms: sloping/curved baseline; high low-frequency power
  • Diagnose: overlay raw spectra; re-inspect the baseline after ALS/rubberband correction; estimate SNR with estimate_snr
  • Fix: apply baseline correction (ALS/rubberband); reduce laser power or integration time; recalibrate the instrument
  • Re-acquire: if the baseline consumes most of the dynamic range or varies widely from run to run
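
To make the ALS fix concrete, the following is a minimal sketch of asymmetric least squares baseline estimation written directly with NumPy and SciPy. It is not FoodSpec's own implementation; the function name `als_baseline`, the default `p=0.01`, the iteration count, and the synthetic spectrum are assumptions for illustration. Only `lam=1e5` comes from the quick-fix table.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def als_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a smooth baseline under y (hypothetical helper, not FoodSpec API)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator: penalizes curvature of the baseline.
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w, 0)
        z = spsolve(sparse.csc_matrix(W + penalty), w * y)
        # Points above the baseline (peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


# Synthetic example: linear drift plus one Gaussian band (made-up data).
x = np.linspace(400.0, 1800.0, 500)
drift = 0.003 * (x - 400.0)
peak = 10.0 * np.exp(-0.5 * ((x - 1000.0) / 15.0) ** 2)
raw = drift + peak
corrected = raw - als_baseline(raw)
```

A larger `lam` gives a stiffer baseline; a smaller `p` makes the fit hug the lower envelope more tightly. Overlaying `raw`, the estimated baseline, and `corrected` is the quickest way to spot under- or over-correction.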

Saturation / clipping

  • Why: detector overload, too high laser power
  • Symptoms: flat-topped peaks, abrupt ceiling
  • Diagnose: inspect raw intensities; histogram of intensities
  • Fix: lower laser power, shorten integration time; re-acquire if clipping is present
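
A quick screen for clipping is to look for runs of consecutive points sitting at the spectrum maximum. The sketch below assumes that a saturated detector reports the same ceiling value repeatedly; the function name `has_flat_top` and the `min_run` default are illustrative choices, not part of FoodSpec.

```python
import numpy as np


def has_flat_top(spectrum, min_run=3, rel_tol=1e-6):
    """Return True if at least min_run consecutive points equal the maximum."""
    y = np.asarray(spectrum, dtype=float)
    at_ceiling = np.isclose(y, y.max(), rtol=rel_tol, atol=0.0)
    longest = run = 0
    for flag in at_ceiling:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return longest >= min_run


clipped = has_flat_top([1.0, 2.0, 5.0, 5.0, 5.0, 2.0, 1.0])
clean = has_flat_top([1.0, 2.0, 5.0, 2.0, 1.0])
```

A flagged spectrum still needs a visual check, because a broad, genuinely flat band can trigger the same condition.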

Wavenumber misalignment

  • Why: calibration drift, temperature, instrument change
  • Symptoms: peak shifts vs references
  • Diagnose: compare known standards; cross-correlation of spectra
  • Fix: recalibrate instrument; apply alignment/cropping consistently; re-acquire if shift unstable
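
The cross-correlation diagnosis can be sketched as follows. This assumes both spectra share the same evenly spaced wavenumber grid and reports the shift in grid points, not in cm⁻¹; `estimate_shift` is a hypothetical helper name and the Gaussian band is synthetic.

```python
import numpy as np


def estimate_shift(reference, spectrum):
    """Lag (in grid points) at which spectrum best matches reference."""
    ref = np.asarray(reference, dtype=float)
    spec = np.asarray(spectrum, dtype=float)
    ref = ref - ref.mean()
    spec = spec - spec.mean()
    xcorr = np.correlate(spec, ref, mode="full")
    # In "full" mode, index len(ref) - 1 corresponds to zero lag.
    return int(np.argmax(xcorr)) - (ref.size - 1)


grid = np.arange(200)
reference = np.exp(-0.5 * ((grid - 80) / 4.0) ** 2)
shifted = np.roll(reference, 3)  # band moved 3 points toward higher indices
lag = estimate_shift(reference, shifted)
```

A positive lag means the bands in the measured spectrum sit at higher indices than in the reference. A lag that changes between runs points to an unstable instrument, which software alignment will not fix.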

Low SNR

  • Why: weak scattering/absorption, poor focus, dirty optics
  • Symptoms: noisy spectra, unstable ratios
  • Diagnose: estimate_snr; high-frequency noise; low reproducibility across replicates
  • Fix: longer integration, more accumulations, better sample prep/optics cleaning; smoothing; re-acquire if SNR too low

B. Dataset & Metadata Problems

Missing or inconsistent metadata

  • Why: incomplete logs, manual entry errors
  • Symptoms: unknown labels, mismatched sample IDs
  • Diagnose: run check_missing_metadata; cross-check unique counts of sample IDs; look for failed joins
  • Fix: repair metadata files; enforce required columns; re-export if gaps persist

Class imbalance

  • Why: rare adulteration/spoilage cases
  • Symptoms: high accuracy, poor minority recall
  • Diagnose: summarize_class_balance; confusion matrix asymmetry; PR curves
  • Fix: resampling/weights, use F1/PR metrics; collect more minority samples
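
As an illustration of the weighting fix, the sketch below counts samples per class and derives inverse-frequency weights using n_samples / (n_classes * count), the same heuristic scikit-learn documents for `class_weight="balanced"`. It uses only the standard library; the labels are made up, and `balanced_class_weights` is a hypothetical helper that stands in for whatever summarize_class_balance reports.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


labels = ["authentic"] * 90 + ["adulterated"] * 10
counts = Counter(labels)
weights = balanced_class_weights(labels)
```

With a 90/10 split the minority class receives a weight of 5.0, so each adulterated sample counts nine times as much as an authentic one during training.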

Mislabeled samples

  • Why: data entry or sample mix-up
  • Symptoms: persistent outliers, impossible confusion errors
  • Diagnose: PCA score outliers; detect_outliers; high leverage points
  • Fix: audit sample IDs; remove/relabel after verification; re-acquire if uncertain
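
The PCA-based audit can be approximated without FoodSpec as in the sketch below. It projects mean-centered spectra onto the leading principal components and flags samples whose score-space distance is far from the median. The function name, the two-component default, and the robust z-score threshold of 5 are assumptions; detect_outliers may use a different rule.

```python
import numpy as np


def pca_distance_flags(X, n_components=2, threshold=5.0):
    """Flag rows whose PCA score distance is a robust outlier (illustrative)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    dist = np.linalg.norm(scores, axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    scale = 1.4826 * mad if mad > 0 else np.finfo(float).eps
    return (dist - med) / scale > threshold, dist


# Synthetic data: 30 similar spectra, with sample 7 offset to mimic a mix-up.
rng = np.random.default_rng(0)
X = np.sin(np.linspace(0.0, 3.0, 200)) + rng.normal(scale=0.05, size=(30, 200))
X[7] += 1.0
flags, dist = pca_distance_flags(X)
```

Flagged samples are candidates for an audit, not automatic removals; verify the sample ID and acquisition log before relabeling.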

C. Preprocessing & Chemometric Problems

Over-smoothing / under-smoothing

  • Symptoms: peak loss or excessive noise
  • Diagnose: compare raw vs smoothed overlays; SNR changes
  • Fix: adjust Savitzky–Golay window/order; avoid smoothing if not needed
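
The effect of the Savitzky–Golay window on a narrow band can be checked with SciPy's `savgol_filter`. The sketch below uses a noise-free synthetic peak, and the window lengths 7 and 51 are arbitrary illustrative values, not recommended settings.

```python
import numpy as np
from scipy.signal import savgol_filter

# Narrow synthetic band of unit height (standard deviation of 3 grid points).
x = np.arange(301)
spectrum = np.exp(-0.5 * ((x - 150) / 3.0) ** 2)

gentle = savgol_filter(spectrum, window_length=7, polyorder=3)
harsh = savgol_filter(spectrum, window_length=51, polyorder=3)

# Fraction of the original peak height that survives each setting.
peak_retained = {"gentle": gentle.max(), "harsh": harsh.max()}
```

When the window is much wider than the band, most of the peak height is lost. Comparing retained peak height on a known band is a simple way to choose the widest window that still preserves it.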

Baseline not removed / over-corrected

  • Symptoms: residual slope or negative artifacts
  • Diagnose: inspect corrected spectra; mean spectrum drift
  • Fix: tune ALS lambda/p; try rubberband/polynomial; crop to the region of interest before computing ratios

Scatter/normalization issues

  • Symptoms: intensity scaling differences remain
  • Diagnose: compare the variance of spectrum norms across samples; re-check after SNV/MSC/vector normalization
  • Fix: use SNV/MSC; ensure consistent application within pipelines (no leakage)
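
For reference, standard normal variate (SNV) scaling centers each spectrum on its own mean and divides by its own standard deviation. The sketch below is a plain NumPy version, not the FoodSpec transformer; the use of the sample standard deviation (`ddof=1`) is an assumption. Because SNV uses only per-spectrum statistics, it cannot leak information between training and test sets.

```python
import numpy as np


def snv(X):
    """Row-wise standard normal variate: (x - mean(x)) / std(x)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std


# Two copies of one spectrum differing only by offset and multiplicative scatter.
base = np.sin(np.linspace(0.0, 6.0, 100)) + 2.0
X = np.vstack([base, 3.0 * base + 1.0])
X_snv = snv(X)
```

After SNV the two rows coincide, which is the behavior to look for when checking whether scatter differences have been removed.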

Peak picking / ratios unstable

  • Symptoms: large variance in peak height/area; missing peaks
  • Diagnose: visualize peak windows; check wavenumber alignment; inspect window tolerance
  • Fix: adjust expected peaks/tolerance; ensure ascending wavenumbers; consider smoothing/cropping first

D. Machine Learning Problems

Overfitting

  • Symptoms: high train accuracy, low test/CV accuracy
  • Diagnose: CV metrics vs train; learning curves
  • Fix: simplify model, regularize, more data, better preprocessing; ensure stratified CV; use compute_classification_metrics

Data leakage

  • Symptoms: unrealistically high CV scores
  • Diagnose: verify that every fitted preprocessing step sits inside the Pipeline; confirm nothing was fitted on the full dataset before splitting; check for label leakage
  • Fix: wrap preprocessing+model in a single pipeline; redo splits; re-evaluate
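
A minimal sketch of the leakage-safe pattern with scikit-learn is shown below. The dataset is synthetic, and `StandardScaler` with `LogisticRegression` stands in for whatever preprocessing and model the workflow actually uses. The key point is that the scaler is refit on the training folds only, inside each CV split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a spectra matrix and class labels.
X, y = make_classification(
    n_samples=120, n_features=40, n_informative=5, random_state=0
)

# Preprocessing and model live in one estimator, so CV refits both per fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
```

The leaky variant to avoid is calling `fit_transform` on the full `X` before cross-validation and then passing the transformed matrix to `cross_val_score`.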

Imbalanced performance

  • Symptoms: minority class misclassified
  • Diagnose: confusion matrix by class; PR curves; class balance summary
  • Fix: class weights, resampling, threshold tuning; report F1_macro, balanced accuracy
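
The example below shows why headline accuracy hides this problem, using scikit-learn metrics on made-up labels: a classifier that always predicts the majority class on a 90/10 split. It is an illustration only and does not use compute_classification_metrics.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 90 majority samples (0), 10 minority samples (1); the model predicts all 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Accuracy comes out at 0.90 while balanced accuracy is 0.50 and macro F1 is about 0.47, which is why the class-aware metrics belong in the report.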

E. Deep Learning Problems

Unstable training / divergence

  • Symptoms: loss oscillations, NaNs
  • Diagnose: monitor loss/metrics per epoch; check learning rate/batch size
  • Fix: lower learning rate, use normalization, add early stopping/dropout; ensure sufficient data

Overfitting with small data

  • Symptoms: train ≫ val performance
  • Diagnose: validation curves; high variance metrics
  • Fix: regularize, data augmentation (if appropriate), prefer classical models

F. Statistical Problems

Violating test assumptions

  • Symptoms: non-normal residuals, heteroscedasticity
  • Diagnose: residual plots, normality tests, Levene's test
  • Fix: transform data, use nonparametric tests (run_kruskal_wallis, run_mannwhitney_u); report effect sizes
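
The checks above can be sketched with SciPy as a stand-in for the FoodSpec wrappers (the signature of run_kruskal_wallis is not assumed here). The three groups are invented band-ratio values, and the epsilon-squared effect size H / (n - 1) is one common choice among several.

```python
from scipy import stats

# Hypothetical band ratios for three sample groups (made-up numbers).
group_a = [0.91, 0.95, 0.98, 1.02, 1.04, 1.07]
group_b = [1.21, 1.25, 1.28, 1.33, 1.36, 1.40]
group_c = [1.61, 1.66, 1.70, 1.74, 1.79, 1.85]

# Levene's test checks equal variances; a small p-value argues against ANOVA.
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

# Kruskal-Wallis compares the groups without assuming normal residuals.
h_stat, kw_p = stats.kruskal(group_a, group_b, group_c)

# Epsilon-squared effect size for Kruskal-Wallis: H / (n_total - 1).
n_total = len(group_a) + len(group_b) + len(group_c)
epsilon_squared = h_stat / (n_total - 1)
```

Reporting `epsilon_squared` next to the p-value shows how strongly the groups separate, not only whether the difference is detectable.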

Multiple comparisons without correction

  • Symptoms: many marginal p-values
  • Diagnose: count of tests; inconsistent significance
  • Fix: use Tukey/FDR; emphasize effect sizes; consolidate hypotheses
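
For the FDR option, the Benjamini–Hochberg adjustment is short enough to write out. The sketch below is a plain NumPy version with an illustrative function name; the p-values are made up.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four tests the adjusted values are 0.02, 0.04, 0.04, and 0.02, so all remain below 0.05; with many marginal p-values the adjustment typically removes most of them.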

G. Visualization Problems

Misleading scales / unlabeled axes

  • Symptoms: hard-to-read plots; ambiguous units
  • Diagnose: review plots; check legends/units
  • Fix: label wavenumber (cm⁻¹), intensity (a.u.), class labels, sample counts; use consistent ranges

Overplotting / clutter

  • Symptoms: unreadable overlays with many samples
  • Diagnose: high-density overlays
  • Fix: show mean ± CI, subset samples, use transparency

H. Reporting & Reproducibility Problems

Missing pipeline/config trace

  • Symptoms: cannot reproduce metrics or plots later
  • Diagnose: absent configs, missing run_metadata.json
  • Fix: use export_run_metadata; record preprocessing, models, metrics, versions
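
If export_run_metadata is not available in a given setup, a comparable record can be written by hand. The sketch below uses only the standard library; the function `write_run_metadata`, the field names, and the example config and metric values are all hypothetical and do not reflect the schema FoodSpec writes to run_metadata.json.

```python
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def write_run_metadata(path, config, metrics,
                       packages=("numpy", "scipy", "scikit-learn")):
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "config": config,
        "metrics": metrics,
    }
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


# Example round trip with placeholder settings and a placeholder metric.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "run_metadata.json"
    write_run_metadata(
        out,
        config={"baseline": {"method": "als", "lam": 1e5}, "cv_folds": 5},
        metrics={"f1_macro": 0.0},
    )
    reloaded = json.loads(out.read_text())
```

Storing the file next to the figures and metrics it describes keeps the preprocessing, model settings, and versions traceable later.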

Ambiguous metrics

  • Symptoms: headline accuracy without class counts or CI
  • Diagnose: incomplete reporting
  • Fix: include per-class metrics, supports, CIs/bootstraps; link to metrics documentation

I. Workflow Design Problems

Unclear question → wrong pipeline

  • Symptoms: metrics irrelevant to decision (e.g., accuracy on rare event)
  • Diagnose: revisit scientific goal; map task to metrics/models
  • Fix: consult Workflow Design Guide; pick appropriate metrics/models

Insufficient replicates / imbalance

  • Symptoms: unstable metrics across splits
  • Diagnose: high variance CV; summarize_class_balance
  • Fix: collect more data; use robust CV; consider effect sizes and uncertainty reporting

J. Operational / User Errors

Wrong file format / path

  • Symptoms: loader failures
  • Diagnose: check detect_format, file extensions; consult instrument file formats guide
  • Fix: convert to supported formats (CSV, JCAMP, SPC/OPUS with extras)

Mismatched wavenumber ordering

  • Symptoms: shape errors, misaligned peaks
  • Diagnose: check that wavenumbers are ascending; validate with validate_spectrum_set
  • Fix: sort wavenumbers; re-export if needed
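
Sorting must reorder the intensity columns together with the wavenumber axis. A minimal NumPy sketch, assuming spectra are stored as rows of a 2-D array (the helper name `sort_ascending` is illustrative):

```python
import numpy as np


def sort_ascending(wavenumbers, spectra):
    """Sort the wavenumber axis ascending and reorder spectra columns to match."""
    wn = np.asarray(wavenumbers, dtype=float)
    X = np.atleast_2d(np.asarray(spectra, dtype=float))
    order = np.argsort(wn)
    return wn[order], X[:, order]


# Descending export, as some instruments produce.
wn_sorted, X_sorted = sort_ascending([1800.0, 1200.0, 600.0], [[3.0, 2.0, 1.0]])
is_ascending = bool(np.all(np.diff(wn_sorted) > 0))
```

Sorting only the axis and not the intensities silently mirrors every spectrum, so both arrays should always pass through the same index.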

FoodSpec Utilities for Diagnosis

  • estimate_snr(spectrum): rough SNR estimate
  • summarize_class_balance(labels): counts per class
  • detect_outliers(X, method="pca_distance"): simple outlier flagging
  • check_missing_metadata(df, required_cols): ensure metadata completeness

When to Re-acquire Data

  • Severe saturation/clipping; unstable baselines consuming dynamic range
  • Wavenumber calibration drift not correctable in software
  • Extremely low SNR that preprocessing cannot salvage
  • Persistent metadata mislabeling that cannot be resolved