
Protocols: Benchmarking Framework

This framework standardizes how preprocessing and modeling pipelines are compared across spectral datasets. It uses the FoodSpec CLI/API to generate the metrics, plots, and run metadata needed for fair comparisons.

Goals

  • Assess robustness and generality of pipelines (authentication, mixture regression, QC).
  • Enable reproducible comparisons (fixed seeds, documented configs).
  • Produce sharable reports (metrics.json, confusion matrices, plots, run_metadata.json).

Workflow overview:

    flowchart LR
      A[Select datasets/tasks] --> B[Define pipelines (preproc+model variants)]
      B --> C[Run benchmarks (CLI/API)]
      C --> D[Collect artifacts (metrics, plots, metadata)]
      D --> E[Analyze & compare (tables/figures)]

How to run (CLI)

Use foodspec protocol-benchmarks --output-dir runs/benchmarks.

  • Internally runs the classification and mixture benchmarks (using public/example loaders or synthetic fallbacks).
  • Outputs: classification_metrics.json, mixture_metrics.json, plots, run_metadata.json.
  • See benchmarks/benchmark_oil_authentication.py for a scriptable example.
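
The JSON artifacts can then be inspected programmatically. A minimal sketch, assuming the output layout listed above (the keys inside each file depend on the run; "seed" is an illustrative field name, not a guaranteed one):

    import json
    from pathlib import Path

    run_dir = Path("runs/benchmarks")

    # Metrics from the classification benchmark.
    clf_metrics = json.loads((run_dir / "classification_metrics.json").read_text())

    # Provenance: versions, seeds, and model parameters recorded with the run.
    metadata = json.loads((run_dir / "run_metadata.json").read_text())

    print(clf_metrics)
    print(metadata.get("seed"))  # illustrative key; check the file for actual field names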

How to run (scripts)

  • benchmarks/benchmark_oil_authentication.py: runs the oil-authentication pipeline on example/public data; saves metrics, a confusion matrix, PCA plots, and run metadata.
  • (Add similar scripts for heating/QC benchmarks as needed; a skeleton sketch follows this list.)
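
A new benchmark script can follow the same pattern: load data, fit with a fixed seed, and persist metrics. A minimal sketch using scikit-learn and synthetic data as stand-ins (the data, output path, and model choice are illustrative, not part of the FoodSpec API):

    import json
    from pathlib import Path

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in data: 120 spectra x 200 wavenumbers, 3 classes.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(120, 200))
    y = rng.integers(0, 3, size=120)

    # Stratified CV with a fixed seed for reproducible comparisons.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                             cv=cv, scoring="f1_macro")

    # Persist per-run metrics so they can be compared across pipelines later.
    out = Path("runs/benchmarks/example_script")
    out.mkdir(parents=True, exist_ok=True)
    (out / "metrics.json").write_text(json.dumps(
        {"macro_f1_mean": float(scores.mean()), "macro_f1_std": float(scores.std())}, indent=2))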

Statistical comparisons across pipelines

  • Use ANOVA/t-tests on performance metrics (e.g., macro F1 across preprocessing variants).
  • For multiple configurations, collect metrics per fold/config, then test differences:
    import pandas as pd
    from foodspec.stats import run_anova

    # Per-fold macro F1 for two pipelines (three folds each).
    df = pd.DataFrame({"f1": [0.80, 0.82, 0.83, 0.75, 0.76, 0.77],
                       "pipeline": ["A", "A", "A", "B", "B", "B"]})
    res = run_anova(df["f1"], df["pipeline"])
    print(res.summary)
    
  • Interpret whether differences are statistically meaningful; report effect sizes when possible (see the effect-size sketch below).
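
For a two-pipeline comparison, a pairwise test and an effect size can also be computed directly; a minimal sketch with SciPy (Welch's t-test plus a hand-rolled Cohen's d; the per-fold values are illustrative):

    import numpy as np
    from scipy import stats

    # Per-fold macro F1 for two pipelines evaluated on the same folds.
    f1_a = np.array([0.80, 0.82, 0.83])
    f1_b = np.array([0.75, 0.76, 0.77])

    # Welch's t-test (does not assume equal variances).
    t_stat, p_value = stats.ttest_ind(f1_a, f1_b, equal_var=False)

    # Cohen's d using the pooled sample standard deviation as the effect size.
    pooled_sd = np.sqrt((f1_a.var(ddof=1) + f1_b.var(ddof=1)) / 2)
    cohens_d = (f1_a.mean() - f1_b.mean()) / pooled_sd

    print(f"t={t_stat:.2f}, p={p_value:.3f}, d={cohens_d:.2f}")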

Designing a benchmark

  • Datasets: Choose representative tasks (oil auth, mixtures, QC). Ensure wavenumbers align.
  • Pipelines: Vary baseline correction (ALS/rubberband), normalization (L2/SNV/MSC), and models (RF/SVM/PLS); see the configuration-grid sketch after this list.
  • Validation: Stratified CV for classification; train/test or CV for regression; fix random seeds.
  • Metrics: Classification (accuracy, macro F1, confusion matrix), regression (RMSE, MAE, R², residuals).
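
One lightweight way to enumerate such a grid of variants is a list of plain configuration dictionaries that can later be logged alongside each run; a minimal sketch (the method names are bookkeeping labels, not FoodSpec identifiers):

    from itertools import product

    # Illustrative grid: baseline correction x normalization x model, with a fixed seed.
    baselines = ["als", "rubberband"]
    normalizations = ["l2", "snv", "msc"]
    models = ["rf", "svm", "pls"]

    pipelines = [
        {"baseline": b, "normalization": n, "model": m, "seed": 42}
        for b, n, m in product(baselines, normalizations, models)
    ]
    print(f"{len(pipelines)} pipeline configurations")  # 2 x 3 x 3 = 18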

Artifacts and metadata

  • Always store: metrics.json, run_metadata.json (Python version, foodspec version, model params, seeds), and plots (confusion matrices, residuals/PCA); a metadata-writing sketch follows this list.
  • Prefer YAML configs for pipeline definitions; log them with runs.
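
A minimal sketch of writing run_metadata.json next to the metrics (field names are illustrative; fill in the installed foodspec version and the actual model parameters for the run):

    import json
    import platform
    import sys
    from pathlib import Path

    run_metadata = {
        "python_version": sys.version,
        "platform": platform.platform(),
        "foodspec_version": None,                # record the installed foodspec version here
        "seed": 42,
        "model_params": {"n_estimators": 200},   # illustrative parameters
    }

    out_dir = Path("runs/benchmarks")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "run_metadata.json").write_text(json.dumps(run_metadata, indent=2))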

Reporting

  • Summarize metrics in tables; show key plots.
  • Discuss robustness: variance across folds/seeds; sensitivity to preprocessing variants (a summary-table sketch follows this list).
  • Provide configs and metadata for reproducibility (see Reproducibility checklist).
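
When per-fold metrics are collected in a long-format table (as in the ANOVA example above), a robustness summary falls out directly; a minimal sketch with pandas:

    import pandas as pd

    # One row per (pipeline, fold); values are illustrative.
    df = pd.DataFrame({
        "pipeline": ["A", "A", "A", "B", "B", "B"],
        "f1": [0.80, 0.82, 0.83, 0.75, 0.76, 0.77],
    })

    # Mean and standard deviation across folds summarize performance and robustness.
    summary = df.groupby("pipeline")["f1"].agg(["mean", "std"])
    print(summary.round(3))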

Further reading