# Protocols: Benchmarking Framework
This framework standardizes how to compare preprocessing and modeling pipelines across spectral datasets. It uses the FoodSpec CLI/API to generate metrics, plots, and run metadata for fair comparisons.
## Goals
- Assess robustness and generality of pipelines (authentication, mixture regression, QC).
- Enable reproducible comparisons (fixed seeds, documented configs).
- Produce sharable reports (`metrics.json`, confusion matrices, plots, `run_metadata.json`).
```mermaid
flowchart LR
    A[Select datasets/tasks] --> B["Define pipelines (preproc + model variants)"]
    B --> C["Run benchmarks (CLI/API)"]
    C --> D["Collect artifacts (metrics, plots, metadata)"]
    D --> E["Analyze & compare (tables/figures)"]
```
## How to run (CLI)
Use `foodspec protocol-benchmarks --output-dir runs/benchmarks`.
- Internally runs classification and mixture benchmarks (using public/example loaders or synthetic fallbacks).
- Outputs: `classification_metrics.json`, `mixture_metrics.json`, plots, `run_metadata.json`.
- See `benchmarks/benchmark_oil_authentication.py` for a scriptable example (a Python driver sketch follows this list).
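For illustration, a minimal Python sketch that drives the documented CLI command via `subprocess` and loads the resulting JSON artifacts; it assumes `foodspec` is on the PATH and that the metrics files listed above land directly in the chosen output directory:

```python
import json
import subprocess
from pathlib import Path

out_dir = Path("runs/benchmarks")

# Invoke the documented CLI entry point (assumes `foodspec` is on the PATH).
subprocess.run(
    ["foodspec", "protocol-benchmarks", "--output-dir", str(out_dir)],
    check=True,
)

# Load the JSON artifacts listed above for downstream comparison tables
# (assumes they are written directly into out_dir).
clf_metrics = json.loads((out_dir / "classification_metrics.json").read_text())
mix_metrics = json.loads((out_dir / "mixture_metrics.json").read_text())
print(clf_metrics)
print(mix_metrics)
```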
## How to run (scripts)
- `benchmarks/benchmark_oil_authentication.py`: runs the oil-authentication pipeline on example/public data; saves metrics, a confusion matrix, PCA plots, and run metadata.
- Add similar scripts for heating/QC benchmarks as needed.
## Statistical comparisons across pipelines
- Use ANOVA/t-tests on performance metrics (e.g., macro F1 across preprocessing variants).
- For multiple configurations, collect metrics per fold/config, then test differences:
```python
import pandas as pd

from foodspec.stats import run_anova

# Macro F1 per fold for two preprocessing pipelines (illustrative values).
df = pd.DataFrame({
    "f1": [0.8, 0.82, 0.83, 0.75, 0.76, 0.77],
    "pipeline": ["A", "A", "A", "B", "B", "B"],
})

res = run_anova(df["f1"], df["pipeline"])
print(res.summary)
```

- Interpret whether differences are statistically meaningful; report effect sizes when possible.
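For illustration, a scikit-learn sketch (placeholder data, hypothetical pipelines "A" and "B") showing how per-fold macro-F1 scores can be collected into the long format that `run_anova` expects above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data: in practice each entry holds the spectra produced by one
# preprocessing pipeline; y holds the class labels.
rng = np.random.default_rng(0)
X_by_pipeline = {"A": rng.normal(size=(60, 200)), "B": rng.normal(size=(60, 200))}
y = np.repeat([0, 1, 2], 20)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # fixed seed
rows = []
for name, X in X_by_pipeline.items():
    scores = cross_val_score(
        RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1_macro"
    )
    rows.extend({"pipeline": name, "f1": float(s)} for s in scores)

df = pd.DataFrame(rows)  # long format, ready for run_anova as above
print(df.groupby("pipeline")["f1"].agg(["mean", "std"]))
```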
## Designing a benchmark
- Datasets: Choose representative tasks (oil auth, mixtures, QC). Ensure wavenumbers align.
- Pipelines: Vary baseline methods (ALS/rubberband), normalization (L2/SNV/MSC), models (RF/SVM/PLS).
- Validation: Stratified CV for classification; train/test or CV for regression; fix random seeds.
- Metrics: Classification (accuracy, macro F1, confusion matrix); regression (RMSE, MAE, R², residuals). A minimal classification sketch follows this list.
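As a concrete illustration of the validation and metrics bullets, a scikit-learn sketch with stratified CV, a fixed seed, macro F1, and a confusion matrix; the data and class labels are placeholders, not real results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Placeholder spectra and hypothetical class labels; substitute the output of
# your preprocessing pipeline and the real labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(90, 300))
y = np.repeat(["EVOO", "sunflower", "blend"], 30)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # fixed seed
y_pred = cross_val_predict(SVC(kernel="rbf", C=1.0), X, y, cv=cv)

print("accuracy:", accuracy_score(y, y_pred))
print("macro F1:", f1_score(y, y_pred, average="macro"))
print(confusion_matrix(y, y_pred, labels=["EVOO", "sunflower", "blend"]))
```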
## Artifacts and metadata
- Always store: `metrics.json`, `run_metadata.json` (Python version, foodspec version, model parameters, seeds), and plots (confusion matrices, residuals/PCA).
- Prefer YAML configs for pipeline definitions and log them with each run (a metadata-logging sketch follows).
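A minimal sketch of the logging side, using the standard library plus PyYAML; the config keys, output path, and the `foodspec` distribution name are illustrative assumptions:

```python
import json
import platform
import sys
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

import yaml  # PyYAML, used here to log the pipeline config with the run

out_dir = Path("runs/benchmarks/oil_auth_rf")  # hypothetical run directory
out_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical pipeline definition; in practice, load the YAML config that
# launched the run instead of hard-coding it.
config = {
    "baseline": "als",
    "normalization": "snv",
    "model": {"name": "random_forest", "n_estimators": 500},
    "seed": 42,
}
(out_dir / "pipeline.yaml").write_text(yaml.safe_dump(config, sort_keys=False))

try:
    foodspec_version = version("foodspec")  # assumes the distribution name "foodspec"
except PackageNotFoundError:
    foodspec_version = "unknown"

metadata = {
    "python_version": platform.python_version(),
    "platform": platform.platform(),
    "foodspec_version": foodspec_version,
    "argv": sys.argv,
    "seed": config["seed"],
    "model_params": config["model"],
}
(out_dir / "run_metadata.json").write_text(json.dumps(metadata, indent=2))
```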
## Reporting
- Summarize metrics in tables; show key plots.
- Discuss robustness: variance across folds/seeds; sensitivity to preprocessing variants (see the aggregation sketch at the end of this section).
- Provide configs and metadata for reproducibility (see Reproducibility checklist).
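To make the robustness discussion concrete, a small pandas sketch that aggregates per-fold scores into a mean/standard-deviation summary table; the pipeline names and numbers are illustrative placeholders, not benchmark results:

```python
import pandas as pd

# Per-fold metrics gathered from benchmark runs (illustrative placeholder values).
records = [
    {"pipeline": "ALS+SNV+RF", "fold": i, "macro_f1": f}
    for i, f in enumerate([0.81, 0.84, 0.80, 0.83, 0.82])
] + [
    {"pipeline": "rubberband+MSC+SVM", "fold": i, "macro_f1": f}
    for i, f in enumerate([0.76, 0.78, 0.74, 0.77, 0.75])
]
df = pd.DataFrame(records)

# Mean and spread across folds highlight robustness differences between pipelines.
summary = df.groupby("pipeline")["macro_f1"].agg(["mean", "std"]).round(3)
print(summary)
```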