User Guide: Data Formats & HDF5¶
Purpose: Understand how to format your data (CSV vs HDF5) and choose the right format for reproducibility.
Audience: Lab managers preparing datasets; researchers building reproducible pipelines.
Time: 15–20 minutes to choose a format; 10 minutes per dataset to validate.
Prerequisites: Familiarity with CSV files or HDF5 basics.
This page explains supported input formats, the HDF5 layout, and vendor IO expectations.
Why it matters: Choosing the right format affects reproducibility (FAIR metadata), harmonization, and ease of loading via CLI.
CSV vs HDF5: Quick Comparison¶
| Feature | CSV | HDF5 |
|---|---|---|
| Learning curve | Easy | Medium |
| File size | Large | Compact |
| Metadata | Limited | Full (FAIR) |
| Speed (1000+ samples) | Slow | Fast |
| Preprocessing history | Not stored | Stored |
| Multi-instrument data | Awkward | Natural |
| Best for | Getting started | Production pipelines |
Example: Convert CSV → HDF5
from foodspec.io import load_csv, to_hdf5

# Load the wide-format CSV
ds = load_csv("oils.csv", wavenumber_col="wavenumber")

# Convert to HDF5, attaching instrument metadata and the preprocessing log
to_hdf5(
    ds,
    "oils_processed.h5",
    instrument_metadata={"laser": 785, "grating": "1200/mm"},
    preprocessing_log=["baseline_als(lam=1e6)", "normalize_snv()"],
)
Choosing Your Format¶
Use CSV if:
- Dataset < 500 samples
- Single instrument, single batch
- Quick exploratory analysis
- Sharing via email/GitHub

Use HDF5 if:
- Dataset > 500 samples
- Multiple instruments or batches
- Need preprocessing/protocol history
- Publishing reproducible research
- Integrating 3D hyperspectral cubes
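The checklist above can be read as a simple decision rule. The sketch below is illustrative only: `DatasetProfile` and `suggest_format` are hypothetical names, not part of FoodSpec, and treating a dataset of exactly 500 samples as CSV-sized is an assumption, since the checklist does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical summary of a dataset; not a FoodSpec class."""
    n_samples: int
    n_instruments: int = 1
    n_batches: int = 1
    needs_history: bool = False    # preprocessing/protocol history must be kept
    for_publication: bool = False  # reproducible research output
    has_hsi_cubes: bool = False    # 3D hyperspectral cubes


def suggest_format(profile: DatasetProfile) -> str:
    """Return "hdf5" if any HDF5 criterion from the checklist applies, else "csv"."""
    wants_hdf5 = (
        profile.n_samples > 500  # assumption: exactly 500 still counts as small
        or profile.n_instruments > 1
        or profile.n_batches > 1
        or profile.needs_history
        or profile.for_publication
        or profile.has_hsi_cubes
    )
    return "hdf5" if wants_hdf5 else "csv"


print(suggest_format(DatasetProfile(n_samples=120)))               # csv
print(suggest_format(DatasetProfile(n_samples=120, n_batches=3)))  # hdf5
```

Any single HDF5 criterion is enough to tip the choice, which matches the intent that CSV is for small, single-instrument, exploratory work.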
CSV vs HDF5¶
- CSV: Wide-format with wavenumber columns and metadata columns (oil_type, matrix, heating_stage, replicate, batch, etc.). Easiest to start with.
- HDF5: Preferred for FAIR storage. FoodSpec uses a NeXus-inspired layout with explicit groups and units.
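To make the wide CSV layout concrete, here is a minimal standard-library sketch. It assumes one row per sample, with numeric column headers holding wavenumbers and all other columns holding metadata; the sample values and the helper `split_wide_csv` are made up for illustration and are not FoodSpec's loader.

```python
import csv
import io

# Made-up example: three metadata columns followed by three wavenumber columns.
SAMPLE = """oil_type,batch,replicate,1000.0,1002.0,1004.0
olive,A,1,0.12,0.15,0.11
sunflower,A,1,0.22,0.19,0.25
"""


def is_number(text):
    """True if a column header parses as a float (treated as a wavenumber)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_wide_csv(handle):
    """Split a wide CSV into a wavenumber axis, a spectra matrix, and metadata rows."""
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    wn_cols = [c for c in fields if is_number(c)]
    meta_cols = [c for c in fields if not is_number(c)]
    wavenumbers = [float(c) for c in wn_cols]
    spectra, metadata = [], []
    for row in reader:
        spectra.append([float(row[c]) for c in wn_cols])
        metadata.append({c: row[c] for c in meta_cols})
    return wavenumbers, spectra, metadata


wavenumbers, spectra, metadata = split_wide_csv(io.StringIO(SAMPLE))
print(wavenumbers)   # [1000.0, 1002.0, 1004.0]
print(metadata[0])   # {'oil_type': 'olive', 'batch': 'A', 'replicate': '1'}
```

Note that metadata values come back as strings; a real loader would need to decide which columns to cast.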
HDF5 layout (simplified)¶
- /spectra/wn_axis: wavenumber axis (units attr: cm^-1)
- /spectra/intensities: spectra matrix (n_samples × n_wavenumbers)
- /spectra/sample_table: annotations (oil_type, matrix, heating_stage, batch, replicate, instrument, etc.)
- /instrument/: laser_wavelength_nm, grating, objective, calibration parameters
- /preprocessing/: list of preprocessing steps with parameters
- /protocol/: protocol name/version, step definitions, validation strategy
- Attributes: foodspec_hdf5_schema_version for compatibility

Notes:
- HDF5 retains preprocessing/protocol history, visible in metadata.
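The layout can be sketched directly with `h5py` and NumPy (both assumed to be installed). This is not FoodSpec's writer: storing `sample_table` as a group with one dataset per column, storing instrument and protocol fields as attributes, JSON-encoding the preprocessing steps, and the placeholder version strings are all assumptions made for illustration.

```python
import json
import os
import tempfile

import h5py
import numpy as np


def write_example(path):
    """Write a tiny file that mimics the simplified layout described above."""
    wn = np.linspace(1000.0, 1008.0, 5)    # 5 wavenumber points
    intensities = np.zeros((3, wn.size))   # 3 samples x 5 wavenumbers
    with h5py.File(path, "w") as f:
        f.attrs["foodspec_hdf5_schema_version"] = "0.0"  # placeholder value

        spectra = f.create_group("spectra")
        axis = spectra.create_dataset("wn_axis", data=wn)
        axis.attrs["units"] = "cm^-1"
        spectra.create_dataset("intensities", data=intensities)

        # Assumption: one dataset per annotation column.
        table = spectra.create_group("sample_table")
        table.create_dataset(
            "oil_type", data=np.array([b"olive", b"olive", b"sunflower"])
        )
        table.create_dataset("replicate", data=np.array([1, 2, 1]))

        instrument = f.create_group("instrument")
        instrument.attrs["laser_wavelength_nm"] = 785
        instrument.attrs["grating"] = "1200/mm"

        # Assumption: steps serialized as a JSON string attribute.
        preprocessing = f.create_group("preprocessing")
        preprocessing.attrs["steps"] = json.dumps(
            ["baseline_als(lam=1e6)", "normalize_snv()"]
        )

        protocol = f.create_group("protocol")
        protocol.attrs["name"] = "example_protocol"  # placeholder
        protocol.attrs["version"] = "0.0"            # placeholder


def describe(path):
    """Read back the top-level groups, matrix shape, and axis units."""
    with h5py.File(path, "r") as f:
        return {
            "groups": sorted(f.keys()),
            "shape": f["spectra/intensities"].shape,
            "units": f["spectra/wn_axis"].attrs["units"],
        }


demo_path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")
write_example(demo_path)
info = describe(demo_path)
print(info)
```

Keeping units as an attribute on the axis dataset means a reader can check them without loading the spectra matrix.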
Vendor IO¶
- FoodSpec supports generic CSV/HDF5 and provides vendor loader stubs (OPUS/WiRE/ENVI). If binary parsing is incomplete, export to CSV or HDF5 from your instrument software.
- Plugins can register additional vendor loaders; see registry_and_plugins.md.
- Error messages will hint at missing blocks/headers if a vendor file is malformed; follow the suggested export path (e.g., "export as ASCII/CSV").
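The general idea of a loader registry with a helpful fallback message can be sketched as follows. This is a generic pattern, not FoodSpec's plugin API: `register_loader`, `load_any`, `VendorFormatError`, and the `.xyz` suffix are all hypothetical; registry_and_plugins.md describes the real mechanism.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry mapping file suffixes to loader callables.
LOADERS: Dict[str, Callable[[Path], dict]] = {}


class VendorFormatError(ValueError):
    """Raised when no loader can handle a file."""


def register_loader(suffix: str):
    """Decorator that registers a loader function for a file suffix."""
    def decorator(func):
        LOADERS[suffix.lower()] = func
        return func
    return decorator


@register_loader(".csv")
def load_generic_csv(path: Path) -> dict:
    # A real loader would parse the file; this stub only records what it saw.
    return {"source": str(path), "format": "csv"}


def load_any(path) -> dict:
    """Dispatch to the registered loader, or suggest an export path."""
    path = Path(path)
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise VendorFormatError(
            f"No loader registered for '{path.suffix}'. "
            "Export as ASCII/CSV or HDF5 from the instrument software and retry."
        )
    return loader(path)


print(load_any("oils.csv")["format"])  # csv
try:
    load_any("scan.xyz")
except VendorFormatError as exc:
    print(exc)
```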
Choosing a format¶
- Use CSV for quick starts and small datasets.
- Use HDF5 for multi-instrument/batch projects, HSI cubes, and when you want provenance and harmonization metadata preserved.
- For HSI, store cubes and segmentation outputs in HDF5; label maps and ROI tables are also written to run bundles.
Mini-workflow¶
1) Export data as CSV (wide) or HDF5 using FoodSpec save functions.
2) Load via CLI (--input my.h5).
3) Run a protocol; verify metadata.json reflects format, preprocessing, harmonization.
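Step 3 can be partly automated with a small check on `metadata.json`. The sketch below uses only the standard library; the key names `format`, `preprocessing`, and `harmonization` are assumptions taken from the wording of the step, so adjust them to the keys your run bundle actually contains.

```python
import json
import os
import tempfile

# Assumed key names; the real metadata.json schema may differ.
REQUIRED_KEYS = ("format", "preprocessing", "harmonization")


def missing_metadata_keys(path, required=REQUIRED_KEYS):
    """Return the required top-level keys that are absent from a metadata file."""
    with open(path, encoding="utf-8") as fh:
        meta = json.load(fh)
    return [key for key in required if key not in meta]


# Demo with a made-up metadata file that lacks harmonization info.
demo_path = os.path.join(tempfile.mkdtemp(), "metadata.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump({"format": "hdf5", "preprocessing": ["normalize_snv()"]}, fh)

print(missing_metadata_keys(demo_path))  # ['harmonization']
```

An empty list means every expected section is present; it does not confirm that the values themselves are correct.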
Next Steps¶
- Data loading: Loading spectra from files
- Preprocessing: Baseline correction and smoothing
- Vendor integration: Vendor I/O guide
- Reference: Data format schema
See also: cookbook_preprocessing.md and registry_and_plugins.md for vendor plugins.