Foundations: Data Structures and FAIR Principles

This chapter explains how FoodSpec represents spectral data and how to keep analyses FAIR (Findable, Accessible, Interoperable, Reusable). It introduces the core data models and storage formats used throughout the book.

1. Core data models

  • FoodSpectrumSet: 2D array x (n_samples × n_wavenumbers), shared wavenumbers (1D, ascending cm⁻¹), metadata (pandas DataFrame), modality tag ("raman", "ftir", "nir").
  • HyperSpectralCube: 3D array (height, width, n_wavenumbers) with optional flattening to a FoodSpectrumSet for pixel-wise analysis.
  • Validation: The wavenumber axis must be monotonic, the number of columns in x must match the length of wavenumbers, and the metadata must have one row per sample (n_samples); see validation utilities and foodspec.validation.
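
The validation rules above can be sketched with plain NumPy. This is an illustration only, not the code in foodspec.validation: the function name check_spectrum_set is hypothetical, and treating "ascending" as strictly ascending is an assumption.

```python
import numpy as np


def check_spectrum_set(x, wavenumbers, n_metadata_rows):
    """Return a list of problems; an empty list means the inputs look consistent.

    Hypothetical helper for illustration, not part of the FoodSpec API.
    """
    x = np.asarray(x, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    problems = []

    if x.ndim != 2:
        problems.append(f"x must be 2D, got {x.ndim}D")
    if wavenumbers.ndim != 1:
        problems.append(f"wavenumbers must be 1D, got {wavenumbers.ndim}D")
    if x.ndim == 2 and wavenumbers.ndim == 1 and x.shape[1] != wavenumbers.shape[0]:
        problems.append(
            f"x has {x.shape[1]} columns but there are {wavenumbers.shape[0]} wavenumbers"
        )
    # Assumption: "ascending" means strictly ascending, so every step is positive.
    if wavenumbers.ndim == 1 and wavenumbers.size > 1 and not np.all(np.diff(wavenumbers) > 0):
        problems.append("wavenumbers are not strictly ascending")
    if x.ndim == 2 and x.shape[0] != n_metadata_rows:
        problems.append(
            f"x has {x.shape[0]} samples but metadata has {n_metadata_rows} rows"
        )
    return problems


good = check_spectrum_set(np.zeros((3, 4)), [400.0, 500.0, 600.0, 700.0], 3)
bad = check_spectrum_set(np.zeros((3, 4)), [700.0, 600.0, 500.0, 400.0], 2)
print(good)
print(bad)
```

Returning a list of problems rather than raising on the first one lets a single pass report every inconsistency in a file.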

2. Storage formats

  • HDF5 libraries: Preferred for reproducibility; store x, wavenumbers, metadata_json, modality, provenance (software version, timestamps). See Libraries.
  • CSV (wide/long): Common export from instruments; convert to HDF5 via the CSV → HDF5 pipeline.
  • Provenance: Keep config files, run metadata, model registry entries; see Reproducibility checklist.
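
As a rough sketch of what a provenance record might contain, the snippet below assembles software version, timestamp, a checksum, and the run configuration into JSON using only the standard library. The function name build_provenance, the field names, and the example config values are all assumptions made for illustration; FoodSpec's actual provenance layout may differ.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def build_provenance(data_bytes, software_version, config):
    """Assemble a JSON-serializable provenance record (hypothetical layout)."""
    return {
        "software_version": software_version,
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data_bytes).hexdigest(),  # ties the record to the data
        "config": config,
    }


record = build_provenance(
    data_bytes=b"placeholder for the raw file contents",
    software_version="0.0.0",  # placeholder, not a real FoodSpec version
    config={"baseline": "als", "seed": 42},  # illustrative preprocessing choices
)
provenance_json = json.dumps(record, sort_keys=True, indent=2)
print(provenance_json)
```

A string like provenance_json could be stored alongside the spectra (for example as an attribute in the HDF5 file) so the record travels with the data.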

3. FAIR principles applied

  • Findable: Clear file names, metadata columns (sample_id, label columns like oil_type), DOI/URLs for public datasets.
  • Accessible: Use open formats (CSV, HDF5) and documented folder structures.
  • Interoperable: Monotonic wavenumbers in cm⁻¹, standard column names, modality tags; avoid vendor lock-in.
  • Reusable: Record preprocessing choices, model configs, seeds, software versions; archive reports and model artifacts.
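
Some instruments export spectra with the wavenumber axis in descending order. To illustrate the interoperability point about monotonic axes, the sketch below reorders the axis and the matching columns of x together. The helper name to_ascending is hypothetical and not a FoodSpec function.

```python
import numpy as np


def to_ascending(x, wavenumbers):
    """Return (x, wavenumbers) reordered so the axis ascends.

    Columns of x are permuted with the same order, so each intensity
    stays attached to its wavenumber. Hypothetical helper for illustration.
    """
    x = np.asarray(x)
    wavenumbers = np.asarray(wavenumbers)
    order = np.argsort(wavenumbers, kind="stable")
    return x[:, order], wavenumbers[order]


wn_desc = np.array([1800.0, 1700.0, 1600.0])  # descending, in cm^-1
x_desc = np.array([[0.3, 0.2, 0.1],
                   [0.6, 0.5, 0.4]])
x_asc, wn_asc = to_ascending(x_desc, wn_desc)
print(wn_asc)
print(x_asc)
```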

4. When to use which structure

  • Batch analyses: FoodSpectrumSet for single-spot spectra; choose HDF5 libraries for storage and sharing.
  • Imaging: HyperSpectralCube for spatial maps; flatten to FoodSpectrumSet for pixel-wise ML, then reshape labels/maps.
  • Library search/QC: Maintain curated HDF5 libraries with consistent metadata; use fingerprint similarity or one-class models.
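
The flatten-then-reshape workflow for imaging can be sketched with NumPy alone. This does not use the HyperSpectralCube API; the toy cube and the threshold that stands in for a per-pixel model are assumptions made for illustration.

```python
import numpy as np

height, width, n_wavenumbers = 2, 3, 4
# Toy cube with shape (height, width, n_wavenumbers)
cube = np.arange(height * width * n_wavenumbers, dtype=float).reshape(
    height, width, n_wavenumbers
)

# Flatten: each pixel becomes one row, giving (n_pixels, n_wavenumbers)
pixels = cube.reshape(height * width, n_wavenumbers)

# Stand-in for a per-pixel model: label pixels brighter than the global mean
labels = (pixels.mean(axis=1) > pixels.mean()).astype(int)

# Reshape the per-pixel labels back into a spatial map
label_map = labels.reshape(height, width)
print(label_map)
```

Because both reshapes use the same row-major order, pixel row i * width + j in the flattened array corresponds to position (i, j) in the map.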

5. Example (high level)

from foodspec.core.dataset import FoodSpectrumSet
from foodspec.data.libraries import create_library, load_library

# Build in memory (spectra_array, wavenumber_array, and metadata_df are assumed to exist already)
fs = FoodSpectrumSet(
    x=spectra_array,               # (n_samples, n_wavenumbers)
    wavenumbers=wavenumber_array,  # (n_wavenumbers,)
    metadata=metadata_df,          # pandas DataFrame, one row per sample
    modality="raman",
)

# Persist to HDF5
create_library(
    path="libraries/oils.h5",
    spectra=fs.x,
    wavenumbers=fs.wavenumbers,
    metadata=fs.metadata,
    modality=fs.modality,
)
fs_loaded = load_library("libraries/oils.h5")

Summary

  • FoodSpectrumSet and HyperSpectralCube are the backbone of analyses.
  • Use HDF5 with provenance for FAIR, reproducible storage.
  • Standardize axes, metadata, and modality to stay interoperable.

When Results Cannot Be Trusted

⚠️ Red flags for data structures and FAIR compliance:

  1. Metadata missing or incomplete (files named sample1.csv with no sample ID, date, instrument info)
     • Cannot reproduce analysis; provenance lost
     • Data not reusable or findable
     • Fix: Include sample ID, date, instrument, operator, prep protocol in metadata; use structured formats (HDF5, CSV with metadata header)

  2. File format changes mid-project (CSV for batch 1, Excel for batch 2)
     • Inconsistent parsing; data cleaning errors
     • Analysis scripts break
     • Fix: Freeze data format before project start; convert all to common format (HDF5, CSV) with validation

  3. No versioning or changelog (data files overwritten; no record of changes)
     • Cannot trace data evolution; reproducibility lost
     • Errors undetectable
     • Fix: Use version control (Git); document changes in CHANGELOG; never overwrite raw data

  4. Raw data not archived (only processed data saved)
     • Cannot reprocess if preprocessing errors found
     • Can't apply new methods
     • Fix: Archive raw data separately; document processing steps; keep processing scripts with data

  5. Data not FAIR (stored locally, not shared, no DOI, no license)
     • Not findable, accessible, interoperable, or reusable
     • Scientific reproducibility impossible
     • Fix: Deposit data in public repository (Zenodo, Figshare, domain-specific); assign DOI; add CC-BY or CC0 license

  6. Data structure incompatible with FoodSpec (wrong column names, missing wavenumber axis)
     • FoodSpec expects specific data structure
     • Parsing failures or silent errors
     • Fix: Follow FoodSpec data spec; use validation tools; test import before analysis
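
One lightweight way to notice overwritten raw data is a checksum manifest stored next to the archive. The sketch below uses only the standard library; the helper names (sha256_of, build_manifest, changed_files) are hypothetical, and the temporary directory stands in for a real raw-data folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large raw files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Map each file's path (relative to raw_dir) to its SHA-256 digest."""
    raw_dir = Path(raw_dir)
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files that were modified, removed, or added between two manifests."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "sample1.csv").write_text("wavenumber,intensity\n400,0.1\n")
    before = build_manifest(raw)
    # Simulate an accidental overwrite of a raw file
    (raw / "sample1.csv").write_text("wavenumber,intensity\n400,0.2\n")
    after = build_manifest(raw)
    modified = changed_files(before, after)
    print(modified)
```

Committing the manifest to version control alongside the CHANGELOG gives a record of when a raw file changed, even if the data files themselves are too large for Git.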