Core API

Core data structures and workflows for spectral analysis.

The foodspec.core module provides foundational classes for working with spectral data, including dataset containers, single spectrum operations, and result packaging.

Main Classes

FoodSpectrumSet

Primary container for spectral datasets: a collection of spectra with aligned metadata and axis information.

Parameters:

- x (ndarray): Array of shape (n_samples, n_wavenumbers) containing spectral intensities.
- wavenumbers (ndarray): Array of shape (n_wavenumbers,) with the spectral axis values.
- metadata (DataFrame): One row per sample storing labels and acquisition info.
- modality (str): Spectroscopy modality identifier: "raman", "ftir", or "nir".

batch_ids property

batch_ids

Return batch identifier column if configured.

Returns:

- Optional[Series]: Batch/run identifiers, or None if batch_col is not set or missing.

groups property

groups

Return grouping column if configured.

Returns:

- Optional[Series]: Group identifiers (e.g., folds), or None if group_col is not set or missing.

labels property

labels

Return label column if configured.

Returns:

- Optional[Series]: Label values aligned to samples, or None if label_col is not set or missing.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": ["A", "B"]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=meta)
>>> ds.labels.tolist()
['A', 'B']

__getitem__

__getitem__(index)

Return a subset by integer position.

Parameters:

- index (int | slice): Zero-based row index or slice over samples. Required.

Returns:

- FoodSpectrumSet: New dataset containing x/metadata rows selected by index; wavenumbers are copied.

Raises:

- IndexError: If an integer index is out of range.
- TypeError: If index is not an int or slice.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.arange(6).reshape(3, 2), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> ds_sub = ds[1:]
>>> ds_sub.x.shape
(2, 2)

__len__

__len__()

Number of spectra in the set.

Returns:

- int: Number of samples (axis 0 of x).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(
...     x=np.ones((3, 5)),
...     wavenumbers=np.arange(5),
...     metadata=pd.DataFrame({"label": [0, 1, 0]}),
... )
>>> len(ds)
3

add_metadata_column

add_metadata_column(name, values, *, overwrite=False)

Attach a metadata column aligned with spectra.

Parameters:

- name (str): Column name to add to metadata. Required.
- values (Sequence[Any]): Iterable of length n_samples containing values aligned to rows. Required.
- overwrite (bool): If True, replace an existing column of the same name; otherwise raise. Default: False.

Returns:

- FoodSpectrumSet: New dataset with the added/overwritten column.

Raises:

- ValueError: If lengths mismatch, or the column exists and overwrite is False.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> ds2 = ds.add_metadata_column("batch", [1, 2])
>>> ds2.metadata["batch"].tolist()
[1, 2]

apply

apply(func, *, inplace=False)

Apply a vectorized operation to all spectra.

Parameters:

- func (Callable[[ndarray], ndarray]): Function that accepts x (shape (n_samples, n_wavenumbers)) and returns an array of the same shape. Required.
- inplace (bool): If True, modify x in place and return self; if False, return a new dataset copy. Default: False.

Returns:

- FoodSpectrumSet: Self (if inplace=True) or a new dataset with transformed spectra.

Raises:

- ValueError: If the returned array shape differs from x.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> ds2 = ds.apply(lambda arr: arr * 2)
>>> float(ds2.x.mean())
2.0

concat classmethod

concat(datasets)

Concatenate multiple datasets with shared wavenumber grids.

Parameters:

- datasets (Sequence[FoodSpectrumSet]): Non-empty iterable of datasets with identical wavenumbers. Required.

Returns:

- FoodSpectrumSet: Combined dataset with stacked x rows and concatenated metadata; annotation column names are copied from the first dataset.

Raises:

- ValueError: If datasets is empty or wavenumber grids differ.

Examples:

>>> import numpy as np, pandas as pd
>>> ds1 = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame({"label": [0]}))
>>> ds2 = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame({"label": [1, 1]}))
>>> merged = FoodSpectrumSet.concat([ds1, ds2])
>>> merged.x.shape
(3, 2)

copy

copy(deep=True)

Return a copy of the dataset.

Parameters:

- deep (bool): If True, copy arrays/metadata; if False, reuse references (changes mutate the original data). Default: True.

Returns:

- FoodSpectrumSet: Copy with identical content.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> shallow = ds.copy(deep=False)
>>> shallow.x is ds.x
True

from_hdf5 classmethod

from_hdf5(path, key='foodspec')

Load dataset from HDF5 created by to_hdf5.

Parameters:

- path (str | Path): HDF5 file path produced by to_hdf5. Required.
- key (str): Prefix used when saving. Default: 'foodspec'.

Returns:

- FoodSpectrumSet: Dataset reconstructed from stored arrays and metadata.

Raises:

- FileNotFoundError: If path does not exist.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_hdf5(tmp.name)
>>> _ = FoodSpectrumSet.from_hdf5(tmp.name)

from_parquet classmethod

from_parquet(path)

Load dataset from Parquet created by to_parquet.

Parameters:

- path (str | Path): Parquet file written by to_parquet. Required.

Returns:

- FoodSpectrumSet: Dataset reconstructed from wide format.

Raises:

- FileNotFoundError: If path does not exist.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_parquet(tmp.name)
>>> _ = FoodSpectrumSet.from_parquet(tmp.name)

offset

offset(value, *, inplace=False)

Add a constant offset to spectral intensities.

Parameters:

- value (float): Constant added to every element of x. Required.
- inplace (bool): If True, mutate x and return self; otherwise return a new dataset. Default: False.

Returns:

- FoodSpectrumSet: Offset dataset (self if inplace=True).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.zeros((1, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> ds.offset(5).x.tolist()
[[5.0, 5.0, 5.0]]

scale

scale(factor, *, inplace=False)

Scale spectral intensities by a factor.

Parameters:

- factor (float): Multiplicative scalar applied to all intensities. Required.
- inplace (bool): If True, mutate x and return self; otherwise return a new dataset. Default: False.

Returns:

- FoodSpectrumSet: Scaled dataset (self if inplace=True).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> float(ds.scale(10).x.mean())
10.0

select_wavenumber_range

select_wavenumber_range(min_wn, max_wn)

Return spectra restricted to a wavenumber window.

Parameters:

- min_wn (float): Inclusive lower bound of wavenumber window. Required.
- max_wn (float): Inclusive upper bound of wavenumber window. Required.

Returns:

- FoodSpectrumSet: Dataset containing columns where min_wn <= wavenumbers <= max_wn; metadata unchanged.

Raises:

- ValueError: If bounds are inverted or no wavenumbers fall inside the interval.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 4)), wavenumbers=np.array([500., 750., 1000., 1250.]), metadata=pd.DataFrame())
>>> ds_win = ds.select_wavenumber_range(700, 1100)
>>> ds_win.wavenumbers.tolist()
[750.0, 1000.0]

subset

subset(by=None, indices=None)

Subset by metadata filters and/or explicit indices.

Parameters:

- by (dict[str, Any] | None): Column → value filters applied to metadata. If a value is sequence-like, membership (isin) is used; otherwise equality is used. Default: None.
- indices (Sequence[int] | None): Explicit zero-based indices to retain. If both by and indices are provided, their intersection (in the order of indices) is returned. Default: None.

Returns:

- FoodSpectrumSet: New dataset with selected rows; wavenumbers are preserved and metadata reindexed.

Raises:

- ValueError: If requested metadata columns are missing, indices are out of range, or indices are not 1D.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1, 0], "split": ["train", "test", "train"]})
>>> ds = FoodSpectrumSet(x=np.ones((3, 4)), wavenumbers=np.arange(4), metadata=meta)
>>> ds_train = ds.subset(by={"split": "train"})
>>> len(ds_train)
2

to_X_y

to_X_y(target_col)

Return (X, y) for a target column in metadata.

Parameters:

- target_col (str): Metadata column name to use as labels. Required.

Returns:

- tuple[np.ndarray, np.ndarray]: X with shape (n_samples, n_wavenumbers), y with shape (n_samples,).

Raises:

- ValueError: If target_col is missing from metadata.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 4)), wavenumbers=np.arange(4), metadata=meta)
>>> X, y = ds.to_X_y("label")
>>> X.shape, y.tolist()
((2, 4), [0, 1])

to_hdf5

to_hdf5(path, key='foodspec', mode='w', complevel=4)

Persist dataset to HDF5 (lazy-friendly storage).

Parameters:

- path (str | Path): Destination file path. Parent directories must exist. Required.
- key (str): Prefix for the HDF5 groups created (<key>_x, <key>_wn, <key>_meta, <key>_info). Default: 'foodspec'.
- mode (str): HDF5 store mode, e.g., "w" or "a". Default: 'w'.
- complevel (int): Compression level for zlib (0-9). Default: 4.

Returns:

- Path: Path to the written HDF5 file.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_hdf5(tmp.name)

to_parquet

to_parquet(path)

Persist dataset to Parquet using wide layout.

Parameters:

- path (str | Path): Destination parquet path. Required.

Returns:

- Path: Path to the written parquet file.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_parquet(tmp.name)

to_wide_dataframe

to_wide_dataframe()

Convert to a wide DataFrame.

Returns:

- pandas.DataFrame: Metadata columns followed by intensity columns named int_<wavenumber> (floats preserved). Shape: (n_samples, n_metadata + n_wavenumbers).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.array([1000., 1001., 1002.]), metadata=pd.DataFrame({"label": [0,1]}))
>>> df = ds.to_wide_dataframe()
>>> list(df.columns)[:2]
['label', 'int_1000.0']

train_test_split

train_test_split(
    target_col,
    test_size=0.3,
    stratify=True,
    random_state=None,
)

Split into train/test FoodSpectrumSets.

Parameters:

- target_col (str): Column in metadata used as labels for stratification and copied into splits. Required.
- test_size (float): Proportion of samples in the test split. Default: 0.3.
- stratify (bool): If True, stratify by target_col. Default: True.
- random_state (int | None): Seed for reproducibility. Default: None.

Returns:

- tuple[FoodSpectrumSet, FoodSpectrumSet]: (train_ds, test_ds) sharing the original wavenumber grid; metadata is reindexed.

Raises:

- ValueError: If target_col does not exist in metadata.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1, 0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((4, 3)), wavenumbers=np.arange(3), metadata=meta)
>>> train, test = ds.train_test_split("label", test_size=0.5, random_state=0)
>>> len(train), len(test)
(2, 2)

validate

validate()

Validate array shapes, wavenumber axis, metadata length, and modality.

Raises:

- ValueError: If shapes mismatch, wavenumbers are non-monotonic, there are too few points (<3), metadata length mismatches samples, modality is invalid, or configured annotation columns are missing.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.array([1., 2., 3.]), metadata=pd.DataFrame())
>>> ds.validate()  # does not raise

with_annotations

with_annotations(
    *, label_col=None, group_col=None, batch_col=None
)

Return a copy with updated label/group/batch annotations.

Parameters:

- label_col (str | None): Name of label column in metadata. Default: None.
- group_col (str | None): Name of grouping column (e.g., folds). Default: None.
- batch_col (str | None): Name of batch identifier column. Default: None.

Returns:

- FoodSpectrumSet: Copy sharing data/wavenumbers but with annotation column names updated (metadata deep-copied).

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"y": [0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=meta)
>>> ds2 = ds.with_annotations(label_col="y")
>>> ds2.label_col
'y'

Spectrum

Single spectrum data model with axis, intensity, units, kind, and metadata. Represents a single spectroscopic measurement with validation and provenance tracking.

Parameters:

- x (ndarray): X-axis (wavenumber/wavelength), shape (n_points,). Required.
- y (ndarray): Intensity values, shape (n_points,). Required.
- kind (Literal['raman', 'ftir', 'nir']): Spectroscopy modality. Required.
- x_unit (Literal['cm-1', 'nm', 'um', '1/cm']): Axis unit. Default: 'cm-1'.
- metadata (dict): Optional metadata (sample_id, instrument, etc.). Default: dict().

Attributes:

- x (ndarray): X-axis data.
- y (ndarray): Y-axis data.
- kind (str): Modality.
- x_unit (str): Unit of x-axis.
- metadata (dict): Validated metadata.
- config_hash (str): Hash of metadata for reproducibility tracking.

config_hash property

config_hash

Hash of metadata for reproducibility tracking.

Returns:

- str: First 8 hex chars of SHA256 over metadata JSON.
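
The first-8-hex-of-SHA256 scheme can be sketched with the standard library. This standalone helper is illustrative only; the exact JSON serialization foodspec uses (key ordering, separators) is an assumption here.

```python
import hashlib
import json

def metadata_hash(metadata: dict) -> str:
    # Serialize deterministically (sorted keys) so equal dicts hash equally,
    # then keep the first 8 hex characters of the SHA256 digest.
    payload = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]

h = metadata_hash({"sample_id": "S1", "instrument": "raman-785"})
print(len(h))  # 8
```

Sorting keys makes the hash independent of dict insertion order, which is what makes it usable for reproducibility tracking.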

n_points property

n_points

Number of spectral points.

Returns:

- int: Length of x/y.

__post_init__

__post_init__()

Validate and normalize inputs (shapes, modality, metadata).

__repr__

__repr__()

String representation.

copy

copy()

Return a deep copy of this spectrum.

Returns:

- Spectrum: Independent copy.

crop_wavenumber

crop_wavenumber(x_min, x_max)

Crop spectrum to a wavenumber/wavelength range.

Parameters:

- x_min (float): Minimum axis value. Required.
- x_max (float): Maximum axis value. Required.

Returns:

- Spectrum: New spectrum with cropped data.

Raises:

- ValueError: If the range contains no points.
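
The cropping logic amounts to a boolean mask over the axis. A minimal NumPy sketch (standalone arrays, not the Spectrum API):

```python
import numpy as np

def crop(x: np.ndarray, y: np.ndarray, x_min: float, x_max: float):
    # Keep only points whose axis value falls inside [x_min, x_max].
    mask = (x >= x_min) & (x <= x_max)
    if not mask.any():
        raise ValueError("range contains no points")
    return x[mask], y[mask]

x = np.array([500.0, 750.0, 1000.0, 1250.0])
y = np.array([0.1, 0.5, 0.9, 0.2])
xc, yc = crop(x, y, 700, 1100)
print(xc.tolist())  # [750.0, 1000.0]
```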

normalize

normalize(method='vector')

Normalize spectrum.

Parameters:

- method (str): One of "vector", "max", or "area". Default: 'vector'.

Returns:

- Spectrum: Normalized spectrum.

Raises:

- ValueError: If method is unknown.
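
The three methods can be sketched in NumPy. The sum-based "area" below is an assumption of this sketch; the library may integrate over the axis instead.

```python
import numpy as np

def normalize(y: np.ndarray, method: str = "vector") -> np.ndarray:
    if method == "vector":  # unit Euclidean norm
        return y / np.linalg.norm(y)
    if method == "max":     # peak intensity scaled to 1
        return y / y.max()
    if method == "area":    # unit total intensity (sum-based area, an assumption)
        return y / y.sum()
    raise ValueError(f"unknown normalization method: {method}")

y = np.array([1.0, 3.0, 2.0])
print(float(np.linalg.norm(normalize(y))))  # 1.0
```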

OutputBundle

Structured output packaging for analysis results.

Unified container for workflow outputs: metrics, diagnostics, provenance, artifacts.

Manages the triple output (metrics + diagnostics + provenance) and exports to disk.

Parameters:

- run_record (RunRecord): Provenance record for the workflow. Required.

Attributes:

- metrics (dict): Quantitative results (accuracy, F1, RMSE, etc.).
- diagnostics (dict): Plots and tables (confusion matrix, feature importance, etc.).
- artifacts (dict): Portable exports (model, preprocessor, etc.).
- run_record (RunRecord): Provenance.

__repr__

__repr__()

String representation.

add_artifact

add_artifact(name, value)

Add an artifact (model, preprocessor, scaler, etc.).

Parameters:

- name (str): Artifact name (e.g., "model"). Required.
- value (Any): Artifact object. Required.

add_diagnostic

add_diagnostic(name, value)

Add a diagnostic (plot, table, figure).

Parameters:

- name (str): Diagnostic name (e.g., "confusion_matrix"). Required.
- value (Any): Diagnostic (Figure, ndarray, DataFrame, dict, str). Required.

add_metrics

add_metrics(name, value)

Add a metric.

Parameters:

- name (str): Metric name (e.g., "accuracy"). Required.
- value (Any): Metric value (number, array, DataFrame). Required.

export

export(output_dir, formats=None)

Export bundle to disk.

Exports:

- metrics.json
- diagnostics/ (plots as PNG/PDF, tables as CSV)
- artifacts/ (models as joblib/pickle)
- provenance.json (run_record)

Parameters:

- output_dir (str | Path): Output directory. Required.
- formats (list[str] | None): Export formats. Default: ["json", "csv", "png", "joblib"].

Returns:

- Path: Output directory path.
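
The JSON part of this layout can be sketched with the standard library. This standalone export_bundle helper is hypothetical and writes only the two JSON files; the real method also handles diagnostics and artifacts.

```python
import json
import tempfile
from pathlib import Path

def export_bundle(metrics: dict, provenance: dict, output_dir) -> Path:
    # Mirror the documented layout: metrics.json + provenance.json.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (out / "provenance.json").write_text(json.dumps(provenance, indent=2))
    return out

tmp = tempfile.mkdtemp()
out = export_bundle({"accuracy": 0.95}, {"workflow_name": "demo"}, tmp)
print(sorted(p.name for p in out.iterdir()))  # ['metrics.json', 'provenance.json']
```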

summary

summary()

Generate human-readable summary of outputs.

Returns:

- str: Summary string.

RunRecord

Provenance tracking for reproducible analyses.

Immutable record of a workflow execution with full provenance.

Tracks configuration, dataset hash, step history, environment, timing, and user info.

Parameters:

- workflow_name (str): Name of the workflow (e.g., "oil_authentication"). Required.
- config (dict): Configuration parameters. Required.
- dataset_hash (str): SHA256 hash of input dataset. Required.
- environment (dict | None): Environment info (Python version, package versions, etc.). Default: dict().
- step_records (list[dict] | None): Step records: {"name", "hash", "timestamp", "error"}. Default: list().
- user (str | None): User who ran the workflow. Default: None.
- notes (str | None): Freeform notes. Default: None.

Attributes:

- workflow_name (str): Name of the workflow.
- config (Dict[str, Any]): Configuration parameters.
- config_hash (str): SHA256 hash of config.
- dataset_hash (str): SHA256 hash of input dataset.
- environment (Dict[str, Any]): Environment info (Python version, packages, etc.).
- step_records (List[Dict[str, Any]]): Step records with name, hash, timestamp, error.
- user (Optional[str]): User who ran the workflow.
- notes (Optional[str]): Freeform notes.
- timestamp (str): ISO 8601 timestamp (UTC).
- run_id (str): Unique run identifier.

combined_hash property

combined_hash

Combined hash of config + dataset + all steps.

Returns:

- str: First 8 hex chars of combined SHA256.

config_hash property

config_hash

Hash of configuration.

Returns:

- str: First 8 hex chars of SHA256 over config JSON.

run_id property

run_id

Unique run identifier combining workflow name and timestamp.

Returns:

- str: Deterministic identifier for the run.

__post_init__

__post_init__()

Finalize and validate run record.

__repr__

__repr__()

String representation.

add_output_path

add_output_path(path)

Record an output path for the run (e.g., exported bundle location).

Parameters:

- path (str | Path): Output directory/file path. Required.

add_step

add_step(name, step_hash, error=None, metadata=None)

Record a workflow step.

Parameters:

- name (str): Step name (e.g., "baseline_correction"). Required.
- step_hash (str): Hash of step output or configuration. Required.
- error (str | None): Error message if the step failed. Default: None.
- metadata (dict | None): Additional metadata for this step. Default: None.

from_json classmethod

from_json(path)

Load from JSON file.

Parameters:

- path (Path | str): Input file path. Required.

Returns:

- RunRecord: Deserialized record.

to_dict

to_dict()

Serialize to dictionary.

Returns:

- Dict[str, Any]: JSON-serializable representation of the run record.

to_json

to_json(path)

Write to JSON file.

Parameters:

- path (Path | str): Output file path. Required.

Returns:

- Path: Path to the written file.

Advanced Data Structures

SpectralDataset

Extended dataset with preprocessing pipeline integration.

Matrix-form spectra with aligned metadata and instrument info.

Attributes:

- wavenumbers (ndarray): Axis values (cm^-1).
- spectra (ndarray): Intensities of shape (n_samples, n_points).
- metadata (DataFrame): Sample annotations.
- instrument_meta (dict): Instrument/protocol metadata.
- logs (list[str]): Operation logs.
- history (list[dict]): Preprocessing steps applied.

copy

copy()

Deep copy of dataset arrays, metadata, and meta fields.

Returns:

- SpectralDataset: Independent copy.

from_hdf5 staticmethod

from_hdf5(path, *, allow_future=False)

Load dataset from HDF5, supporting legacy layout.

Parameters:

- path (str | Path): HDF5 file path. Required.
- allow_future (bool): If True, tolerate a newer minor schema version. Default: False.

Returns:

- SpectralDataset: Reconstructed dataset.

Raises:

- ImportError: If h5py is not installed.
- ValueError: If the schema version is incompatible.

preprocess

preprocess(options)

Apply configured preprocessing to spectra.

Parameters:

- options (PreprocessingConfig): Pipeline options. Required.

Returns:

- SpectralDataset: New dataset with processed spectra.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.), np.ones((2,3)), pd.DataFrame())
>>> cfg = PreprocessingConfig(baseline_method="none", smoothing_method="none", normalization="none", spike_removal=False)
>>> ds2 = ds.preprocess(cfg)
>>> ds2.spectra.shape
(2, 3)

save_hdf5

save_hdf5(path)

Save dataset to HDF5 with schema versioning and legacy keys.

Parameters:

- path (str | Path): Destination file path. Required.

Raises:

- ImportError: If h5py is not installed.

to_peaks

to_peaks(peaks)

Extract peak features into a wide DataFrame.

Parameters:

- peaks (Iterable[PeakDefinition]): Peak definitions with names, windows, and modes. Required.

Returns:

- pandas.DataFrame: Metadata columns plus one column per peak.
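
Peak extraction can be sketched as window reductions over the spectra matrix. The dict-of-windows input and the fixed "max" mode below are assumptions standing in for PeakDefinition, which also carries a mode.

```python
import numpy as np
import pandas as pd

def extract_peaks(wavenumbers, spectra, metadata, peaks):
    # peaks: {name: (lo, hi)} wavenumber windows; one output column per peak,
    # here reduced with the window maximum.
    out = metadata.copy()
    for name, (lo, hi) in peaks.items():
        mask = (wavenumbers >= lo) & (wavenumbers <= hi)
        out[name] = spectra[:, mask].max(axis=1)
    return out

wn = np.array([1000.0, 1250.0, 1500.0, 1750.0])
X = np.array([[0.1, 0.9, 0.3, 0.2], [0.2, 0.4, 0.8, 0.1]])
df = extract_peaks(wn, X, pd.DataFrame({"label": ["A", "B"]}), {"amide": (1200, 1600)})
print(df["amide"].tolist())  # [0.9, 0.8]
```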

HyperspectralDataset

Hyperspectral imaging (3D spatial-spectral) data container.

Bases: SpectralDataset

Hyperspectral cube flattened to spectra with spatial tracking.

The cube has shape (y, x, wn) but is stored as spectra with shape (n_pixels, wn); spatial dimensions are given by shape_xy.

copy

copy()

Deep copy of dataset arrays, metadata, and meta fields.

Returns:

- SpectralDataset: Independent copy.

from_cube staticmethod

from_cube(
    cube, wavenumbers, metadata, instrument_meta=None
)

Construct flattened dataset from a hyperspectral cube.

Parameters:

- cube (ndarray): Array of shape (y, x, wn). Required.
- wavenumbers (ndarray): Axis values. Required.
- metadata (DataFrame): Sample/site metadata. Required.
- instrument_meta (dict | None): Instrument details. Default: None.

Returns:

- HyperspectralDataset: Flattened dataset with shape_xy set.
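
The flattening step is a reshape that records the spatial dimensions for later round-trips. A minimal standalone sketch:

```python
import numpy as np

def flatten_cube(cube: np.ndarray):
    # (y, x, wn) -> (n_pixels, wn); shape_xy remembers the spatial grid
    # so the cube can be reconstructed later.
    y, x, wn = cube.shape
    return cube.reshape(y * x, wn), (y, x)

cube = np.arange(24.0).reshape(2, 3, 4)  # y=2, x=3, 4 wavenumber points
spectra, shape_xy = flatten_cube(cube)
print(spectra.shape, shape_xy)  # (6, 4) (2, 3)
```

Reshaping back with `spectra.reshape(*shape_xy, -1)` recovers the original cube, which is what to_cube relies on.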

from_hdf5 staticmethod

from_hdf5(path, *, allow_future=False)

Load hyperspectral dataset from HDF5, supporting legacy layout.

Parameters:

- path (str | Path): HDF5 path. Required.
- allow_future (bool): If True, tolerate a newer minor schema version. Default: False.

Returns:

- HyperspectralDataset: Reconstructed dataset with shape_xy and optional ROI assets.

Raises:

- ImportError: If h5py is not installed.
- ValueError: If the schema version is incompatible.

preprocess

preprocess(options)

Apply configured preprocessing to spectra.

Parameters:

- options (PreprocessingConfig): Pipeline options. Required.

Returns:

- SpectralDataset: New dataset with processed spectra.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.), np.ones((2,3)), pd.DataFrame())
>>> cfg = PreprocessingConfig(baseline_method="none", smoothing_method="none", normalization="none", spike_removal=False)
>>> ds2 = ds.preprocess(cfg)
>>> ds2.spectra.shape
(2, 3)

roi_spectrum

roi_spectrum(mask)

Average spectrum from a binary ROI mask.

Parameters:

- mask (ndarray): Boolean/int array of shape shape_xy. Required.

Returns:

- SpectralDataset: Single-row dataset with the average ROI spectrum.

Raises:

- ValueError: If the mask shape mismatches shape_xy.
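
ROI averaging selects the flattened pixels picked out by the mask and averages their spectra. A standalone sketch of that selection, not the dataset API:

```python
import numpy as np

def roi_mean_spectrum(spectra, shape_xy, mask):
    # spectra: (n_pixels, wn) flattened cube; mask: boolean array of shape_xy.
    if mask.shape != shape_xy:
        raise ValueError("mask shape mismatches shape_xy")
    flat = mask.reshape(-1).astype(bool)
    return spectra[flat].mean(axis=0)

spectra = np.vstack([np.zeros(3), np.ones(3), np.full(3, 2.0), np.full(3, 3.0)])
mask = np.array([[True, False], [False, True]])  # shape_xy = (2, 2)
avg = roi_mean_spectrum(spectra, (2, 2), mask)
print(avg.tolist())  # [1.5, 1.5, 1.5]
```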

save_hdf5

save_hdf5(path)

Save hyperspectral cube to HDF5, including ROI artifacts if present.

Parameters:

- path (str | Path): Destination file path. Required.

Raises:

- ImportError: If h5py is not installed.

segment

segment(method='kmeans', n_clusters=3)

Segment pixels into clusters using kmeans, hierarchical, or NMF.

Parameters:

- method (str): "kmeans" | "hierarchical" | "nmf". Default: 'kmeans'.
- n_clusters (int): Number of clusters/components. Default: 3.

Returns:

- np.ndarray: Label map of shape shape_xy.

Raises:

- ValueError: If method is unknown.

to_cube

to_cube()

Reshape spectra back to (y, x, wn) cube using shape_xy.

Returns:

- np.ndarray: Hyperspectral cube of shape (y, x, wn).

to_peaks

to_peaks(peaks)

Extract peak features into a wide DataFrame.

Parameters:

- peaks (Iterable[PeakDefinition]): Peak definitions with names, windows, and modes. Required.

Returns:

- pandas.DataFrame: Metadata columns plus one column per peak.

Helper Functions

Conversion Utilities

to_sklearn

Return (X, y) arrays suitable for scikit-learn.

Parameters:

- ds (FoodSpectrumSet): Dataset to convert. Required.
- label_col (Optional[str]): Column to use for labels; if None, uses ds.label_col if available. Default: None.

Returns:

- (X, y): X with shape (n_samples, n_features); y is None if the label column is not found.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame({"label": [0, 1]}))
>>> X, y = to_sklearn(ds)
>>> X.shape, y.tolist()
((2, 3), [0, 1])

from_sklearn

Create a FoodSpectrumSet from scikit-learn style inputs.

Parameters:

- X (np.ndarray): Feature matrix of shape (n_samples, n_wavenumbers). Required.
- y (Optional[Sequence]): Optional labels aligned to rows in X.
- wavenumbers (Sequence[float]): Spectral axis values; if empty, uses 0..n_features-1.
- modality (Modality): Modality tag (e.g., 'raman').
- labels_name (str): Name of the label column in metadata if y is provided.

Returns:

- FoodSpectrumSet: Dataset constructed from the matrix and optional labels.

Examples:

>>> import numpy as np
>>> ds = from_sklearn(np.ones((2, 4)), y=[0, 1], wavenumbers=[1.0, 2.0, 3.0, 4.0])
>>> ds.wavenumbers.tolist()
[1.0, 2.0, 3.0, 4.0]

See Also