Core API

Core data structures and workflows for spectral analysis.

The foodspec.core module provides foundational classes for working with spectral data, including dataset containers, single spectrum operations, and result packaging.

Main Classes

FoodSpectrumSet

Primary container for spectral datasets: a collection of spectra with aligned metadata and axis information.

Parameters:

- x (ndarray): Array of shape (n_samples, n_wavenumbers) containing spectral intensities.
- wavenumbers (ndarray): Array of shape (n_wavenumbers,) with the spectral axis values.
- metadata (DataFrame): One row per sample storing labels and acquisition info.
- modality (str): Spectroscopy modality identifier: "raman", "ftir", or "nir".

batch_ids property

batch_ids

Return batch identifier column if configured.

Returns:

- Optional[Series]: Batch/run identifiers, or None if batch_col is not set or missing.

groups property

groups

Return grouping column if configured.

Returns:

- Optional[Series]: Group identifiers (e.g., folds), or None if group_col is not set or missing.

labels property

labels

Return label column if configured.

Returns:

- Optional[Series]: Label values aligned to samples, or None if label_col is not set or missing.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": ["A", "B"]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=meta)
>>> ds.labels.tolist()
['A', 'B']

__getitem__

__getitem__(index)

Return a subset by integer position.

Parameters:

- index (int | slice): Zero-based row index or slice over samples. Required.

Returns:

- FoodSpectrumSet: New dataset containing x/metadata rows selected by index; wavenumbers are copied.

Raises:

- IndexError: If an integer index is out of range.
- TypeError: If index is not an int or slice.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.arange(6).reshape(3, 2), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> ds_sub = ds[1:]
>>> ds_sub.x.shape
(2, 2)

__len__

__len__()

Number of spectra in the set.

Returns:

- int: Number of samples (axis 0 of x).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(
...     x=np.ones((3, 5)),
...     wavenumbers=np.arange(5),
...     metadata=pd.DataFrame({"label": [0, 1, 0]}),
... )
>>> len(ds)
3

add_metadata_column

add_metadata_column(name, values, *, overwrite=False)

Attach a metadata column aligned with spectra.

Parameters:

- name (str): Column name to add to metadata. Required.
- values (Sequence[Any]): Iterable of length n_samples containing values aligned to rows. Required.
- overwrite (bool): If True, replace an existing column of the same name; otherwise raise. Default: False.

Returns:

- FoodSpectrumSet: New dataset with the added/overwritten column.

Raises:

- ValueError: If lengths mismatch, or the column exists and overwrite is False.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> ds2 = ds.add_metadata_column("batch", [1, 2])
>>> ds2.metadata["batch"].tolist()
[1, 2]

apply

apply(func, *, inplace=False)

Apply a vectorized operation to all spectra.

Parameters:

- func (Callable[[ndarray], ndarray]): Function that accepts x (shape (n_samples, n_wavenumbers)) and returns an array of the same shape. Required.
- inplace (bool): If True, modify x in place and return self; if False, return a new dataset copy. Default: False.

Returns:

- FoodSpectrumSet: Self (if inplace=True) or a new dataset with transformed spectra.

Raises:

- ValueError: If the returned array shape differs from x.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> ds2 = ds.apply(lambda arr: arr * 2)
>>> float(ds2.x.mean())
2.0

concat classmethod

concat(datasets)

Concatenate multiple datasets with shared wavenumber grids.

Parameters:

- datasets (Sequence[FoodSpectrumSet]): Non-empty iterable of datasets with identical wavenumbers. Required.

Returns:

- FoodSpectrumSet: Combined dataset with stacked x rows and concatenated metadata; annotation column names are copied from the first dataset.

Raises:

- ValueError: If datasets is empty or wavenumber grids differ.

Examples:

>>> import numpy as np, pandas as pd
>>> ds1 = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame({"label": [0]}))
>>> ds2 = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame({"label": [1, 1]}))
>>> merged = FoodSpectrumSet.concat([ds1, ds2])
>>> merged.x.shape
(3, 2)

copy

copy(deep=True)

Return a copy of the dataset.

Parameters:

- deep (bool): If True, copy arrays/metadata; if False, reuse references (changes mutate the original data). Default: True.

Returns:

- FoodSpectrumSet: Copy with identical content.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> shallow = ds.copy(deep=False)
>>> shallow.x is ds.x
True

from_hdf5 classmethod

from_hdf5(path, key='foodspec')

Load dataset from HDF5 created by to_hdf5.

Parameters:

- path (str | Path): HDF5 file path produced by to_hdf5. Required.
- key (str): Prefix used when saving. Default: 'foodspec'.

Returns:

- FoodSpectrumSet: Dataset reconstructed from stored arrays and metadata.

Raises:

- FileNotFoundError: If path does not exist.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_hdf5(tmp.name)
>>> _ = FoodSpectrumSet.from_hdf5(tmp.name)

from_parquet classmethod

from_parquet(path)

Load dataset from Parquet created by to_parquet.

Parameters:

- path (str | Path): Parquet file written by to_parquet. Required.

Returns:

- FoodSpectrumSet: Dataset reconstructed from wide format.

Raises:

- FileNotFoundError: If path does not exist.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_parquet(tmp.name)
>>> _ = FoodSpectrumSet.from_parquet(tmp.name)

offset

offset(value, *, inplace=False)

Add a constant offset to spectral intensities.

Parameters:

- value (float): Constant added to every element of x. Required.
- inplace (bool): If True, mutate x and return self; otherwise return a new dataset. Default: False.

Returns:

- FoodSpectrumSet: Offset dataset (self if inplace=True).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.zeros((1, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> ds.offset(5).x.tolist()
[[5.0, 5.0, 5.0]]

scale

scale(factor, *, inplace=False)

Scale spectral intensities by a factor.

Parameters:

- factor (float): Multiplicative scalar applied to all intensities. Required.
- inplace (bool): If True, mutate x and return self; otherwise return a new dataset. Default: False.

Returns:

- FoodSpectrumSet: Scaled dataset (self if inplace=True).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> float(ds.scale(10).x.mean())
10.0

select_wavenumber_range

select_wavenumber_range(min_wn, max_wn)

Return spectra restricted to a wavenumber window.

Parameters:

- min_wn (float): Inclusive lower bound of wavenumber window. Required.
- max_wn (float): Inclusive upper bound of wavenumber window. Required.

Returns:

- FoodSpectrumSet: Dataset containing columns where min_wn <= wavenumbers <= max_wn; metadata unchanged.

Raises:

- ValueError: If bounds are inverted or no wavenumbers fall inside the interval.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 4)), wavenumbers=np.array([500., 750., 1000., 1250.]), metadata=pd.DataFrame())
>>> ds_win = ds.select_wavenumber_range(700, 1100)
>>> ds_win.wavenumbers.tolist()
[750.0, 1000.0]

subset

subset(by=None, indices=None)

Subset by metadata filters and/or explicit indices.

Parameters:

- by (dict[str, Any] | None): Column → value filters applied to metadata. If a value is sequence-like, membership (isin) is used; otherwise equality is used. Default: None.
- indices (Sequence[int] | None): Explicit zero-based indices to retain. If both by and indices are provided, their intersection (in the order of indices) is returned. Default: None.

Returns:

- FoodSpectrumSet: New dataset with selected rows; wavenumbers are preserved and metadata reindexed.

Raises:

- ValueError: If requested metadata columns are missing, indices are out of range, or indices are not 1D.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1, 0], "split": ["train", "test", "train"]})
>>> ds = FoodSpectrumSet(x=np.ones((3, 4)), wavenumbers=np.arange(4), metadata=meta)
>>> ds_train = ds.subset(by={"split": "train"})
>>> len(ds_train)
2

to_X_y

to_X_y(target_col)

Return (X, y) for a target column in metadata.

Parameters:

- target_col (str): Metadata column name to use as labels. Required.

Returns:

- tuple[np.ndarray, np.ndarray]: X with shape (n_samples, n_wavenumbers), y with shape (n_samples,).

Raises:

- ValueError: If target_col is missing from metadata.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 4)), wavenumbers=np.arange(4), metadata=meta)
>>> X, y = ds.to_X_y("label")
>>> X.shape, y.tolist()
((2, 4), [0, 1])

to_hdf5

to_hdf5(path, key='foodspec', mode='w', complevel=4)

Persist dataset to HDF5 (lazy-friendly storage).

Parameters:

- path (str | Path): Destination file path. Parent directories must exist. Required.
- key (str): Prefix for the HDF5 groups created (<key>_x, <key>_wn, <key>_meta, <key>_info). Default: 'foodspec'.
- mode (str): HDF5 store mode, e.g., "w" or "a". Default: 'w'.
- complevel (int): Compression level for zlib (0-9). Default: 4.

Returns:

- Path: Path to the written HDF5 file.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_hdf5(tmp.name)

to_parquet

to_parquet(path)

Persist dataset to Parquet using wide layout.

Parameters:

- path (str | Path): Destination parquet path. Required.

Returns:

- Path: Path to the written parquet file.

Examples:

>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_parquet(tmp.name)

to_wide_dataframe

to_wide_dataframe()

Convert to a wide DataFrame.

Returns:

- pandas.DataFrame: Metadata columns followed by intensity columns named int_<wavenumber> (floats preserved). Shape: (n_samples, n_metadata + n_wavenumbers).

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.array([1000., 1001., 1002.]), metadata=pd.DataFrame({"label": [0,1]}))
>>> df = ds.to_wide_dataframe()
>>> list(df.columns)[:2]
['label', 'int_1000.0']

train_test_split

train_test_split(
    target_col,
    test_size=0.3,
    stratify=True,
    random_state=None,
)

Split into train/test FoodSpectrumSets.

Parameters:

- target_col (str): Column in metadata used as labels for stratification and copied into splits. Required.
- test_size (float): Proportion of samples in the test split. Default: 0.3.
- stratify (bool): If True, stratify by target_col. Default: True.
- random_state (int | None): Seed for reproducibility. Default: None.

Returns:

- tuple[FoodSpectrumSet, FoodSpectrumSet]: (train_ds, test_ds) sharing the original wavenumber grid; metadata is reindexed.

Raises:

- ValueError: If target_col does not exist in metadata.

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1, 0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((4, 3)), wavenumbers=np.arange(3), metadata=meta)
>>> train, test = ds.train_test_split("label", test_size=0.5, random_state=0)
>>> len(train), len(test)
(2, 2)

validate

validate()

Validate array shapes, wavenumber axis, metadata length, and modality.

Raises:

- ValueError: If shapes mismatch, wavenumbers are non-monotonic, there are too few points (<3), metadata length mismatches samples, modality is invalid, or configured annotation columns are missing.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.array([1., 2., 3.]), metadata=pd.DataFrame())
>>> ds.validate()  # does not raise

with_annotations

with_annotations(
    *, label_col=None, group_col=None, batch_col=None
)

Return a copy with updated label/group/batch annotations.

Parameters:

- label_col (str | None): Name of label column in metadata. Default: None.
- group_col (str | None): Name of grouping column (e.g., folds). Default: None.
- batch_col (str | None): Name of batch identifier column. Default: None.

Returns:

- FoodSpectrumSet: Copy sharing data/wavenumbers but with annotation column names updated (metadata deep-copied).

Examples:

>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"y": [0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=meta)
>>> ds2 = ds.with_annotations(label_col="y")
>>> ds2.label_col
'y'

Spectrum

Single spectrum data model with axis, intensity, units, kind, and metadata. Represents a single spectroscopic measurement with validation and provenance tracking.

Parameters:

- x (ndarray): X-axis (wavenumber/wavelength), shape (n_points,). Required.
- y (ndarray): Intensity values, shape (n_points,). Required.
- kind (Literal['raman', 'ftir', 'nir']): Spectroscopy modality. Required.
- x_unit (Literal['cm-1', 'nm', 'um', '1/cm']): Axis unit. Default: 'cm-1'.
- metadata (dict): Optional metadata (sample_id, instrument, etc.). Default: dict().

Attributes:

- x (ndarray): X-axis data.
- y (ndarray): Y-axis data.
- kind (str): Modality.
- x_unit (str): Unit of x-axis.
- metadata (dict): Validated metadata.
- config_hash (str): Hash of metadata for reproducibility tracking.

config_hash property

config_hash

Hash of metadata for reproducibility tracking.

Returns:

- str: First 8 hex chars of SHA256 over metadata JSON.
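
The first-8-hex-of-SHA256 scheme can be sketched with the standard library. This standalone helper is illustrative only; the exact JSON serialization foodspec uses (key ordering, separators) is an assumption here.

```python
import hashlib
import json

def metadata_hash(metadata: dict) -> str:
    # Serialize deterministically (sorted keys) so equal dicts hash equally,
    # then keep the first 8 hex characters of the SHA256 digest.
    payload = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]

h = metadata_hash({"sample_id": "S1", "instrument": "raman-785"})
print(len(h))  # 8
```

Sorting keys makes the hash independent of dict insertion order, which is what makes it usable for reproducibility tracking.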

n_points property

n_points

Number of spectral points.

Returns:

- int: Length of x/y.

__post_init__

__post_init__()

Validate and normalize inputs (shapes, modality, metadata).

__repr__

__repr__()

String representation.

copy

copy()

Return a deep copy of this spectrum.

Returns:

- Spectrum: Independent copy.

crop_wavenumber

crop_wavenumber(x_min, x_max)

Crop spectrum to a wavenumber/wavelength range.

Parameters:

- x_min (float): Minimum axis value. Required.
- x_max (float): Maximum axis value. Required.

Returns:

- Spectrum: New spectrum with cropped data.

Raises:

- ValueError: If the range contains no points.
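
The cropping logic amounts to a boolean mask over the axis. A minimal NumPy sketch (standalone arrays, not the Spectrum API):

```python
import numpy as np

def crop(x: np.ndarray, y: np.ndarray, x_min: float, x_max: float):
    # Keep only points whose axis value falls inside [x_min, x_max].
    mask = (x >= x_min) & (x <= x_max)
    if not mask.any():
        raise ValueError("range contains no points")
    return x[mask], y[mask]

x = np.array([500.0, 750.0, 1000.0, 1250.0])
y = np.array([0.1, 0.5, 0.9, 0.2])
xc, yc = crop(x, y, 700, 1100)
print(xc.tolist())  # [750.0, 1000.0]
```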

normalize

normalize(method='vector')

Normalize spectrum.

Parameters:

- method (str): One of "vector", "max", or "area". Default: 'vector'.

Returns:

- Spectrum: Normalized spectrum.

Raises:

- ValueError: If method is unknown.
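
The three methods can be sketched in NumPy. The sum-based "area" below is an assumption of this sketch; the library may integrate over the axis instead.

```python
import numpy as np

def normalize(y: np.ndarray, method: str = "vector") -> np.ndarray:
    if method == "vector":  # unit Euclidean norm
        return y / np.linalg.norm(y)
    if method == "max":     # peak intensity scaled to 1
        return y / y.max()
    if method == "area":    # unit total intensity (sum-based area, an assumption)
        return y / y.sum()
    raise ValueError(f"unknown normalization method: {method}")

y = np.array([1.0, 3.0, 2.0])
print(float(np.linalg.norm(normalize(y))))  # 1.0
```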

OutputBundle

Structured output packaging for analysis results.

Unified container for workflow outputs: metrics, diagnostics, provenance, artifacts.

Manages the triple output (metrics + diagnostics + provenance) and exports to disk.

Parameters:

- run_record (RunRecord): Provenance record for the workflow. Required.

Attributes:

- metrics (dict): Quantitative results (accuracy, F1, RMSE, etc.).
- diagnostics (dict): Plots and tables (confusion matrix, feature importance, etc.).
- artifacts (dict): Portable exports (model, preprocessor, etc.).
- run_record (RunRecord): Provenance.

__repr__

__repr__()

String representation.

add_artifact

add_artifact(name, value)

Add an artifact (model, preprocessor, scaler, etc.).

Parameters:

- name (str): Artifact name (e.g., "model"). Required.
- value (Any): Artifact object. Required.

add_diagnostic

add_diagnostic(name, value)

Add a diagnostic (plot, table, figure).

Parameters:

- name (str): Diagnostic name (e.g., "confusion_matrix"). Required.
- value (Any): Diagnostic (Figure, ndarray, DataFrame, dict, str). Required.

add_metrics

add_metrics(name, value)

Add a metric.

Parameters:

- name (str): Metric name (e.g., "accuracy"). Required.
- value (Any): Metric value (number, array, DataFrame). Required.

export

export(output_dir, formats=None)

Export bundle to disk.

Exports:

- metrics.json
- diagnostics/ (plots as PNG/PDF, tables as CSV)
- artifacts/ (models as joblib/pickle)
- provenance.json (run_record)

Parameters:

- output_dir (str | Path): Output directory. Required.
- formats (list[str] | None): Export formats. Default: ["json", "csv", "png", "joblib"].

Returns:

- Path: Output directory path.
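
The JSON part of this layout can be sketched with the standard library. This standalone export_bundle helper is hypothetical and writes only the two JSON files; the real method also handles diagnostics and artifacts.

```python
import json
import tempfile
from pathlib import Path

def export_bundle(metrics: dict, provenance: dict, output_dir) -> Path:
    # Mirror the documented layout: metrics.json + provenance.json.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (out / "provenance.json").write_text(json.dumps(provenance, indent=2))
    return out

tmp = tempfile.mkdtemp()
out = export_bundle({"accuracy": 0.95}, {"workflow_name": "demo"}, tmp)
print(sorted(p.name for p in out.iterdir()))  # ['metrics.json', 'provenance.json']
```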

summary

summary()

Generate human-readable summary of outputs.

Returns:

- str: Summary string.

RunRecord

Provenance tracking for reproducible analyses.

Immutable record of a workflow execution with full provenance.

Tracks configuration, dataset hash, step history, environment, timing, and user info.

Parameters:

- workflow_name (str): Name of the workflow (e.g., "oil_authentication"). Required.
- config (dict): Configuration parameters. Required.
- dataset_hash (str): SHA256 hash of input dataset. Required.
- environment (dict | None): Environment info (Python version, package versions, etc.). Default: dict().
- step_records (list[dict] | None): Step records: {"name", "hash", "timestamp", "error"}. Default: list().
- user (str | None): User who ran the workflow. Default: None.
- notes (str | None): Freeform notes. Default: None.

Attributes:

- workflow_name (str): Name of the workflow.
- config (Dict[str, Any]): Configuration parameters.
- config_hash (str): SHA256 hash of config.
- dataset_hash (str): SHA256 hash of input dataset.
- environment (Dict[str, Any]): Environment info (Python version, packages, etc.).
- step_records (List[Dict[str, Any]]): Step records with name, hash, timestamp, error.
- user (Optional[str]): User who ran the workflow.
- notes (Optional[str]): Freeform notes.
- timestamp (str): ISO 8601 timestamp (UTC).
- run_id (str): Unique run identifier.

combined_hash property

combined_hash

Combined hash of config + dataset + all steps.

Returns:

- str: First 8 hex chars of combined SHA256.

config_hash property

config_hash

Hash of configuration.

Returns:

- str: First 8 hex chars of SHA256 over config JSON.

run_id property

run_id

Unique run identifier combining workflow name and timestamp.

Returns:

- str: Deterministic identifier for the run.

__post_init__

__post_init__()

Finalize and validate run record.

__repr__

__repr__()

String representation.

add_output_path

add_output_path(path)

Record an output path for the run (e.g., exported bundle location).

Parameters:

- path (str | Path): Output directory/file path. Required.

add_step

add_step(name, step_hash, error=None, metadata=None)

Record a workflow step.

Parameters:

- name (str): Step name (e.g., "baseline_correction"). Required.
- step_hash (str): Hash of step output or configuration. Required.
- error (str | None): Error message if the step failed. Default: None.
- metadata (dict | None): Additional metadata for this step. Default: None.

from_json classmethod

from_json(path)

Load from JSON file.

Parameters:

- path (Path | str): Input file path. Required.

Returns:

- RunRecord: Deserialized record.

to_dict

to_dict()

Serialize to dictionary.

Returns:

- Dict[str, Any]: JSON-serializable representation of the run record.

to_json

to_json(path)

Write to JSON file.

Parameters:

- path (Path | str): Output file path. Required.

Returns:

- Path: Path to the written file.

Advanced Data Structures

SpectralDataset

Extended dataset with preprocessing pipeline integration.

Matrix-form spectra with aligned metadata and instrument info.

Attributes:

- wavenumbers (ndarray): Axis values (cm^-1).
- spectra (ndarray): Intensities of shape (n_samples, n_points).
- metadata (DataFrame): Sample annotations.
- instrument_meta (dict): Instrument/protocol metadata.
- logs (list[str]): Operation logs.
- history (list[dict]): Preprocessing steps applied.

copy

copy()

Deep copy of dataset arrays, metadata, and meta fields.

Returns:

- SpectralDataset: Independent copy.

from_hdf5 staticmethod

from_hdf5(path, *, allow_future=False)

Load dataset from HDF5, supporting legacy layout.

Parameters:

- path (str | Path): HDF5 file path. Required.
- allow_future (bool): If True, tolerate a newer minor schema version. Default: False.

Returns:

- SpectralDataset: Reconstructed dataset.

Raises:

- ImportError: If h5py is not installed.
- ValueError: If the schema version is incompatible.

preprocess

preprocess(options)

Apply configured preprocessing to spectra.

Parameters:

- options (PreprocessingConfig): Pipeline options. Required.

Returns:

- SpectralDataset: New dataset with processed spectra.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.), np.ones((2,3)), pd.DataFrame())
>>> cfg = PreprocessingConfig(baseline_method="none", smoothing_method="none", normalization="none", spike_removal=False)
>>> ds2 = ds.preprocess(cfg)
>>> ds2.spectra.shape
(2, 3)

save_hdf5

save_hdf5(path)

Save dataset to HDF5 with schema versioning and legacy keys.

Parameters:

- path (str | Path): Destination file path. Required.

Raises:

- ImportError: If h5py is not installed.

to_peaks

to_peaks(peaks)

Extract peak features into a wide DataFrame.

Parameters:

- peaks (Iterable[PeakDefinition]): Peak definitions with names, windows, and modes. Required.

Returns:

- pandas.DataFrame: Metadata columns plus one column per peak.
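
Peak extraction can be sketched as window reductions over the spectra matrix. The dict-of-windows input and the fixed "max" mode below are assumptions standing in for PeakDefinition, which also carries a mode.

```python
import numpy as np
import pandas as pd

def extract_peaks(wavenumbers, spectra, metadata, peaks):
    # peaks: {name: (lo, hi)} wavenumber windows; one output column per peak,
    # here reduced with the window maximum.
    out = metadata.copy()
    for name, (lo, hi) in peaks.items():
        mask = (wavenumbers >= lo) & (wavenumbers <= hi)
        out[name] = spectra[:, mask].max(axis=1)
    return out

wn = np.array([1000.0, 1250.0, 1500.0, 1750.0])
X = np.array([[0.1, 0.9, 0.3, 0.2], [0.2, 0.4, 0.8, 0.1]])
df = extract_peaks(wn, X, pd.DataFrame({"label": ["A", "B"]}), {"amide": (1200, 1600)})
print(df["amide"].tolist())  # [0.9, 0.8]
```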

HyperspectralDataset

Hyperspectral imaging (3D spatial-spectral) data container.

Bases: SpectralDataset

Hyperspectral cube flattened to spectra with spatial tracking.

The cube has shape (y, x, wn) but is stored as spectra with shape (n_pixels, wn); spatial dimensions are given by shape_xy.

copy

copy()

Deep copy of dataset arrays, metadata, and meta fields.

Returns:

- SpectralDataset: Independent copy.

from_cube staticmethod

from_cube(
    cube, wavenumbers, metadata, instrument_meta=None
)

Construct flattened dataset from a hyperspectral cube.

Parameters:

- cube (ndarray): Array of shape (y, x, wn). Required.
- wavenumbers (ndarray): Axis values. Required.
- metadata (DataFrame): Sample/site metadata. Required.
- instrument_meta (dict | None): Instrument details. Default: None.

Returns:

- HyperspectralDataset: Flattened dataset with shape_xy set.
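
The flattening step is a reshape that records the spatial dimensions for later round-trips. A minimal standalone sketch:

```python
import numpy as np

def flatten_cube(cube: np.ndarray):
    # (y, x, wn) -> (n_pixels, wn); shape_xy remembers the spatial grid
    # so the cube can be reconstructed later.
    y, x, wn = cube.shape
    return cube.reshape(y * x, wn), (y, x)

cube = np.arange(24.0).reshape(2, 3, 4)  # y=2, x=3, 4 wavenumber points
spectra, shape_xy = flatten_cube(cube)
print(spectra.shape, shape_xy)  # (6, 4) (2, 3)
```

Reshaping back with `spectra.reshape(*shape_xy, -1)` recovers the original cube, which is what to_cube relies on.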

from_hdf5 staticmethod

from_hdf5(path, *, allow_future=False)

Load hyperspectral dataset from HDF5, supporting legacy layout.

Parameters:

- path (str | Path): HDF5 path. Required.
- allow_future (bool): If True, tolerate a newer minor schema version. Default: False.

Returns:

- HyperspectralDataset: Reconstructed dataset with shape_xy and optional ROI assets.

Raises:

- ImportError: If h5py is not installed.
- ValueError: If the schema version is incompatible.

preprocess

preprocess(options)

Apply configured preprocessing to spectra.

Parameters:

- options (PreprocessingConfig): Pipeline options. Required.

Returns:

- SpectralDataset: New dataset with processed spectra.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.), np.ones((2,3)), pd.DataFrame())
>>> cfg = PreprocessingConfig(baseline_method="none", smoothing_method="none", normalization="none", spike_removal=False)
>>> ds2 = ds.preprocess(cfg)
>>> ds2.spectra.shape
(2, 3)

roi_spectrum

roi_spectrum(mask)

Average spectrum from a binary ROI mask.

Parameters:

- mask (ndarray): Boolean/int array of shape shape_xy. Required.

Returns:

- SpectralDataset: Single-row dataset with the average ROI spectrum.

Raises:

- ValueError: If the mask shape mismatches shape_xy.
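
ROI averaging selects the flattened pixels picked out by the mask and averages their spectra. A standalone sketch of that selection, not the dataset API:

```python
import numpy as np

def roi_mean_spectrum(spectra, shape_xy, mask):
    # spectra: (n_pixels, wn) flattened cube; mask: boolean array of shape_xy.
    if mask.shape != shape_xy:
        raise ValueError("mask shape mismatches shape_xy")
    flat = mask.reshape(-1).astype(bool)
    return spectra[flat].mean(axis=0)

spectra = np.vstack([np.zeros(3), np.ones(3), np.full(3, 2.0), np.full(3, 3.0)])
mask = np.array([[True, False], [False, True]])  # shape_xy = (2, 2)
avg = roi_mean_spectrum(spectra, (2, 2), mask)
print(avg.tolist())  # [1.5, 1.5, 1.5]
```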

save_hdf5

save_hdf5(path)

Save hyperspectral cube to HDF5, including ROI artifacts if present.

Parameters:

- path (str | Path): Destination file path. Required.

Raises:

- ImportError: If h5py is not installed.

segment

segment(method='kmeans', n_clusters=3)

Segment pixels into clusters using kmeans, hierarchical, or NMF.

Parameters:

- method (str): "kmeans" | "hierarchical" | "nmf". Default: 'kmeans'.
- n_clusters (int): Number of clusters/components. Default: 3.

Returns:

- np.ndarray: Label map of shape shape_xy.

Raises:

- ValueError: If method is unknown.

to_cube

to_cube()

Reshape spectra back to (y, x, wn) cube using shape_xy.

Returns:

- np.ndarray: Hyperspectral cube of shape (y, x, wn).

to_peaks

to_peaks(peaks)

Extract peak features into a wide DataFrame.

Parameters:

- peaks (Iterable[PeakDefinition]): Peak definitions with names, windows, and modes. Required.

Returns:

- pandas.DataFrame: Metadata columns plus one column per peak.

Helper Functions

Conversion Utilities

to_sklearn

Return (X, y) arrays suitable for scikit-learn.

Parameters:

- ds (FoodSpectrumSet): Dataset to convert. Required.
- label_col (Optional[str]): Column to use for labels; if None, uses ds.label_col if available. Default: None.

Returns:

- (X, y): X with shape (n_samples, n_features); y is None if the label column is not found.

Examples:

>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame({"label": [0, 1]}))
>>> X, y = to_sklearn(ds)
>>> X.shape, y.tolist()
((2, 3), [0, 1])

from_sklearn

Create a FoodSpectrumSet from scikit-learn style inputs.

Parameters:

- X (np.ndarray): Feature matrix of shape (n_samples, n_wavenumbers). Required.
- y (Optional[Sequence]): Optional labels aligned to rows in X.
- wavenumbers (Sequence[float]): Spectral axis values; if empty, uses 0..n_features-1.
- modality (Modality): Modality tag (e.g., 'raman').
- labels_name (str): Name of the label column in metadata if y is provided.

Returns:

- FoodSpectrumSet: Dataset constructed from the matrix and optional labels.

Examples:

>>> import numpy as np
>>> ds = from_sklearn(np.ones((2, 4)), y=[0, 1], wavenumbers=[1.0, 2.0, 3.0, 4.0])
>>> ds.wavenumbers.tolist()
[1.0, 2.0, 3.0, 4.0]

See Also