Core API¶
Core data structures and workflows for spectral analysis.
The foodspec.core module provides foundational classes for working with spectral data, including dataset containers, single spectrum operations, and result packaging.
Main Classes¶
FoodSpectrumSet¶
Primary container for spectral datasets: a collection of spectra with aligned metadata and axis information.
Parameters¶
x : ndarray
Array of shape (n_samples, n_wavenumbers) containing spectral intensities.
wavenumbers : ndarray
Array of shape (n_wavenumbers,) with the spectral axis values.
metadata : DataFrame
DataFrame with one row per sample storing labels and acquisition info.
modality : str
Spectroscopy modality identifier: "raman", "ftir", or "nir".
batch_ids property ¶
Return the batch identifier column if configured.
Returns:

| Type | Description |
|---|---|
| Optional[Series] | Batch/run identifiers, or None if no batch column is configured. |
groups property ¶
Return the grouping column if configured.
Returns:

| Type | Description |
|---|---|
| Optional[Series] | Group identifiers (e.g., folds), or None if no group column is configured. |
labels property ¶
Return the label column if configured.
Returns:

| Type | Description |
|---|---|
| Optional[Series] | Label values aligned to samples, or None if no label column is configured. |
Examples:
>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": ["A", "B"]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=meta)
>>> ds.labels.tolist()
['A', 'B']
__getitem__ ¶
__getitem__(index)
Return a subset by integer position.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice | Zero-based row index or slice over samples. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | New dataset containing the samples selected by index. |

Raises:

| Type | Description |
|---|---|
| IndexError | If an integer index is out of range. |
| TypeError | If index is not an int or slice. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.arange(6).reshape(3, 2), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> ds_sub = ds[1:]
>>> ds_sub.x.shape
(2, 2)
__len__ ¶
__len__()
Number of spectra in the set.
Returns:

| Name | Type | Description |
|---|---|---|
| int | int | Number of samples (axis 0 of x). |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(
... x=np.ones((3, 5)),
... wavenumbers=np.arange(5),
... metadata=pd.DataFrame({"label": [0, 1, 0]}),
... )
>>> len(ds)
3
add_metadata_column ¶
add_metadata_column(name, values, *, overwrite=False)
Attach a metadata column aligned with spectra.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Column name to add to metadata. | required |
| values | Sequence[Any] | Iterable of length len(self) with per-sample values. | required |
| overwrite | bool | If True, replace an existing column of the same name; otherwise raise. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | New dataset with the added/overwritten column. |

Raises:

| Type | Description |
|---|---|
| ValueError | If lengths mismatch, or if the column exists and overwrite is False. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> ds2 = ds.add_metadata_column("batch", [1, 2])
>>> ds2.metadata["batch"].tolist()
[1, 2]
apply ¶
apply(func, *, inplace=False)
Apply a vectorized operation to all spectra.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| func | Callable[[ndarray], ndarray] | Function that accepts the (n_samples, n_wavenumbers) array and returns an array of the same shape. | required |
| inplace | bool | If True, modify x in place. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Self (if inplace=True) or a new dataset with transformed spectra. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the returned array shape differs from x.shape. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> ds2 = ds.apply(lambda arr: arr * 2)
>>> float(ds2.x.mean())
2.0
concat classmethod ¶
concat(datasets)
Concatenate multiple datasets with shared wavenumber grids.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| datasets | Sequence[FoodSpectrumSet] | Non-empty iterable of datasets with identical wavenumbers. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Combined dataset with stacked x, concatenated metadata, and the wavenumbers/modality of the first dataset. |

Raises:

| Type | Description |
|---|---|
| ValueError | If datasets is empty or the wavenumber grids differ. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds1 = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame({"label": [0]}))
>>> ds2 = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame({"label": [1, 1]}))
>>> merged = FoodSpectrumSet.concat([ds1, ds2])
>>> merged.x.shape
(3, 2)
copy ¶
copy(deep=True)
Return a copy of the dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| deep | bool | If True, copy arrays/metadata; if False, reuse references (changes mutate the original data). | True |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Copy with identical content. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> shallow = ds.copy(deep=False)
>>> shallow.x is ds.x
True
from_hdf5 classmethod ¶
from_hdf5(path, key='foodspec')
Load a dataset from an HDF5 file created by to_hdf5.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | HDF5 file path produced by to_hdf5. | required |
| key | str | Prefix used when saving (default "foodspec"). | 'foodspec' |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Dataset reconstructed from stored arrays and metadata. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If path does not exist. |
Examples:
>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_hdf5(tmp.name)
>>> _ = FoodSpectrumSet.from_hdf5(tmp.name)
from_parquet classmethod ¶
from_parquet(path)
Load a dataset from a Parquet file created by to_parquet.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Parquet file written by to_parquet. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Dataset reconstructed from wide format. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If path does not exist. |
Examples:
>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_parquet(tmp.name)
>>> _ = FoodSpectrumSet.from_parquet(tmp.name)
offset ¶
offset(value, *, inplace=False)
Add a constant offset to spectral intensities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | float | Constant added to every element of x. | required |
| inplace | bool | If True, mutate x in place. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Offset dataset (self if inplace=True). |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.zeros((1, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame())
>>> ds.offset(5).x.tolist()
[[5.0, 5.0, 5.0]]
scale ¶
scale(factor, *, inplace=False)
Scale spectral intensities by a factor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| factor | float | Multiplicative scalar applied to all intensities. | required |
| inplace | bool | If True, mutate x in place. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Scaled dataset (self if inplace=True). |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> float(ds.scale(10).x.mean())
10.0
select_wavenumber_range ¶
select_wavenumber_range(min_wn, max_wn)
Return spectra restricted to a wavenumber window.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| min_wn | float | Inclusive lower bound of the wavenumber window. | required |
| max_wn | float | Inclusive upper bound of the wavenumber window. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Dataset containing only the columns where min_wn <= wavenumbers <= max_wn. |

Raises:

| Type | Description |
|---|---|
| ValueError | If bounds are inverted or no wavenumbers fall inside the interval. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 4)), wavenumbers=np.array([500., 750., 1000., 1250.]), metadata=pd.DataFrame())
>>> ds_win = ds.select_wavenumber_range(700, 1100)
>>> ds_win.wavenumbers.tolist()
[750.0, 1000.0]
subset ¶
subset(by=None, indices=None)
Subset by metadata filters and/or explicit indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| by | dict[str, Any] \| None | Column → value filters applied to metadata. | None |
| indices | Sequence[int] \| None | Explicit zero-based indices to retain. If both by and indices are given, both selections are applied. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | New dataset with selected rows; wavenumbers are preserved and metadata reindexed. |

Raises:

| Type | Description |
|---|---|
| ValueError | If requested metadata columns are missing, indices are out of range, or indices are not 1D. |
Examples:
>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1, 0], "split": ["train", "test", "train"]})
>>> ds = FoodSpectrumSet(x=np.ones((3, 4)), wavenumbers=np.arange(4), metadata=meta)
>>> ds_train = ds.subset(by={"split": "train"})
>>> len(ds_train)
2
to_X_y ¶
to_X_y(target_col)
Return (X, y) for a target column in metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| target_col | str | Metadata column name to use as labels. | required |

Returns:

| Type | Description |
|---|---|
| tuple[ndarray, ndarray] | X of shape (n_samples, n_wavenumbers) and y taken from target_col, aligned by row. |

Raises:

| Type | Description |
|---|---|
| ValueError | If target_col is not present in metadata. |
Examples:
>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 4)), wavenumbers=np.arange(4), metadata=meta)
>>> X, y = ds.to_X_y("label")
>>> X.shape, y.tolist()
((2, 4), [0, 1])
to_hdf5 ¶
to_hdf5(path, key='foodspec', mode='w', complevel=4)
Persist dataset to HDF5 (lazy-friendly storage).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination file path. Parent directories must exist. | required |
| key | str | Prefix for the HDF5 groups created. | 'foodspec' |
| mode | str | HDF5 store mode, e.g., 'w' (write) or 'a' (append). | 'w' |
| complevel | int | Compression level for zlib (0-9). | 4 |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Path to the written HDF5 file. |
Examples:
>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_hdf5(tmp.name)
to_parquet ¶
to_parquet(path)
Persist dataset to Parquet using a wide layout.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination parquet path. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Path to the written parquet file. |
Examples:
>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False)
>>> ds = FoodSpectrumSet(x=np.ones((1, 2)), wavenumbers=np.arange(2), metadata=pd.DataFrame())
>>> _ = ds.to_parquet(tmp.name)
to_wide_dataframe ¶
to_wide_dataframe()
Convert to a wide DataFrame.
Returns:

| Type | Description |
|---|---|
| DataFrame | Metadata columns followed by intensity columns named int_<wavenumber>; shape (n_samples, n_metadata + n_wavenumbers). |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.array([1000., 1001., 1002.]), metadata=pd.DataFrame({"label": [0,1]}))
>>> df = ds.to_wide_dataframe()
>>> list(df.columns)[:2]
['label', 'int_1000.0']
train_test_split ¶
train_test_split(target_col, test_size=0.3, stratify=True, random_state=None)
Split into train/test FoodSpectrumSets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| target_col | str | Column in metadata providing the labels for the split. | required |
| test_size | float | Proportion of samples in the test split. | 0.3 |
| stratify | bool | If True, stratify by target_col. | True |
| random_state | int \| None | Seed for reproducibility. | None |

Returns:

| Type | Description |
|---|---|
| tuple[FoodSpectrumSet, FoodSpectrumSet] | Train and test datasets sharing the original wavenumber grid; metadata is reindexed. |

Raises:

| Type | Description |
|---|---|
| ValueError | If target_col is not present in metadata. |
Examples:
>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"label": [0, 1, 0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((4, 3)), wavenumbers=np.arange(3), metadata=meta)
>>> train, test = ds.train_test_split("label", test_size=0.5, random_state=0)
>>> len(train), len(test)
(2, 2)
validate ¶
validate()
Validate array shapes, wavenumber axis, metadata length, and modality.
Raises:

| Type | Description |
|---|---|
| ValueError | If shapes mismatch, wavenumbers are non-monotonic or have too few points (<3), metadata length mismatches the number of samples, the modality is invalid, or configured annotation columns are missing. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.array([1., 2., 3.]), metadata=pd.DataFrame())
>>> ds.validate() # does not raise
with_annotations ¶
with_annotations(*, label_col=None, group_col=None, batch_col=None)
Return a copy with updated label/group/batch annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| label_col | str \| None | Name of the label column in metadata. | None |
| group_col | str \| None | Name of the grouping column (e.g., folds). | None |
| batch_col | str \| None | Name of the batch identifier column. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| FoodSpectrumSet | 'FoodSpectrumSet' | Copy sharing data/wavenumbers but with annotation column names updated (metadata deep-copied). |
Examples:
>>> import numpy as np, pandas as pd
>>> meta = pd.DataFrame({"y": [0, 1]})
>>> ds = FoodSpectrumSet(x=np.ones((2, 2)), wavenumbers=np.arange(2), metadata=meta)
>>> ds2 = ds.with_annotations(label_col="y")
>>> ds2.label_col
'y'
Spectrum¶
Single spectrum data model with validation.
Represents a single spectroscopic measurement (axis, intensity, units, kind, and metadata) with provenance tracking.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | ndarray | X-axis (wavenumber/wavelength), shape (n_points,). | required |
| y | ndarray | Intensity values, shape (n_points,). | required |
| kind | Literal['raman', 'ftir', 'nir'] | Spectroscopy modality. | required |
| x_unit | Literal['cm-1', 'nm', 'um', '1/cm'] | Axis unit. | 'cm-1' |
| metadata | dict | Optional metadata (sample_id, instrument, etc.). | dict() |
Attributes:

| Name | Type | Description |
|---|---|---|
| x | ndarray | X-axis data. |
| y | ndarray | Y-axis data. |
| kind | str | Modality. |
| x_unit | str | Unit of the x-axis. |
| metadata | dict | Validated metadata. |
| config_hash | str | Hash of metadata for reproducibility tracking. |
config_hash property ¶
Hash of metadata for reproducibility tracking.
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | First 8 hex chars of SHA256 over the metadata JSON. |
n_points property ¶
Number of spectral points.
Returns:

| Name | Type | Description |
|---|---|---|
| int | int | Length of x. |
copy ¶
copy()
Return a deep copy of this spectrum.
Returns:

| Name | Type | Description |
|---|---|---|
| Spectrum | Spectrum | Independent copy. |
crop_wavenumber ¶
crop_wavenumber(x_min, x_max)
Crop spectrum to a wavenumber/wavelength range.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x_min | float | Minimum axis value. | required |
| x_max | float | Maximum axis value. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| Spectrum | Spectrum | New spectrum with cropped data. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the range contains no points. |
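Examples:
A minimal construction-and-crop sketch; the keyword names follow the Parameters table above (availability of Spectrum in the current namespace is assumed):
>>> import numpy as np
>>> sp = Spectrum(x=np.array([400.0, 500.0, 600.0]), y=np.array([1.0, 2.0, 3.0]), kind="raman")
>>> sp.n_points
3
>>> sp.crop_wavenumber(450, 650).n_points
2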
OutputBundle¶
Structured output packaging for analysis results.
Unified container for workflow outputs: metrics, diagnostics, artifacts, and provenance. Manages these outputs together and exports them to disk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run_record | RunRecord | Provenance record for the workflow. | required |
Attributes:

| Name | Type | Description |
|---|---|---|
| metrics | dict | Quantitative results (accuracy, F1, RMSE, etc.). |
| diagnostics | dict | Plots and tables (confusion matrix, feature importance, etc.). |
| artifacts | dict | Portable exports (model, preprocessor, etc.). |
| run_record | RunRecord | Provenance record. |
add_artifact ¶
add_artifact(name, value)
Add an artifact (model, preprocessor, scaler, etc.).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Artifact name (e.g., "model"). | required |
| value | Any | Artifact object. | required |
add_diagnostic ¶
add_diagnostic(name, value)
Add a diagnostic (plot, table, figure).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Diagnostic name (e.g., "confusion_matrix"). | required |
| value | Any | Diagnostic (Figure, ndarray, DataFrame, dict, str). | required |
add_metrics ¶
add_metrics(name, value)
Add a metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Metric name (e.g., "accuracy"). | required |
| value | Any | Metric value (number, array, DataFrame). | required |
export ¶
export(output_dir, formats=None)
Export the bundle to disk. The exported layout is:
- metrics.json
- diagnostics/ (plots as PNG/PDF, tables as CSV)
- artifacts/ (models as joblib/pickle)
- provenance.json (the run_record)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str \| Path | Output directory. | required |
| formats | list[str] \| None | Export formats. Default: ["json", "csv", "png", "joblib"]. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Output directory path. |
summary ¶
summary()
Generate a human-readable summary of outputs.
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Summary string. |
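Examples:
A minimal assemble-and-export sketch (the RunRecord arguments follow its Parameters table below; the dataset hash is illustrative):
>>> import tempfile
>>> record = RunRecord(workflow_name="oil_authentication", config={"model": "plsda"}, dataset_hash="abc123")
>>> bundle = OutputBundle(run_record=record)
>>> _ = bundle.add_metrics("accuracy", 0.95)
>>> _ = bundle.add_diagnostic("confusion_matrix", [[10, 0], [1, 9]])
>>> out_dir = bundle.export(tempfile.mkdtemp())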
RunRecord¶
Provenance tracking for reproducible analyses.
Immutable record of a workflow execution with full provenance.
Tracks configuration, dataset hash, step history, environment, timing, and user info.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| workflow_name | str | Name of the workflow (e.g., "oil_authentication"). | required |
| config | dict | Configuration parameters. | required |
| dataset_hash | str | SHA256 hash of the input dataset. | required |
| environment | dict \| None | Environment info (Python version, package versions, etc.). | dict() |
| step_records | list[dict] \| None | Step records: {"name", "hash", "timestamp", "error"}. | list() |
| user | str \| None | User who ran the workflow. | None |
| notes | str \| None | Freeform notes. | None |
Attributes:

| Name | Type | Description |
|---|---|---|
| workflow_name | str | Name of the workflow. |
| config | Dict[str, Any] | Configuration parameters. |
| config_hash | str | SHA256 hash of the config. |
| dataset_hash | str | SHA256 hash of the input dataset. |
| environment | Dict[str, Any] | Environment info (Python version, packages, etc.). |
| step_records | List[Dict[str, Any]] | Step records with name, hash, timestamp, error. |
| user | Optional[str] | User who ran the workflow. |
| notes | Optional[str] | Freeform notes. |
| timestamp | str | ISO 8601 timestamp (UTC). |
| run_id | str | Unique run identifier. |
combined_hash property ¶
Combined hash of config + dataset + all steps.
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | First 8 hex chars of the combined SHA256. |
config_hash property ¶
Hash of configuration.
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | First 8 hex chars of SHA256 over the config JSON. |
run_id property ¶
Unique run identifier combining workflow name and timestamp.
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Deterministic identifier for the run. |
add_output_path ¶
add_output_path(path)
Record an output path for the run (e.g., exported bundle location).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Output directory/file path. | required |
add_step ¶
add_step(name, step_hash, error=None, metadata=None)
Record a workflow step.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Step name (e.g., "baseline_correction"). | required |
| step_hash | str | Hash of the step output or configuration. | required |
| error | str \| None | Error message if the step failed. | None |
| metadata | dict \| None | Additional metadata for this step. | None |
from_json classmethod ¶
from_json(path)
Load a record from a JSON file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Input file path. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| RunRecord | RunRecord | Deserialized record. |
to_dict ¶
to_dict()
Serialize to a dictionary.
Returns:

| Name | Type | Description |
|---|---|---|
| dict | Dict[str, Any] | JSON-serializable representation of the run record. |
to_json ¶
to_json(path)
Write the record to a JSON file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Output file path. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Path to the written file. |
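Examples:
A minimal provenance roundtrip sketch (hash values are illustrative):
>>> import tempfile
>>> from pathlib import Path
>>> record = RunRecord(workflow_name="oil_authentication", config={"n_components": 5}, dataset_hash="abc123")
>>> _ = record.add_step("baseline_correction", step_hash="d4e5f6")
>>> len(record.step_records)
1
>>> path = Path(tempfile.mkdtemp()) / "provenance.json"
>>> _ = record.to_json(path)
>>> RunRecord.from_json(path).workflow_name
'oil_authentication'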
Advanced Data Structures¶
SpectralDataset¶
Extended dataset with preprocessing pipeline integration.
Matrix-form spectra with aligned metadata and instrument info.
Attributes:

| Name | Type | Description |
|---|---|---|
| wavenumbers | ndarray | Axis values (cm^-1). |
| spectra | ndarray | Intensities of shape (n_samples, n_points). |
| metadata | DataFrame | Sample annotations. |
| instrument_meta | dict | Instrument/protocol metadata. |
| logs | list[str] | Operation logs. |
| history | list[dict] | Preprocessing steps applied. |
copy ¶
copy()
Deep copy of dataset arrays, metadata, and meta fields.
Returns:

| Name | Type | Description |
|---|---|---|
| SpectralDataset | 'SpectralDataset' | Independent copy. |
from_hdf5 staticmethod ¶
from_hdf5(path, *, allow_future=False)
Load a dataset from HDF5, supporting the legacy layout.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | HDF5 file path. | required |
| allow_future | bool | If True, tolerate a newer minor schema version. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| SpectralDataset | 'SpectralDataset' | Reconstructed dataset. |

Raises:

| Type | Description |
|---|---|
| ImportError | If the required HDF5 backend is not installed. |
| ValueError | If the schema version is incompatible. |
preprocess ¶
preprocess(options)
Apply configured preprocessing to spectra.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| options | PreprocessingConfig | Pipeline options. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| SpectralDataset | 'SpectralDataset' | New dataset with processed spectra. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.), np.ones((2,3)), pd.DataFrame())
>>> cfg = PreprocessingConfig(baseline_method="none", smoothing_method="none", normalization="none", spike_removal=False)
>>> ds2 = ds.preprocess(cfg)
>>> ds2.spectra.shape
(2, 3)
save_hdf5 ¶
save_hdf5(path)
Save the dataset to HDF5 with schema versioning and legacy keys.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination file path. | required |

Raises:

| Type | Description |
|---|---|
| ImportError | If the required HDF5 backend is not installed. |
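Examples:
A save/load roundtrip sketch (assumes the optional HDF5 dependency is installed):
>>> import numpy as np, pandas as pd, tempfile
>>> tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
>>> ds = SpectralDataset(np.arange(3.0), np.ones((2, 3)), pd.DataFrame())
>>> _ = ds.save_hdf5(tmp.name)
>>> SpectralDataset.from_hdf5(tmp.name).spectra.shape
(2, 3)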
to_peaks ¶
to_peaks(peaks)
Extract peak features into a wide DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| peaks | Iterable[PeakDefinition] | Peak definitions with names, windows, and modes. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | One row per sample with one column per requested peak feature. |
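Examples:
A hedged sketch only; the PeakDefinition field names used here (name, window, mode) are assumptions inferred from the parameter description above, not confirmed API:
>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.0), np.ones((2, 3)), pd.DataFrame())
>>> peaks = [PeakDefinition(name="p1", window=(0.0, 2.0), mode="max")]  # hypothetical field names
>>> df = ds.to_peaks(peaks)  # wide frame: one row per sample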
HyperspectralDataset¶
Hyperspectral imaging (3D spatial-spectral) data container.
Bases: SpectralDataset
Hyperspectral cube flattened to spectra with spatial tracking.
The cube has shape (y, x, wn) but is stored as spectra with shape
(n_pixels, wn); spatial dimensions are given by shape_xy.
copy ¶
copy()
Deep copy of dataset arrays, metadata, and meta fields.
Returns:

| Name | Type | Description |
|---|---|---|
| SpectralDataset | 'SpectralDataset' | Independent copy. |
from_cube staticmethod ¶
from_cube(cube, wavenumbers, metadata, instrument_meta=None)
Construct a flattened dataset from a hyperspectral cube.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cube | ndarray | Array of shape (y, x, wn). | required |
| wavenumbers | ndarray | Axis values. | required |
| metadata | DataFrame | Sample/site metadata. | required |
| instrument_meta | dict \| None | Instrument details. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| HyperspectralDataset | 'HyperspectralDataset' | Flattened dataset with shape_xy recorded. |
from_hdf5 staticmethod ¶
from_hdf5(path, *, allow_future=False)
Load a hyperspectral dataset from HDF5, supporting the legacy layout.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | HDF5 path. | required |
| allow_future | bool | If True, tolerate a newer minor schema version. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| HyperspectralDataset | 'HyperspectralDataset' | Reconstructed dataset with optional ROI assets. |

Raises:

| Type | Description |
|---|---|
| ImportError | If the required HDF5 backend is not installed. |
| ValueError | If the schema version is incompatible. |
preprocess ¶
preprocess(options)
Apply configured preprocessing to spectra.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| options | PreprocessingConfig | Pipeline options. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| SpectralDataset | 'SpectralDataset' | New dataset with processed spectra. |
Examples:
>>> import numpy as np, pandas as pd
>>> ds = SpectralDataset(np.arange(3.), np.ones((2,3)), pd.DataFrame())
>>> cfg = PreprocessingConfig(baseline_method="none", smoothing_method="none", normalization="none", spike_removal=False)
>>> ds2 = ds.preprocess(cfg)
>>> ds2.spectra.shape
(2, 3)
roi_spectrum ¶
roi_spectrum(mask)
Average spectrum from a binary ROI mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mask | ndarray | Boolean/int array of shape shape_xy. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| SpectralDataset | SpectralDataset | Single-row dataset with the average ROI spectrum. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the mask shape does not match shape_xy. |
save_hdf5 ¶
save_hdf5(path)
Save the hyperspectral cube to HDF5, including ROI artifacts if present.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination file path. | required |

Raises:

| Type | Description |
|---|---|
| ImportError | If the required HDF5 backend is not installed. |
segment ¶
segment(method='kmeans', n_clusters=3)
Segment pixels into clusters using k-means, hierarchical clustering, or NMF.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| method | str | One of "kmeans", "hierarchical", or "nmf". | 'kmeans' |
| n_clusters | int | Number of clusters/components. | 3 |

Returns:

| Type | Description |
|---|---|
| ndarray | Label map of shape shape_xy. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the method is unknown. |
to_cube ¶
to_cube()
Reshape spectra back to a (y, x, wn) cube using shape_xy.
Returns:

| Type | Description |
|---|---|
| ndarray | Hyperspectral cube. |
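Examples:
A flatten/restore sketch (an empty metadata frame is assumed to be acceptable, as in the SpectralDataset examples above):
>>> import numpy as np, pandas as pd
>>> cube = np.ones((4, 5, 3))
>>> hsd = HyperspectralDataset.from_cube(cube, np.arange(3.0), pd.DataFrame())
>>> hsd.spectra.shape
(20, 3)
>>> hsd.to_cube().shape
(4, 5, 3)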
to_peaks ¶
to_peaks(peaks)
Extract peak features into a wide DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| peaks | Iterable[PeakDefinition] | Peak definitions with names, windows, and modes. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | One row per sample with one column per requested peak feature. |
Helper Functions¶
Conversion Utilities¶
to_sklearn ¶
Return (X, y) arrays suitable for scikit-learn.
Parameters¶
ds : FoodSpectrumSet
Dataset to convert.
label_col : Optional[str]
Column to use for labels; if None, uses ds.label_col if available.
Returns¶
(X, y)
X has shape (n_samples, n_features); y is None if the label column is not found.
Examples¶
>>> import numpy as np, pandas as pd
>>> ds = FoodSpectrumSet(x=np.ones((2, 3)), wavenumbers=np.arange(3), metadata=pd.DataFrame({"label": [0, 1]}))
>>> X, y = to_sklearn(ds)
>>> X.shape, y.tolist()
((2, 3), [0, 1])
from_sklearn ¶
Create a FoodSpectrumSet from scikit-learn style inputs.
Parameters¶
X : np.ndarray
Feature matrix of shape (n_samples, n_wavenumbers).
y : Optional[Sequence]
Optional labels aligned to rows in X.
wavenumbers : Sequence[float]
Spectral axis values; if empty, uses 0..n_features-1.
modality : Modality
Modality tag (e.g., 'raman').
labels_name : str
Name of the label column in metadata if y is provided.
Returns¶
FoodSpectrumSet
Dataset constructed from the matrix and optional labels.
Examples¶
>>> import numpy as np
>>> ds = from_sklearn(np.ones((2, 4)), y=[0, 1], wavenumbers=[1.0, 2.0, 3.0, 4.0])
>>> ds.wavenumbers.tolist()
[1.0, 2.0, 3.0, 4.0]
See Also¶
- IO Module - Loading and saving spectral data
- Preprocessing - Data cleaning methods
- Examples - Practical usage examples