Preprocessing: Feature Extraction (Peaks, Bands, Ratios, Fingerprints)¶
Feature extraction turns preprocessed spectra into quantitative descriptors for chemometrics and ML. This chapter covers peaks, band integration, ratios, and fingerprint similarity.
What?¶
Extract peaks, bands, and ratios from preprocessed spectra to create interpretable features. Inputs: preprocessed spectra + wavenumbers + expected peaks/bands. Outputs: feature tables (heights, areas, ratios), plus peak/ratio summaries by group.
Why?¶
Vibrational bands encode chemistry (unsaturation, carbonyls, proteins). Peaks/ratios normalize scaling, highlight relative composition, and feed PCA/PLS/ML with interpretable variables rooted in Raman/FTIR physics.
When?¶
Use when you need chemically meaningful features for classification, regression, or QC; when literature bands are known.
Limitations: noisy/uncorrected baselines bias heights; overlapping peaks reduce interpretability; poorly chosen bands/ratios risk “data dredging.”
Where? (pipeline)¶
Upstream: baseline → smoothing → normalization → crop.
Downstream: stats (ANOVA/Games–Howell/Kruskal), effect sizes, ML/PLS, reporting tables.
flowchart LR
A[Preprocessed spectra] --> B[Peaks/Bands/Ratios]
B --> C[Peak/ratio summaries]
C --> D[Stats + ML + Plots]
Why this matters in food spectroscopy¶
- Peaks and bands correspond to chemical groups; ratios and integrals summarize composition changes (e.g., unsaturation vs carbonyl, oxidation markers).
- Quantitative features feed PCA/PLS/ML; poor feature choices limit interpretability and performance.
1. Peaks¶
- What: Locate local maxima near expected wavenumbers; measure height and area.
- Why: Specific functional groups (e.g., C=C at ~1655 cm⁻¹, C=O at ~1742 cm⁻¹) are diagnostic for oils; amide bands for proteins.
- How (user-defined): You specify the peaks you care about via
expected_peaksand atolerance(window, e.g., ±6–10 cm⁻¹). The extractor searches only within that window to find the nearest local maximum, which accounts for small shifts from calibration, temperature, matrix effects, or preprocessing. Heights/areas are then computed in that window. - Pitfalls: Overlapping peaks; noisy derivatives; baseline errors can bias heights—ensure preprocessing is solid.
2. Band integration¶
- What: Integrate intensity over defined ranges to capture broader features.
- Why: Useful for broad NIR bands or overlapping FTIR features; robust to small shifts.
- How: Define
(label, min_wn, max_wn); integrate numerically. - Pitfalls: Include baseline if not corrected; window choice matters.
3. Ratios¶
- What: Ratios of peak heights/areas or band integrals (e.g., 1655/1742 for unsaturation vs carbonyl).
- Why: Normalize out scaling, emphasize relative chemistry; common in authenticity and degradation studies.
- How: Specify numerator/denominator feature names; handle zeros/NaNs carefully.
- Pitfalls: Ratios amplify noise when denominators are small; ensure peaks are present.
Peak and ratio choices for food spectroscopy¶
- Chemical meaning first: anchor peaks/bands to known vibrational assignments from literature.
- Illustrative oil examples (Raman):
- Peaks: ~1655 cm⁻¹ (C=C stretch, unsaturation), ~1440 cm⁻¹ (CH₂ bend), ~1265 cm⁻¹ (cis =C–H bend), ~717 cm⁻¹ (C–C stretch).
- Ratios: 1655/1440 (unsaturation vs saturation), 1655/717 (degree of unsaturation), 1265/1440 (cis content vs saturated backbone). Higher unsaturation ratios generally indicate more double bonds.
- Identification: detect local maxima near user-specified positions ± tolerance (e.g., 6–10 cm⁻¹) to accommodate small shifts, or integrate fixed windows (e.g., 1650–1665 cm⁻¹) for robustness. Users control the expected peaks and tolerances; the extractor confines the search to that window.
- Interpretation: track ratio changes across classes or treatments (e.g., heating decreases 1655/1440 as unsaturation drops). Use either the example oils dataset or a synthetic Raman spectrum from
generate_synthetic_raman_spectrumto illustrate the labeled peaks and ratios.
Assessing peak stability across spectra/groups¶
- Goal: quantify how stable peak positions and intensities are across replicates or groups (e.g., batches, treatments).
- Procedure: run
PeakFeatureExtractoron all spectra; compute mean/SD (or CV) of detected peak positions and intensities per group using pandas; optionally run ANOVA/Kruskal on intensities to test group differences. - Interpretation: low SD in position suggests good spectral alignment; low CV in intensity implies stable measurement/preprocessing; large shifts or CV may indicate instrument drift, poor preprocessing, or real chemical changes.
- Tools: combine peak outputs with stats utilities (e.g.,
run_anova,run_kruskal_wallis) for intensity comparison across groups.
4. Fingerprint similarity¶
- What: Cosine/correlation similarity between spectra or feature vectors.
- Why: Library search, QC thresholds, clustering; supports novelty detection or matching to references.
- How: Use cosine/correlation matrices; interpret high similarity as close matches.
- Pitfalls: Sensitive to preprocessing differences; ensure consistent normalization and cropping.
5. Example (high level)¶
from foodspec.features.peaks import PeakFeatureExtractor
from foodspec.features.ratios import RatioFeatureGenerator
extractor = PeakFeatureExtractor(expected_peaks=[1655.0, 1742.0], tolerance=8.0)
peak_feats = extractor.fit_transform(X_proc, wavenumbers=wn)
ratio_gen = RatioFeatureGenerator({"ratio_1655_1742": ("peak_1655.0_height", "peak_1742.0_height")})
ratios = ratio_gen.transform(peak_feats)
6. Visuals to include¶
- Annotated spectrum showing expected peaks and integration windows.

Figure: Synthetic spectrum with peaks marked and a shaded band area for integration (e.g., ratio or area feature).
- Histogram of a key ratio by class (oil_type) or time point.
- Similarity matrix heatmap for library search/QC.
How to choose bands and ratios (decision mini-guide)¶
Guidance, not a rigid rule. Always validate against your matrix and literature.
Common questions → suggested bands/ratios - Distinguishing oils/fat composition: - Bands: ~1655 cm⁻¹ (C=C stretch, unsaturation), ~1742 cm⁻¹ (C=O ester), ~1450 cm⁻¹ (CH₂ bend). - Ratios: 1655/1742 (unsaturation), 1450/1655 (saturation vs unsaturation balance). - Why: Unsaturation and carbonyl content differ by oil type. See Oil authentication. - Tracking oxidation/degradation (heating): - Bands: decrease at ~1655 cm⁻¹ (C=C), increase near ~1710–1740 cm⁻¹ (oxidation carbonyls). - Ratios: 1655/1740 (unsaturation loss), 1450/1655 (relative saturation), optional 3010/2850 (CH stretch region if available). - Why: Unsaturation decreases; carbonyls grow with oxidation. See Heating quality monitoring. - Adulteration with known adulterant: - Bands: choose unique marker of adulterant (e.g., specific carbonyl or aromatic band). - Ratios: marker band / stable reference band (e.g., 1742/1655 if adulterant rich in carbonyl). - Why: Highlight component unique to adulterant; validate against pure references. - Batch QC comparison: - Bands: stable backbone bands (e.g., 1450 cm⁻¹ CH₂) vs marker band of interest. - Ratios: marker/reference to detect drift; use control limits. - Why: Quick screening for drift/outliers in production.
Code sketch (ratios + boxplot)
import pandas as pd
import seaborn as sns
from foodspec.data.loader import load_example_oils
from foodspec.features.ratios import compute_ratios
fs = load_example_oils()
# assumes peak heights already extracted into metadata/features
ratio_def = {"unsat_ratio": ("peak_1655.0_height", "peak_1742.0_height")}
ratio_df = compute_ratios(fs.metadata, ratio_def)
ratio_df["oil_type"] = fs.metadata["oil_type"].values
sns.boxplot(data=ratio_df, x="oil_type", y="unsat_ratio")
Summarizing Peaks and Ratios Across Groups¶
- Peak stats: mean ± std of peak position and intensity indicate alignment/stability (position) and consistency/variation (intensity). Shifts may reflect chemistry (e.g., oxidation) or preprocessing/instrument drift.
- Ratio tables: mean ± std (and n) of key ratios by group (oil_type, batch, treatment) give a compact view for reports/supplementary tables.
Example (peak stats + ratio tables)¶
import pandas as pd
from foodspec.features import PeakFeatureExtractor, RatioFeatureGenerator, compute_peak_stats, compute_ratio_table
from foodspec.data.loader import load_example_oils
fs = load_example_oils()
extractor = PeakFeatureExtractor(expected_peaks=[1655.0, 1742.0], tolerance=8.0)
peak_df = extractor.fit_transform(fs.x, wavenumbers=fs.wavenumbers)
peak_df["spectrum_id"] = fs.metadata.index
ratio_gen = RatioFeatureGenerator({"ratio_1655_1742": ("peak_1655.0_height", "peak_1742.0_height")})
ratio_df = ratio_gen.transform(peak_df)
peak_summary = compute_peak_stats(peak_df, metadata=fs.metadata, group_keys=["oil_type"])
ratio_summary = compute_ratio_table(ratio_df.join(fs.metadata[["oil_type"]]), metadata=fs.metadata, group_keys=["oil_type"])
print(peak_summary.head())
print(ratio_summary.head())
Cross-links - Vibrational assignments: Spectroscopy basics. - Workflows using these ratios: Oil authentication, Heating quality monitoring. - Chemometric context: Chemometrics guide.
Summary¶
- Peaks capture localized functional groups; bands capture broader features; ratios emphasize relative chemistry; fingerprints support similarity/QC tasks.
- Good preprocessing (baseline, normalization) is critical before feature extraction.
- Choose features to match the scientific question: authentication (peaks/ratios), mixtures (areas, similarity), degradation (time-trend ratios).
When Results Cannot Be Trusted¶
⚠️ Red flags for feature extraction validity:
- Peak locations assumed fixed across samples (peak at 1742 cm⁻¹ in all oils; no variation checking)
- Peak positions shift with sample composition, temperature, and state
- Fixed peak windows may miss or capture wrong peaks
-
Fix: Estimate peak positions per sample; allow small range; visualize peaks in all samples
-
Features extracted from noisy regions (very low SNR peaks treated as signals)
- Noisy features have high fold-to-fold variability
- Unstable for modeling
-
Fix: Check SNR per feature; exclude features with SNR < 10; use multiple replicates
-
Ratio denominators close to zero (ratio A/B where B is tiny; ratio inflated by noise)
- Division by small numbers amplifies denominator noise
- Ratio unstable
-
Fix: Ensure denominators > 10% of range; avoid ratios of weak features; use robust statistics
-
Features selected post-hoc to separate groups (choosing peaks after seeing group differences)
- Data-dependent feature selection overfits; features don't generalize
- Reproducibility compromised
-
Fix: Select features a priori based on chemistry or previous literature; or use statistical selection within cross-validation
-
Feature scaling/normalization not reported (unclear whether features are absolute intensities, ratios, or scaled)
- Downstream models depend on feature scale
- Reproducibility requires documentation
-
Fix: Document feature definitions and units; report any scaling applied
-
Overlapping peaks not resolved (assuming single peak in region where two peaks overlap)
- Feature represents mixture of compounds
- Chemical interpretation ambiguous
-
Fix: Use deconvolution or higher-resolution spectra; use multiple features from overlapping regions
-
Features correlated but both used in model (multicollinearity undetected)
- Correlated features inflate coefficients; model unstable
- Importance scores unreliable
-
Fix: Check feature correlations; remove or combine correlated features; use regularization (PCA, Lasso)
-
Feature robustness not tested across batches/instruments/times
- Features may be batch-dependent artifacts, not true signals
- Models trained on feature artifacts don't generalize
- Fix: Validate features across multiple batches/instruments; check feature stability over time
When to Use¶
Use peak extraction when:
- Known spectral markers: Literature identifies specific peaks for your application (e.g., 1655 cm⁻¹ for unsaturation)
- Interpretable features needed: Regulatory or publication requirements for chemically meaningful descriptors
- Targeted analysis: Specific functional groups of interest (carbonyl, unsaturation, protein content)
- QC monitoring: Tracking specific peak heights/ratios as process indicators
- Small feature sets: Need compact representation (5-20 features) instead of full spectra
Use band integration when:
- Broad features: NIR overtones or combination bands spanning 50+ cm⁻¹
- Overlapping peaks: Multiple unresolved peaks in region of interest
- Robust quantification: Area measurements less sensitive to peak position shifts than heights
Use ratios when:
- Normalization needed: Removing instrument/path length variations
- Relative composition: Tracking unsaturation/saturation balance, oxidation markers
- Authenticity testing: Reference ratios characteristic of genuine samples
- Simple interpretability: Single ratio metric for decisions (QC pass/fail)
When NOT to Use (Common Failure Modes)¶
Avoid peak extraction when:
- Unknown chemistry: No literature guidance on relevant peaks; exploratory analysis needed
- Dense overlapping spectra: Too many overlapping peaks; whole-spectrum methods (PCA) more appropriate
- High baseline curvature: Uncorrected baselines bias peak heights
- Peak-free regions: Expecting peaks in regions with no real signal
- Very low resolution: Spectral resolution insufficient to resolve expected peaks
Avoid band integration when:
- Sharp narrow peaks: Integration includes too much baseline/noise relative to peak
- Variable baseline: Uncorrected baseline contributes significantly to integrated area
- Well-separated peaks: Peak heights sufficient; integration adds complexity
Avoid ratios when:
- Weak denominator: Denominator peak near noise floor; ratio unstable
- Absolute quantification needed: Ratios remove calibration to concentration
- Denominator varies independently: Denominator not truly stable reference
- Complex interpretation: Multiple competing chemical effects make ratio ambiguous
Recommended Defaults¶
For peak detection (Raman oils):
from foodspec.features import detect_peaks
# Common oil peaks
peaks = detect_peaks(
X,
wavenumbers=wn,
expected_peaks=[1655, 1742, 1440, 1265, 717],
tolerance=10, # ±10 cm⁻¹ search window
min_snr=5 # Minimum signal-to-noise ratio
)
tolerance=10: Accounts for instrument calibration and matrix effects
- min_snr=5: Ensures detected peaks above noise floor
For band integration:
from foodspec.features import compute_band_features
# Integrate broad regions
bands = compute_band_features(
X,
wavenumbers=wn,
band_definitions=[
("unsaturation", 1650, 1665),
("carbonyl", 1735, 1750),
("CH2_bend", 1435, 1450)
]
)
For ratio calculation:
from foodspec.features import RatioQualityEngine
# Define chemistry-based ratios
rq_engine = RatioQualityEngine(
ratios=[
("unsat_ratio", "peak_1655", "peak_1742"), # Unsaturation
("oxidation", "peak_1742", "peak_1440") # Carbonyl/backbone
],
min_denominator=0.01 # Avoid division by near-zero
)
results = rq_engine.compute(peak_features)
min_denominator=0.01: Guards against unstable ratios
For fingerprint similarity:
from foodspec.features import similarity_search
# Compare spectra to reference library
matches = similarity_search(
query=X_unknown,
library=X_reference,
metric="cosine",
top_k=5
)
metric="cosine": Normalized similarity, robust to scaling
See Also¶
API Reference:
- detect_peaks - Automated peak finding
- compute_band_features - Band integration
- RatioQualityEngine - Ratio calculation framework
- similarity_search - Spectral matching
Related Methods:
- Baseline Correction - Essential preprocessing before peak extraction
- Normalization & Smoothing - Stabilize peak measurements
- Derivatives - Enhanced peak detection
- PCA - Alternative to manual feature selection
Examples:
- Oil Authentication Quickstart - Peak-based classification
- Heating Quality Monitoring - Ratio tracking over time
- Mixture Analysis - Band integration for concentration