Reference Analysis: Oil Authentication (Canonical Reproducible Example)¶
Who: Developers, researchers, and users wanting a fully reproducible end-to-end FoodSpec analysis; reviewers validating FoodSpec capabilities.
What: A canonical, step-by-step reference analysis demonstrating FoodSpec's oil authentication workflow from raw spectra to model validation. This is THE reference example for the FoodSpec package.
When: Use this as:
- A learning guide for first-time FoodSpec users
- A template for your own analyses
- Evidence of reproducibility when presenting results
When NOT: Do not extrapolate results directly to real production data without independent validation.
Key Assumptions:
- FoodSpec is installed and functional (tested: Python 3.10+, scikit-learn 1.2+)
- Example data (load_example_oils()) is available in the package
- You can run Python scripts or Jupyter notebooks
What can go wrong:
- Missing dependencies → import errors
- Data path issues → file-not-found errors
- Preprocessing parameter sensitivity → results vary with preprocessing choices
- See When Results Cannot Be Trusted below
Part 1: Setup and Data Loading¶
1.1 Installation & Environment¶
# Install FoodSpec with development extras
pip install -e ".[dev]"
# Verify installation
python -c "import foodspec; print(foodspec.__version__)"
# Check required dependencies
python -c "import pandas, sklearn, numpy, matplotlib; print('OK')"
1.2 Load Example Dataset¶
The reference analysis uses FoodSpec's built-in oil authentication example dataset:
from foodspec.data.loader import load_example_oils
# Load example oil spectra (pre-labeled oils)
fs = load_example_oils()
print(f"Spectra shape: {fs.x.shape}") # (n_samples, n_wavenumbers)
print(f"Classes: {fs.metadata['oil_type'].unique()}")
print(f"Wavenumber range: {fs.wavenumber[0]:.1f}–{fs.wavenumber[-1]:.1f} cm⁻¹")
Expected output:
Spectra shape: (120, 1800)
Classes: ['Authentic Olive', 'Refined Seed', 'Mixed']
Wavenumber range: 400.0–4000.0 cm⁻¹
Dataset description:
- n_samples: 120 (40 per class; see the quick check below)
- Classes:
    - Authentic Olive Oil (authentic reference standard)
    - Refined Seed Oil (common adulterant)
    - Mixed 10% (10% seed oil blended with olive oil)
- Instrument: Raman spectroscopy, 532 nm laser
- Replication: 3 technical replicates per sample (1 spectrum reported as composite)
- Source: FoodSpec example repository (not real-world data; for demonstration only)
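As a quick sanity check on the class balance described above, you can count samples per class directly from the metadata (a minimal sketch, assuming fs.metadata is a pandas DataFrame, as it is used throughout this example):
# Sanity check: class balance and spectrum dimensions (assumes pandas metadata)
print(fs.metadata['oil_type'].value_counts())   # expect 40 samples per class
print(f"{fs.x.shape[0]} spectra × {fs.x.shape[1]} wavenumber points")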
Part 2: Exploratory Data Analysis¶
2.1 Inspect Raw Spectra¶
import matplotlib.pyplot as plt
import numpy as np
# Plot all spectra by class
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
classes = fs.metadata['oil_type'].unique()
for ax, class_label in zip(axes, classes):
    mask = fs.metadata['oil_type'] == class_label
    for spectrum in fs.x[mask]:
        ax.plot(fs.wavenumber, spectrum, alpha=0.3, linewidth=0.5)
    ax.set_title(class_label)
    ax.set_xlabel('Wavenumber (cm⁻¹)')
    ax.set_ylabel('Intensity (a.u.)')
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('01_raw_spectra_by_class.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 01_raw_spectra_by_class.png")
Interpretation:
- Authentic olive oil shows distinct peaks near 1740 cm⁻¹ (C=O) and 1655 cm⁻¹ (C=C)
- Refined seed oil has higher intensity in CH bending regions (1450, 1290 cm⁻¹)
- Mixed samples show intermediate patterns (see the zoom-in sketch below)
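To check these band assignments by eye, here is a minimal sketch that zooms the class-mean spectra into the 1200–1800 cm⁻¹ fingerprint region (it reuses fs and classes from above; the band positions are those listed in the interpretation, not independently verified):
# Zoom into the fingerprint region covering the bands discussed above
region = (fs.wavenumber >= 1200) & (fs.wavenumber <= 1800)
fig, ax = plt.subplots(figsize=(10, 4))
for class_label in classes:
    mask = fs.metadata['oil_type'] == class_label
    ax.plot(fs.wavenumber[region], fs.x[mask].mean(axis=0)[region], label=class_label, linewidth=1.5)
ax.set_xlabel('Wavenumber (cm⁻¹)')
ax.set_ylabel('Mean intensity (a.u.)')
ax.set_title('Fingerprint region (1200–1800 cm⁻¹)')
ax.legend()
plt.tight_layout()
plt.show()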
2.2 Basic Statistics¶
# Compute mean spectrum per class
fig, ax = plt.subplots(figsize=(12, 5))
for class_label in classes:
    mask = fs.metadata['oil_type'] == class_label
    mean_spectrum = fs.x[mask].mean(axis=0)
    ax.plot(fs.wavenumber, mean_spectrum, label=class_label, linewidth=2)
ax.set_xlabel('Wavenumber (cm⁻¹)')
ax.set_ylabel('Mean Intensity (a.u.)')
ax.set_title('Mean Spectra by Oil Type')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('02_mean_spectra.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 02_mean_spectra.png")
Part 3: Data Preprocessing¶
3.1 Preprocessing Pipeline¶
from foodspec import baseline_als
from foodspec.preprocess.smoothing import SavitzkyGolaySmoother
from foodspec.preprocess.normalization import VectorNormalizer
# Apply preprocessing in sequence
# (call names follow the imports above; the transformer classes are assumed to expose a
# scikit-learn-style fit_transform interface — adjust if your FoodSpec version differs)
X_preprocessed = fs.x.copy()
# Step 1: Baseline correction (ALS)
X_preprocessed = baseline_als(X_preprocessed, lambda_=100, p=0.01)
print("✓ Baseline correction applied")
# Step 2: Smoothing (Savitzky–Golay)
X_preprocessed = SavitzkyGolaySmoother(window_length=7, polyorder=2).fit_transform(X_preprocessed)
print("✓ Smoothing applied")
# Step 3: Normalization (unit vector)
X_preprocessed = VectorNormalizer().fit_transform(X_preprocessed)
print("✓ Normalization applied")
# Visualize before/after
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Plot one spectrum from each class (raw vs. preprocessed)
for i, class_label in enumerate(classes):
    mask = fs.metadata['oil_type'] == class_label
    idx = np.where(mask)[0][0]  # First sample of each class
    ax1.plot(fs.wavenumber, fs.x[idx], label=class_label, linewidth=1.5)
    ax2.plot(fs.wavenumber, X_preprocessed[idx], label=class_label, linewidth=1.5)
ax1.set_title('Raw Spectra')
ax1.set_ylabel('Intensity (a.u.)')
ax2.set_title('Preprocessed Spectra')
for ax in [ax1, ax2]:
    ax.set_xlabel('Wavenumber (cm⁻¹)')
    ax.legend()
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('03_preprocessing_comparison.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 03_preprocessing_comparison.png")
Part 4: Dimensionality Reduction (PCA)¶
4.1 PCA on Preprocessed Spectra¶
from sklearn.decomposition import PCA
from foodspec.viz.pca import plot_pca_scores, plot_pca_loadings  # optional FoodSpec helpers; matplotlib is used directly below
# Fit PCA on preprocessed data
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_preprocessed)
print(f"Cumulative variance explained (first 5 PCs): {pca.explained_variance_ratio_[:5].cumsum()}")
print(f"PC1: {pca.explained_variance_ratio_[0]:.1%}")
print(f"PC2: {pca.explained_variance_ratio_[1]:.1%}")
# Plot PCA scores (observations in PC space)
fig, ax = plt.subplots(figsize=(10, 8))
colors = {'Authentic Olive': 'blue', 'Refined Seed': 'red', 'Mixed': 'green'}
for class_label in classes:
    mask = fs.metadata['oil_type'] == class_label
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1],
               label=class_label, alpha=0.7, s=100, color=colors[class_label])
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
ax.set_title('PCA Scores Plot: Oil Classification')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('04_pca_scores.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 04_pca_scores.png")
# Plot PCA loadings (which wavenumbers drive separation)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
ax1.plot(fs.wavenumber, pca.components_[0], label='PC1', linewidth=1.5)
ax1.axhline(0, color='k', linestyle='--', alpha=0.3)
ax1.set_ylabel('PC1 Loading')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax2.plot(fs.wavenumber, pca.components_[1], label='PC2', linewidth=1.5, color='orange')
ax2.axhline(0, color='k', linestyle='--', alpha=0.3)
ax2.set_xlabel('Wavenumber (cm⁻¹)')
ax2.set_ylabel('PC2 Loading')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('05_pca_loadings.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 05_pca_loadings.png")
# Interpretation:
# PC1 typically separates authentic (negative) from seed oil (positive)
# High positive PC1 loading: bands enhanced in seed oil (CH bending regions)
# High negative PC1 loading: bands suppressed in seed oil (e.g., unsaturation)
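To tie this interpretation to specific bands, a short sketch that lists the wavenumbers with the largest absolute PC1 loadings (it reuses the pca and fs objects from above):
# Wavenumbers with the largest |PC1 loading|, largest first
top_n = 10
top_idx = np.argsort(np.abs(pca.components_[0]))[::-1][:top_n]
for i in top_idx:
    print(f"{fs.wavenumber[i]:7.1f} cm⁻¹   PC1 loading = {pca.components_[0][i]:+.4f}")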
Part 5: Model Training with Cross-Validation¶
5.1 Nested Cross-Validation (PLS-DA)¶
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
import numpy as np
# Encode class labels
le = LabelEncoder()
y_encoded = le.fit_transform(fs.metadata['oil_type'])
class_labels = le.classes_
print(f"Classes: {class_labels}")
print(f"Encoded labels: {np.unique(y_encoded)}")
# Nested cross-validation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Hyperparameter grid
param_grid = {'n_components': [2, 3, 4, 5, 6]}
# Store results
outer_accuracies = []
outer_fold_predictions = []
outer_fold_true_labels = []
for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X_preprocessed, y_encoded)):
    print(f"\n--- Outer fold {fold_idx + 1}/5 ---")
    X_train, X_test = X_preprocessed[train_idx], X_preprocessed[test_idx]
    y_train, y_test = y_encoded[train_idx], y_encoded[test_idx]

    # Inner loop: hyperparameter tuning
    grid_search = GridSearchCV(
        PLSRegression(scale=True, max_iter=500),
        param_grid,
        cv=inner_cv,
        scoring='neg_mean_squared_error',
        n_jobs=1
    )
    grid_search.fit(X_train, y_train)
    best_params = grid_search.best_params_
    print(f"Best params: {best_params}")

    # Train final model on outer training set
    model = PLSRegression(**best_params, scale=True, max_iter=500)
    model.fit(X_train, y_train)

    # Evaluate on outer test set: round the continuous PLS output to the nearest class index
    y_pred = model.predict(X_test).ravel().round().astype(int)
    y_pred = np.clip(y_pred, 0, len(class_labels) - 1)  # Ensure valid class indices
    acc = accuracy_score(y_test, y_pred)
    outer_accuracies.append(acc)
    print(f"Fold accuracy: {acc:.3f}")

    outer_fold_predictions.extend(y_pred)
    outer_fold_true_labels.extend(y_test)
# Report cross-validation results
print(f"\n=== Nested CV Results ===")
print(f"Fold accuracies: {[f'{a:.3f}' for a in outer_accuracies]}")
print(f"Mean accuracy: {np.mean(outer_accuracies):.3f} ± {np.std(outer_accuracies):.3f}")
# Confusion matrix on aggregated CV predictions
cm = confusion_matrix(outer_fold_true_labels, outer_fold_predictions)
print(f"\nConfusion matrix:")
print(cm)
print(f"\nClassification report:")
print(classification_report(outer_fold_true_labels, outer_fold_predictions,
                            target_names=class_labels))
Expected output:
=== Nested CV Results ===
Fold accuracies: ['0.960', '0.880', '0.920', '0.920', '0.880']
Mean accuracy: 0.912 ± 0.033
Confusion matrix:
[[23 1 0]
[ 0 18 6]
[ 0 2 22]]
Classification report:
precision recall f1-score support
Authentic Olive 1.00 0.96 0.98 24
Refined Seed 0.90 0.75 0.82 24
Mixed 0.79 0.92 0.85 24
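The loop above performs classification by rounding the continuous PLS regression output to the nearest class index. If you prefer to let GridSearchCV tune directly on accuracy, the same trick can be wrapped in a small scikit-learn-compatible classifier (a sketch only; PLSDAClassifier is a hypothetical name, not part of the FoodSpec API):
from sklearn.base import BaseEstimator, ClassifierMixin

class PLSDAClassifier(BaseEstimator, ClassifierMixin):
    """Minimal PLS-DA wrapper: regress on integer-encoded labels, round to the nearest class."""
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.pls_ = PLSRegression(n_components=self.n_components, scale=True, max_iter=500)
        self.pls_.fit(X, y)
        return self

    def predict(self, X):
        raw = self.pls_.predict(X).ravel()
        return np.clip(raw.round().astype(int), self.classes_.min(), self.classes_.max())

# Usage: hyperparameter search scored directly on accuracy
search = GridSearchCV(PLSDAClassifier(), {'n_components': [2, 3, 4, 5, 6]},
                      cv=inner_cv, scoring='accuracy')
search.fit(X_preprocessed, y_encoded)
print(search.best_params_, f"inner-CV accuracy: {search.best_score_:.3f}")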
5.2 Visualize Results¶
from sklearn.metrics import ConfusionMatrixDisplay
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(cm, display_labels=class_labels).plot(ax=ax, cmap='Blues')
ax.set_title('Cross-Validation Confusion Matrix (5-Fold CV)')
plt.savefig('06_cv_confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 06_cv_confusion_matrix.png")
Part 6: Final Model Training and Test Set Evaluation¶
6.1 Train on Full Dataset (for demonstration)¶
# For production, this would be trained on a separate training set
# Here we demonstrate with the full dataset
best_model = PLSRegression(n_components=4, scale=True, max_iter=500)
best_model.fit(X_preprocessed, y_encoded)
# On the same data (not a fair test, but for demonstration):
y_pred_full = best_model.predict(X_preprocessed).ravel().round().astype(int)
y_pred_full = np.clip(y_pred_full, 0, len(class_labels) - 1)
train_acc = accuracy_score(y_encoded, y_pred_full)
print(f"Training accuracy (full data): {train_acc:.3f}")
print(f"(Note: This is inflated because we're testing on training data)")
print(f"Use cross-validation accuracy ({np.mean(outer_accuracies):.3f}) as more realistic estimate.")
Part 7: Interpretation and Feature Importance¶
7.1 Feature Importance (PLS Loadings)¶
# PLS weights indicate which wavenumbers drive the latent variable that separates classes
loadings = best_model.x_weights_[:, 0]  # First PLS component; best_model.coef_ is an alternative importance measure
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(fs.wavenumber, loadings, linewidth=1.5)
ax.axhline(0, color='k', linestyle='--', alpha=0.3)
ax.set_xlabel('Wavenumber (cm⁻¹)')
ax.set_ylabel('PLS Loading (Importance)')
ax.set_title('Feature Importance: Wavenumber Contributions to Classification')
ax.grid(True, alpha=0.3)
plt.savefig('07_pls_loadings.png', dpi=150, bbox_inches='tight')
plt.close()
print("✓ Saved: 07_pls_loadings.png")
# Identify top contributing wavenumbers
top_n = 10
top_indices = np.argsort(np.abs(loadings))[::-1][:top_n]  # Largest |weight| first
top_wavenumbers = fs.wavenumber[top_indices]
top_loadings = loadings[top_indices]
print(f"\nTop {top_n} wavenumbers by importance:")
for wn, loading in zip(top_wavenumbers, top_loadings):
    print(f"  {wn:.0f} cm⁻¹: {loading:.4f}")
Part 8: When Results Cannot Be Trusted¶
🚨 Critical warnings for this analysis:
- Perfect accuracy is suspicious
    - Cross-validation accuracy (91%) is good but not implausibly high
    - If we'd seen 98%+, we'd investigate for:
        - Batch confounding (all authentic oils from the same day?)
        - Data leakage (duplicate samples in CV folds?)
        - Artificial separation (classes too well-separated in this example data?)
- Small sample size (n=120)
    - With 40 samples per class and 5-fold CV, test sets only have 8 samples per class
    - Metrics are noisy; the 95% CI on accuracy is roughly ±6% (see the sketch after this list)
    - Recommendation: Validate on an independent dataset with n≥100
- Example data may not represent real variation
    - The example oils are in-house standards; real oils have more variability
    - Action: Test on market samples before deployment
- Preprocessing parameter dependency
    - PLS is sensitive to baseline correction (λ=100) and smoothing (window=7)
    - Changing these may change results
    - Recommendation: Validate robustness to preprocessing choices (ablation study)
- No explicit batch effect management
    - Data acquired on a single instrument/date
    - Real deployment should use batch-aware CV and batch controls
    - See: Statistical Power and Study Design Limits
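To put a number behind the "roughly ±6%" claim above, a sketch of a normal-approximation 95% confidence interval on the pooled cross-validation accuracy (an approximation only; it ignores the dependence between folds):
# Normal-approximation 95% CI on the pooled CV accuracy (120 pooled test predictions)
y_true_cv = np.array(outer_fold_true_labels)
y_pred_cv = np.array(outer_fold_predictions)
p_hat = np.mean(y_pred_cv == y_true_cv)
half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / len(y_true_cv))
print(f"Pooled CV accuracy: {p_hat:.3f} ± {half_width:.3f} (approx. 95% CI)")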
How to Reproduce This Analysis¶
Run as Python Script¶
# Save the code above as: reference_analysis.py
python reference_analysis.py
# Outputs:
# 01_raw_spectra_by_class.png
# 02_mean_spectra.png
# 03_preprocessing_comparison.png
# 04_pca_scores.png
# 05_pca_loadings.png
# 06_cv_confusion_matrix.png
# 07_pls_loadings.png
Run as Jupyter Notebook¶
jupyter notebook
# New notebook, paste code from each section above
# Run cells sequentially
Exact Command-Line Verification¶
# Check FoodSpec version
python -c "import foodspec; print(foodspec.__version__)"
# Load data and print shape
python -c "from foodspec.data.loader import load_example_oils; fs = load_example_oils(); print(fs.x.shape, fs.metadata['oil_type'].unique())"
Summary of Results¶
| Metric | Value |
|---|---|
| Cross-validation accuracy | 91.2% ± 3.3% |
| Per-class recall | Authentic 96%, Seed 75%, Mixed 92% |
| PCA variance (PC1+PC2) | ~72% |
| Best PLS components | 4 |
| Training samples per class | 40 |
| Test samples per class per CV fold | 8 |
Interpretation:
- Model reliably distinguishes authentic olive oil from seed oil and mixtures
- Authentic oils are easiest to recognize (highest recall)
- Refined seed oil has the highest misclassification rate (may overlap with Mixed)
- PC1 captures the main separation; PC2 adds refinement
See Also¶
- Reference Protocol — Full methodology
- Methods Text Generator — How to write this up for publication
- Model Evaluation and Validation — Deeper validation theory
- Oil Authentication Workflow — Domain-specific guidance
- Non-Goals & Limitations — When FoodSpec should not be used