Oil Authentication: Supervised Classification

Level: Beginner → Intermediate
Runtime: ~10 seconds
Key Concepts: Classification, cross-validation, confusion matrices, model discrimination

What You Will Learn

In this example, you'll learn how to:
- Load spectroscopy data from CSV files
- Train a classifier to distinguish oils by type
- Evaluate model performance with cross-validation
- Interpret confusion matrices and classification metrics
- Visualize data structure using dimensionality reduction (PCA)

After completing this example, you'll understand the workflow for any classification problem in FoodSpec (fraud detection, quality assessment, authenticity verification).

Prerequisites

Basic Python knowledge
Familiarity with NumPy arrays and Pandas DataFrames
Understanding of supervised learning concepts (train/test, classification)
numpy, pandas, matplotlib, scikit-learn installed

Optional background: Read Chemometrics & ML Basics

The Problem

Real-world scenario: You're a food manufacturer testing whether your olive oil supplies are authentic. You have reference spectra for virgin olive oil, processed olive oil, and two adulterants (sunflower, canola). Can you build a classifier to automatically detect fake oils?

Data: Raman spectra (intensity vs. wavenumber) for 8 samples across 4 classes.

Goal: Train a model, evaluate with cross-validation, interpret results.

Step 1: Load Data

import numpy as np
import pandas as pd
from pathlib import Path

# Load synthetic oil spectra (8 samples × 1500 wavenumbers)
data = pd.read_csv("examples/data/oil_synthetic.csv", index_col=0)
X = data.drop("OilType", axis=1).values
y = data["OilType"].values

print(f"Data shape: {X.shape}")  # (8, 1500)
print(f"Oil types: {np.unique(y)}")  # ['CanOil' 'OliveOil' 'ProcessedOl' 'SunflowerOil']

What's happening:
- The CSV contains 8 spectra and their oil type labels
- X contains the spectral intensities (predictors)
- y contains the oil type labels (targets)
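Because the number of spectra per class limits how many cross-validation folds are possible, it can help to check class balance right after loading. The sketch below builds a small stand-in table with the layout the loading code assumes (an index column, intensity columns, and an `OilType` label column). The column names `wn_0`, `wn_1`, ... and the random intensities are illustrative assumptions, not the contents of `oil_synthetic.csv`.

```python
import numpy as np
import pandas as pd

# Stand-in for the CSV: 8 rows, 5 intensity columns, plus the label column.
# Values, sample names, and column names are made up for illustration.
rng = np.random.default_rng(0)
labels = ["CanOil", "OliveOil", "ProcessedOl", "SunflowerOil"] * 2
data = pd.DataFrame(
    rng.random((8, 5)),
    columns=[f"wn_{i}" for i in range(5)],
    index=[f"sample_{i}" for i in range(8)],
)
data["OilType"] = labels

# Same split as in Step 1: predictors vs. targets
X = data.drop("OilType", axis=1).values
y = data["OilType"].values

# Spectra per class; small counts restrict the usable number of stratified folds
class_counts = pd.Series(y).value_counts()
print(class_counts)
print(f"Smallest class: {class_counts.min()} spectra")
```

With the real file, the same `value_counts()` call on `data["OilType"]` reports how many spectra each oil type contributes.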

Step 2: Train with Cross-Validation

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Initialize classifier
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# Evaluate with 5-fold cross-validation
scores = cross_validate(
    clf, X, y, 
    cv=5, 
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"]
)

print(f"CV Accuracy: {scores['test_accuracy'].mean():.3f} ± {scores['test_accuracy'].std():.3f}")
print(f"CV F1: {scores['test_f1_macro'].mean():.3f} ± {scores['test_f1_macro'].std():.3f}")

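What's happening:
- RandomForestClassifier is trained on 4/5 of the data and tested on the remaining 1/5
- This repeats 5 times, rotating which fold is held out
- Accuracy, precision, recall, and F1 are computed on each held-out fold, then summarized as mean ± standard deviation
- Metrics near 1.0 indicate good discrimination
- For classifiers, an integer cv uses stratified folds, so cv cannot exceed the number of spectra in the largest class. With only 8 spectra in 4 classes (2 per class if the data is balanced), scikit-learn raises a ValueError for cv=5, and cv=2 is the largest value that works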
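One way to keep the fold count valid is to derive it from the class sizes and pass an explicit splitter. The following is a minimal sketch on synthetic stand-in data (4 classes with 6 spectra each and 20 features, all invented here), not the dataset used above; the cap of 5 folds simply mirrors the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data: each class gets its own intensity offset
rng = np.random.default_rng(0)
y = np.repeat(["CanOil", "OliveOil", "ProcessedOl", "SunflowerOil"], 6)
X = rng.normal(size=(24, 20)) + 3 * np.repeat(np.arange(4), 6)[:, None]

# Cap the number of folds at the size of the smallest class
_, counts = np.unique(y, return_counts=True)
n_splits = min(5, int(counts.min()))
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

print(f"Folds used: {n_splits}")
print(f"CV Accuracy: {scores['test_accuracy'].mean():.3f} ± {scores['test_accuracy'].std():.3f}")
print(f"CV F1: {scores['test_f1_macro'].mean():.3f} ± {scores['test_f1_macro'].std():.3f}")
```

Shuffling with a fixed `random_state` keeps the folds reproducible; every fold still contains at least one spectrum from each class because the split is stratified.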

Step 3: Visualize Performance & Structure

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Get predictions from cross-validation
y_pred = cross_val_predict(clf, X, y, cv=5)
cm = confusion_matrix(y, y_pred)

# Plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, cmap="Blues")
ax.set_xticks(range(len(np.unique(y))))
ax.set_yticks(range(len(np.unique(y))))
ax.set_xticklabels(np.unique(y), rotation=45)
ax.set_yticklabels(np.unique(y))
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Oil Classification: Confusion Matrix")

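# Add counts, switching text color so labels stay readable on light and dark cells
threshold = cm.max() / 2
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        color = "white" if cm[i, j] > threshold else "black"
        ax.text(j, i, cm[i, j], ha="center", va="center", color=color)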

plt.tight_layout()
plt.savefig("oil_auth_confusion.png", dpi=150, bbox_inches="tight")
plt.show()

Interpretation:
- Diagonal elements = correct classifications
- Off-diagonal elements = misclassifications
- Perfect classifier = non-zero counts only on the diagonal
- This plot reveals which oils are confused with one another
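To look at data structure with PCA, as listed in the learning goals, the spectra can be projected onto their first two principal components and colored by class. This is a minimal sketch on synthetic stand-in data (8 spectra, 50 features, invented here); with the real data, `X` and `y` from Step 1 would take its place.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in data: 2 spectra per class, class-dependent offset
rng = np.random.default_rng(0)
classes = ["CanOil", "OliveOil", "ProcessedOl", "SunflowerOil"]
y = np.repeat(classes, 2)
X = rng.normal(size=(8, 50)) + np.repeat(np.arange(4), 2)[:, None]

# PCA mean-centers the data internally, then projects onto 2 components
pca = PCA(n_components=2)
scores_2d = pca.fit_transform(X)

# Scatter plot of the scores, one color per oil type
fig, ax = plt.subplots(figsize=(6, 5))
for label in classes:
    mask = y == label
    ax.scatter(scores_2d[mask, 0], scores_2d[mask, 1], label=label)
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%} variance)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%} variance)")
ax.set_title("Oil spectra: PCA scores")
ax.legend()
fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view or store the plot
```

Classes that form separate clusters in this plot are usually the ones the classifier distinguishes easily; overlapping clusters tend to match the off-diagonal entries of the confusion matrix.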

Full Working Script

See the production script with enhanced documentation, output directory management, and additional analysis:

📄 examples/oil_authentication_quickstart.py – Full working code (35 lines)

Generated Figure

[Figure: Confusion Matrix]

Figure interpretation:
- Rows = true oil types
- Columns = predicted oil types
- Perfect classification: all counts on the diagonal
- Model performance: off-diagonal counts show which oil pairs the model confuses
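The row and column reading above translates directly into per-class recall and precision. The sketch below uses a hypothetical confusion matrix (one OliveOil spectrum misclassified as ProcessedOl), not results from this example.

```python
import numpy as np

labels = ["CanOil", "OliveOil", "ProcessedOl", "SunflowerOil"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [2, 0, 0, 0],
    [0, 1, 1, 0],  # one OliveOil spectrum predicted as ProcessedOl
    [0, 0, 2, 0],
    [0, 0, 0, 2],
])

true_positives = np.diag(cm)
# Recall: correct predictions divided by the row total (all true members)
recall = true_positives / cm.sum(axis=1)
# Precision: correct predictions divided by the column total (all predictions
# of that class); a class that is never predicted would give a zero division
precision = true_positives / cm.sum(axis=0)
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>12}: precision={p:.2f}, recall={r:.2f}")
print(f"Overall accuracy: {accuracy:.3f}")
```

In this made-up case OliveOil has recall 0.50 (half of its spectra were missed) and ProcessedOl has precision 0.67 (one of its three predictions was wrong), while overall accuracy is 0.875. The single misclassification affects two different classes in two different ways, which accuracy alone hides.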

Key Takeaways

✅ Classification workflow: Load → Train → Cross-validate → Evaluate
✅ Cross-validation: Estimates performance on unseen data by rotating train/test splits, which helps reveal overfitting
✅ Confusion matrix: Shows misclassification patterns, not just accuracy
✅ Metrics matter: Precision/recall assess class-specific performance

Real-World Applications

🌾 Olive oil authentication: Detect counterfeit high-value oils
🍯 Honey fraud detection: Distinguish pure from adulterated honey
🧈 Butter authenticity: Identify margarine substitution
🥛 Milk origin verification: Grass vs. grain-fed dairy

Next Steps

Try it: Modify the classifier (e.g., use SVM, Logistic Regression)
Explore: Change the number of cross-validation folds and observe the variance (cv=10 needs a larger dataset than the 8 spectra used here)
Learn more: Read Classification & Regression
Advance: See Oil Authentication Workflow for complete domain example
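For the "Try it" suggestion, swapping the classifier only changes the estimator passed to the cross-validation call. The following sketch compares three scikit-learn models on synthetic stand-in data (invented here); wrapping SVM and logistic regression in a pipeline with `StandardScaler` is a common choice for scale-sensitive models, not something the example above prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 4 classes x 6 spectra, 20 features
rng = np.random.default_rng(0)
y = np.repeat(["CanOil", "OliveOil", "ProcessedOl", "SunflowerOil"], 6)
X = rng.normal(size=(24, 20)) + np.repeat(np.arange(4), 6)[:, None]

models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    # Scaling lives inside the pipeline so it is fit on training folds only
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Reuse the same folds for every model so the comparison is like-for-like
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name:>18}: {results[name].mean():.3f} ± {results[name].std():.3f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the held-out fold into training, which would otherwise make the scores look better than they are.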

Interactive Notebook

For step-by-step exploration with visualizations:

📓 examples/tutorials/01_oil_authentication_teaching.ipynb

Figure provenance

Generated by scripts/generate_docs_figures.py
Output: ../assets/figures/oil_confusion.png