End-to-End Protocol Run: Unified FoodSpec API¶
Level: Capstone (Advanced)
Runtime: ~3 seconds
Key Concepts: Chainable API, workflow composition, reproducibility, audit trails
What You Will Learn¶
In this capstone example, you'll learn how to:

- Master the Phase 1 unified FoodSpec API
- Compose complete workflows: QC → Preprocess → Train → Evaluate → Export
- Leverage built-in diagnostics for quality assurance
- Implement reproducible science with full audit trails
- Export results with provenance and complete parameter documentation
After completing this example, you'll understand best practices for professional, reproducible analysis workflows that meet regulatory and scientific standards.
Prerequisites¶
- Completion of at least 2-3 prior examples (Oil Auth, Heating, Mixture)
- Understanding of Python classes and method chaining
- Knowledge of cross-validation and model evaluation
- Familiarity with JSON and parameter documentation
- numpy, pandas, scikit-learn, and foodspec installed
Optional background: Read Protocols & YAML and Phase 1 Quickstart
The Problem¶
Real-world scenario: You're implementing a production system for oil classification. It needs to:

1. Pass quality checks on incoming data
2. Apply preprocessing consistently
3. Train a robust classifier with validation
4. Generate detailed diagnostics
5. Export with a complete audit trail (what was done, when, by whom, parameters used)
Goal: Build and validate a reproducible end-to-end pipeline.
Step 1: Initialize & Quality Check¶
import numpy as np
import pandas as pd
from foodspec.core import FoodSpec
from foodspec.datasets import load_example_data
# Load or create your spectroscopy dataset
X, y = load_example_data("oil_classification") # or your own data
# Initialize FoodSpec with protocol name
fs = FoodSpec(task="classification", name="oil_auth_production")
# Perform quality checks (QC)
qc_report = fs.quality_check(
X=X,
y=y,
check_type="complete", # checks: missing values, outliers, class balance, etc.
)
print("Quality Check Report:")
print(f" Data health score: {qc_report['health']:.2f}")
print(f" Issues detected: {qc_report['issues']}")
print(f" Recommendations: {qc_report['recommendations']}")
# Proceed only if quality is acceptable
if qc_report["health"] < 0.5:
raise ValueError("Data quality too low. Address issues before proceeding.")
What's happening:

- QC checks data for common issues (missing values, extreme outliers, class imbalance)
- Health score ranges 0–1 (1 = perfect quality)
- Issues and recommendations guide data improvement
- Production practice: Never skip QC; log results for the audit trail
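The exact checks behind quality_check are library-specific, but the idea can be illustrated with plain numpy and pandas. The sketch below shows the kinds of checks a QC step typically performs; the health-score formula here is a made-up illustration, not FoodSpec's:

```python
import numpy as np
import pandas as pd

def simple_qc(X, y):
    """Illustrative QC: missing values, extreme outliers, class balance."""
    X = np.asarray(X, dtype=float)
    issues = []

    # Missing values anywhere in the spectra
    missing_frac = np.isnan(X).mean()
    if missing_frac > 0:
        issues.append(f"{missing_frac:.1%} missing values")

    # Extreme outliers: samples whose mean intensity is far from the rest
    sample_means = np.nanmean(X, axis=1)
    z = (sample_means - sample_means.mean()) / (sample_means.std() + 1e-12)
    n_outliers = int((np.abs(z) > 4).sum())
    if n_outliers:
        issues.append(f"{n_outliers} extreme-intensity samples (|z| > 4)")

    # Class balance: ratio of smallest to largest class
    counts = pd.Series(y).value_counts()
    balance = counts.min() / counts.max()
    if balance < 0.5:
        issues.append(f"class imbalance (min/max ratio {balance:.2f})")

    # Toy health score: start at 1.0 and penalize each detected issue
    health = max(0.0, 1.0 - 0.2 * len(issues))
    return {"health": health, "issues": issues}
```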
Step 2: Preprocessing Pipeline¶
# Add preprocessing steps (chainable API)
fs.add_preprocessing_step(
"baseline_removal",
method="polyfit",
order=5,
description="Remove instrumental baseline"
)
fs.add_preprocessing_step(
"normalization",
method="vector",
description="L2 normalization to remove intensity effects"
)
fs.add_preprocessing_step(
"snv", # Standard Normal Variate
method="snv",
description="Scatter correction for multiplicative effects"
)
# Apply preprocessing
X_processed = fs.preprocess(X, fit=True) # fit=True: learn parameters from training data
print(f"Original shape: {X.shape}")
print(f"Processed shape: {X_processed.shape}")
print(f"Preprocessing pipeline: {[s['method'] for s in fs.preprocessing_steps]}")
Interpretation:

- Baseline removal: Removes instrument artifacts (polynomial fit)
- Normalization: Removes sample thickness effects (L2 norm)
- SNV: Corrects multiplicative scatter effects (concentration-independent)
- Preprocessing parameters are stored for consistency
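These three steps are standard chemometric transforms. The sketch below shows what each one does to a single spectrum using plain numpy; it is an illustration of the underlying math, not the FoodSpec implementation (a real baseline routine is usually more careful about which points it fits):

```python
import numpy as np

def remove_baseline(spectrum, wavenumbers, order=5):
    # Fit a low-order polynomial over the spectrum and subtract it (naive baseline removal)
    poly = np.polynomial.Polynomial.fit(wavenumbers, spectrum, deg=order)
    return spectrum - poly(wavenumbers)

def vector_normalize(spectrum):
    # L2 normalization: removes overall intensity / path-length effects
    return spectrum / np.linalg.norm(spectrum)

def snv(spectrum):
    # Standard Normal Variate: per-spectrum centering and scaling
    return (spectrum - spectrum.mean()) / spectrum.std()

# Example on a synthetic spectrum: one Gaussian peak on a sloping baseline
wn = np.linspace(400, 4000, 1800)
raw = np.exp(-((wn - 1650) / 30) ** 2) + 0.0002 * wn
processed = snv(vector_normalize(remove_baseline(raw, wn, order=5)))
```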
Step 3: Train with Cross-Validation¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
# Initialize model
model = RandomForestClassifier(
n_estimators=100,
max_depth=15,
random_state=42,
n_jobs=-1
)
# Perform cross-validation with detailed metrics
cv_results = cross_validate(
model, X_processed, y,
cv=5,
scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
return_train_score=True
)
# Summarize results
print("\nCross-Validation Results (5-fold):")
print(f" Accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")
print(f" Precision: {cv_results['test_precision_macro'].mean():.3f} ± {cv_results['test_precision_macro'].std():.3f}")
print(f" Recall: {cv_results['test_recall_macro'].mean():.3f} ± {cv_results['test_recall_macro'].std():.3f}")
print(f" F1: {cv_results['test_f1_macro'].mean():.3f} ± {cv_results['test_f1_macro'].std():.3f}")
# Check for overfitting
train_acc = cv_results['train_accuracy'].mean()
test_acc = cv_results['test_accuracy'].mean()
overfitting = train_acc - test_acc
print(f"\nOverfitting check:")
print(f" Train accuracy: {train_acc:.3f}")
print(f" Test accuracy: {test_acc:.3f}")
print(f" Gap: {overfitting:.3f}")
if overfitting > 0.10:
print(" ⚠️ Significant overfitting detected!")
else:
print(" ✓ Model generalization is good")
What's happening:

- Cross-validation evaluates the model on data it was not trained on, giving a realistic estimate of generalization
- Multiple metrics provide a comprehensive performance view
- The train-test gap indicates how well the model generalizes
- Results are stored for reporting and reproducibility
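One subtle point: here the preprocessing was fitted on the full dataset before cross-validation. For transforms that learn statistics from the data, you can push them inside each fold with a scikit-learn Pipeline so the test folds stay truly unseen. A minimal sketch, using StandardScaler as a stand-in for any parameter-learning preprocessing step:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fitted on the training folds only, inside each CV split,
# so no information from the held-out fold leaks into preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, max_depth=15,
                                   random_state=42, n_jobs=-1)),
])

cv_results_pipe = cross_validate(pipe, X, y, cv=5, scoring="accuracy")
print(f"Pipeline CV accuracy: {cv_results_pipe['test_score'].mean():.3f}")
```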
Step 4: Generate Diagnostics¶
# Generate comprehensive diagnostics
diagnostics = fs.generate_diagnostics(
X_processed=X_processed,
y=y,
model=model,
cv_results=cv_results,
include_pca=True,
include_confusion_matrix=True
)
print("\nDiagnostics Summary:")
print(f" PCA variance (PC1): {diagnostics['pca_variance'][0]:.3f}")
print(f" Feature importance (top 3): {diagnostics['feature_importance'][:3]}")
print(f" Class distribution: {diagnostics['class_counts']}")
print(f" Data health: {diagnostics['data_health']:.2f}")
# Store diagnostics for later review
import json
with open("diagnostics.json", "w") as f:
json.dump(diagnostics, f, indent=2, default=str) # default=str for non-JSON types
What's happening:

- Diagnostics provide a comprehensive view of the model and data
- PCA variance: How much spectral variation is captured by the first few components
- Feature importance: Which wavelengths are most predictive
- Class distribution: Balance assessment
- All diagnostics are stored for the audit trail
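generate_diagnostics is a FoodSpec convenience; the individual pieces can also be computed directly with scikit-learn and numpy if you want to see where the numbers come from. A rough sketch (the dictionary keys below mirror the ones printed above but are our own choice):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# PCA variance captured by the leading components
pca = PCA(n_components=5).fit(X_processed)
pca_variance = pca.explained_variance_ratio_

# Feature (wavelength) importance from a fitted random forest
model.fit(X_processed, y)
top_features = np.argsort(model.feature_importances_)[::-1][:3]

# Class balance and an out-of-fold confusion matrix
class_counts = pd.Series(y).value_counts().to_dict()
y_oof = cross_val_predict(model, X_processed, y, cv=5)
cm = confusion_matrix(y, y_oof)

manual_diagnostics = {
    "pca_variance": pca_variance.tolist(),
    "top_feature_indices": top_features.tolist(),
    "class_counts": class_counts,
    "confusion_matrix": cm.tolist(),
}
```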
Step 5: Export with Provenance¶
# Train final model on full dataset
model.fit(X_processed, y)
# Create comprehensive export with audit trail
export_data = {
"metadata": {
"timestamp": pd.Timestamp.now().isoformat(),
"analyst": "Production System",
"task": "oil_classification",
"model": "RandomForestClassifier",
"dataset": "oil_synthetic.csv"
},
"parameters": {
"preprocessing": [
{"method": s["method"], "params": s.get("params", {})}
for s in fs.preprocessing_steps
],
"model": model.get_params()
},
"performance": {
"cv_accuracy": float(cv_results['test_accuracy'].mean()),
"cv_f1": float(cv_results['test_f1_macro'].mean()),
"cv_std": float(cv_results['test_accuracy'].std())
},
"diagnostics": diagnostics
}
# Save export
with open("export_oil_auth_model.json", "w") as f:
json.dump(export_data, f, indent=2, default=str)
print("\n✓ Export complete!")
print(f" Saved to: export_oil_auth_model.json")
print(f" Model class: {model.__class__.__name__}")
print(f" Training samples: {X_processed.shape[0]}")
print(f" Spectral features: {X_processed.shape[1]}")
What's happening:

- Metadata: Timestamp, analyst, task, dataset (who, what, when)
- Parameters: Exact preprocessing steps + model hyperparameters (reproducibility)
- Performance: Cross-validation metrics (validation evidence)
- Diagnostics: All diagnostic data (transparency)
- Everything saved as JSON (human-readable, versionable)
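The JSON export documents what was done, but the fitted estimator itself is usually persisted separately. One common approach, sketched below with example file names, is to save the model with joblib and record a checksum of the JSON export so the audit trail can later detect tampering:

```python
import hashlib
import joblib

# Persist the fitted model next to the JSON audit record
joblib.dump(model, "export_oil_auth_model.joblib")

# Checksum of the JSON export, useful for integrity checks in an audit trail
with open("export_oil_auth_model.json", "rb") as f:
    export_sha256 = hashlib.sha256(f.read()).hexdigest()
print(f"Export SHA-256: {export_sha256}")
```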
Step 6: Use the Trained Model¶
# Load new unknown sample
X_unknown = pd.read_csv("unknown_oil_sample.csv", index_col=0).values
# Apply same preprocessing (using learned parameters)
X_unknown_processed = fs.preprocess(X_unknown, fit=False) # fit=False: use stored parameters
# Make predictions
predictions = model.predict(X_unknown_processed)
probabilities = model.predict_proba(X_unknown_processed)
print("\nPrediction for Unknown Sample:")
print(f" Predicted class: {predictions[0]}")
print(f" Confidence: {probabilities.max():.3f}")
print(f" All probabilities: {dict(zip(model.classes_, probabilities[0]))}")
Critical point: Preprocessing must use the same parameters learned during training (fit=False).
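In a separate production process you would reload both the fitted model and the preprocessing state rather than refitting them. A sketch, assuming the FoodSpec object (with its stored preprocessing parameters) can be pickled with joblib like most Python objects; the artifact file name is just an example:

```python
import joblib
import pandas as pd

# At training time: persist both the model and the preprocessing state
joblib.dump({"model": model, "foodspec": fs}, "oil_auth_artifacts.joblib")

# In production: reload and reuse the learned parameters (no refitting)
artifacts = joblib.load("oil_auth_artifacts.joblib")
fs_prod, model_prod = artifacts["foodspec"], artifacts["model"]

X_new = pd.read_csv("unknown_oil_sample.csv", index_col=0).values
X_new_processed = fs_prod.preprocess(X_new, fit=False)
print(model_prod.predict(X_new_processed))
```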
Full Working Script¶
See the production script with complete workflow:
📄 examples/phase1_quickstart.py – Full working code (139 lines)
Key Takeaways¶
✅ QC first: Always check data quality before training
✅ Chainable API: Compose workflows step-by-step with clear syntax
✅ Preprocessing consistency: Learn parameters on training data, apply to test/production
✅ Cross-validation: Essential for reliable performance estimates
✅ Comprehensive diagnostics: Understand your model and data
✅ Full export: Store metadata, parameters, metrics, and diagnostics for reproducibility
Production Best Practices¶
| Practice | Why | How |
|---|---|---|
| QC first | Catch garbage early | Always check health score |
| Preprocessing parameters | Consistency | Save learned parameters, reuse on new data |
| Cross-validation | Reliable performance estimates; reveals overfitting | Never evaluate on training data |
| Hyperparameter tuning | Model optimization | Use GridSearchCV or RandomizedSearchCV |
| Diagnostics | Transparency | Generate for every model |
| Audit trail | Reproducibility | Save metadata, parameters, metrics |
| Version control | Traceability | Commit models and exports to Git |
Real-World Deployment¶
Your production system would:

1. Load the unknown sample
2. Apply preprocessing (learned parameters)
3. Make a prediction with the trained model
4. Log the prediction to the audit trail
5. Alert if confidence is below a threshold
6. Store all results with a timestamp

A minimal sketch of steps 2–6 follows.
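The sketch uses hypothetical file names and a logging setup of our own choosing; it is an illustration of the pattern, not a prescribed deployment recipe:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="predictions_audit.log", level=logging.INFO)

def classify_sample(spectrum, fs, model, threshold=0.80):
    """Preprocess, predict, log, and flag low-confidence results."""
    # spectrum is assumed to be a 1-D numpy array of intensities
    X_proc = fs.preprocess(spectrum.reshape(1, -1), fit=False)
    proba = model.predict_proba(X_proc)[0]
    label = model.classes_[proba.argmax()]
    confidence = float(proba.max())

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prediction": str(label),
        "confidence": confidence,
        "low_confidence": confidence < threshold,
    }
    logging.info(json.dumps(record))  # append to the audit trail

    if record["low_confidence"]:
        logging.warning("Confidence below threshold; flag for manual review")
    return record
```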
Advanced Topics¶
Want to go deeper?

- Hyperparameter tuning: Optimize model parameters with GridSearchCV (sketched below)
- Ensemble methods: Combine multiple models for robustness
- Feature selection: Reduce wavelengths while maintaining accuracy
- Retraining strategy: When to retrain with new samples
- Model monitoring: Detect performance drift over time
See Reproducible Pipelines Workflow for complete details.
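For the hyperparameter-tuning item, a brief GridSearchCV sketch (the parameter grid is only an example; adjust it to your data and compute budget):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [10, 15, None],
    "min_samples_leaf": [1, 3],
}

# Exhaustive search over the grid with 5-fold CV, scored on macro F1
search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_processed, y)
print(search.best_params_, search.best_score_)
```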
Next Steps¶
- Try it: Run the full script end-to-end
- Customize: Modify preprocessing steps, model hyperparameters
- Test: Make predictions on new data using trained model
- Deploy: Save/load models, integrate into production system
- Learn more: Read Protocols & YAML
Interactive Notebook¶
For step-by-step exploration and parameter experimentation:
📓 examples/tutorials/05_protocol_unified_api_teaching.ipynb
Workflow Diagram¶
Load Data → QC Check → Preprocess → Train (CV) → Diagnostics → Export

- QC Check produces the data health score
- Preprocess stores the learned parameters
- Train (CV) produces the metrics that are analyzed
Example Output Structure¶
{
"metadata": {
"timestamp": "2026-01-06T14:30:00",
"analyst": "Production System",
"task": "oil_classification"
},
"parameters": {
"preprocessing": [
{"method": "baseline_removal", "order": 5},
{"method": "normalization"}
],
"model": {"n_estimators": 100, "max_depth": 15}
},
"performance": {
"cv_accuracy": 0.95,
"cv_f1": 0.94
},
"diagnostics": {...}
}
This is the foundation for production-grade FoodSpec workflows. 🚀
Figure provenance¶
- Generated by scripts/generate_docs_figures.py
- Output: ../assets/figures/architecture_flow.png