Troubleshooting

Quick Help

This page provides solutions to common technical issues encountered when using FoodSpec. For conceptual questions, see the FAQ. For reporting and reproducibility guidelines, see Reporting & Reproducibility.


Quick Problem Index

| Stage | Problem | Quick Fix |
|---|---|---|
| Installation | pip install fails | Check Python version ≥3.8; update pip |
| Installation | Import errors | Verify same Python/pip environment |
| Data | Missing labels | Use metadata validation tools |
| Data | Class imbalance | Use F1/PR metrics; resample or weight |
| Preprocessing | Over-smoothing | Reduce Savitzky–Golay window |
| Preprocessing | Poor baseline | Tune ALS lambda; try rubberband baseline |
| ML | Overfitting | Regularize; simplify; use stratified CV |
| ML | Data leakage | Ensure preprocessing inside Pipeline |
| Stats | Non-normal residuals | Use nonparametric tests |
| Stats | Multiple comparisons | Apply FDR/Tukey correction |
| Visualization | Unlabeled axes | Label wavenumber (cm⁻¹), intensity (a.u.) |
| Workflow | Wrong metrics | Consult workflow design guide |
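
Most rows above point to a detailed subsection later on this page. The two Stats fixes are not elaborated elsewhere here, so the following minimal sketch illustrates them (it assumes scipy and statsmodels are installed; the arrays are placeholders):

import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
group_a = rng.normal(1.00, 0.2, 30)   # placeholder intensities, group A
group_b = rng.normal(1.15, 0.2, 30)   # placeholder intensities, group B

# Nonparametric alternative to the t-test when residuals are non-normal
stat, p = mannwhitneyu(group_a, group_b)

# Benjamini-Hochberg FDR correction across several band-wise p-values
pvals = np.array([p, 0.012, 0.047, 0.20])  # placeholder p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
print(p_adj, reject)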

Installation Issues

Problem: pip install foodspec fails

Symptoms:

ERROR: Could not find a version that satisfies the requirement foodspec
ERROR: No matching distribution found for foodspec

Causes & Solutions:

1. Python Version Incompatibility

Check your Python version:

python --version

Solution: FoodSpec requires Python ≥3.8. Upgrade Python:

# Using conda
conda create -n foodspec python=3.10
conda activate foodspec
pip install foodspec

# Using pyenv
pyenv install 3.10.0
pyenv local 3.10.0
pip install foodspec

2. Outdated pip

Update pip:

pip install --upgrade pip setuptools wheel
pip install foodspec

3. Network/Firewall Issues

Try alternative PyPI mirrors:

# Use a specific PyPI mirror
pip install --index-url https://pypi.org/simple foodspec

# Install with verbose output to diagnose
pip install -v foodspec

4. Package Name Confusion

Verify the correct package name:

# Correct
pip install foodspec

NOT: pip install food-spec, FoodSpec, foodspectra, etc.


Problem: Import errors after installation

Symptoms:

>>> import foodspec
ModuleNotFoundError: No module named 'foodspec'

Causes & Solutions:

1. Multiple Python Environments

Check which Python is active:

which python
which pip
python -c "import sys; print(sys.executable)"

Solution: Ensure pip and python are from the same environment:

# Use python -m pip instead
python -m pip install foodspec

# Verify installation
python -c "import foodspec; print(foodspec.__version__)"

2. Development Installation Not Linked

If installing from source:

# Editable install
cd /path/to/foodspec
pip install -e .

# Verify
python -c "import foodspec; print(foodspec.__file__)"

3. PYTHONPATH Issues

Check PYTHONPATH:

echo $PYTHONPATH

Solution: Add FoodSpec to PYTHONPATH (if needed):

export PYTHONPATH="/path/to/foodspec/src:$PYTHONPATH"


Data Loading Issues

  • Verify file paths are correct relative to your working directory; prefer absolute paths when scripting.
  • Confirm delimiters/headers match loader expectations (e.g., wavenumber column present for CSV/HDF5 helpers).
  • For registry-driven runs, check that metadata tables point to existing files and have consistent sample IDs (see the sketch below).
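
A minimal sketch of these checks, assuming a pandas-readable metadata table with hypothetical file_path and sample_id columns (adapt the names to your registry):

import pandas as pd
from pathlib import Path

metadata = pd.read_csv("metadata.csv")      # hypothetical registry table

# 1. Do all referenced spectra files exist?
missing = [p for p in metadata["file_path"] if not Path(p).is_file()]
print(f"Missing files: {len(missing)}")

# 2. Are sample IDs consistent (no unexpected duplicates)?
dupes = metadata.loc[metadata["sample_id"].duplicated(), "sample_id"]
print(f"Duplicated sample IDs: {sorted(dupes.unique())}")

# 3. Does the first spectrum file parse with the expected columns/delimiter?
spectra = pd.read_csv(metadata["file_path"].iloc[0])
print(spectra.columns[:5].tolist())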

Missing Dependencies

Problem: Optional dependencies not installed

Symptoms:

>>> from foodspec.visualization import plot_spectra
ImportError: matplotlib is required for visualization. Install with: pip install foodspec[viz]

Solution: Install optional dependency groups:

# Visualization (matplotlib, seaborn)
pip install foodspec[viz]

# Machine learning (scikit-learn, xgboost)
pip install foodspec[ml]

# All optional dependencies
pip install foodspec[all]

# Multiple groups
pip install foodspec[viz,ml]

Available groups:

  • viz: Plotting and visualization
  • ml: Machine learning models (RF, XGBoost)
  • notebooks: Jupyter notebook support
  • dev: Development tools (pytest, black, mypy)
  • docs: Documentation building (mkdocs, mkdocstrings)
  • all: All optional dependencies
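
To see which optional backends are importable in the current environment, a quick check (a minimal sketch; the module names below are the typical packages behind each group):

import importlib.util

for pkg in ("matplotlib", "seaborn", "sklearn", "xgboost"):
    status = "available" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")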


Problem: Conflicting dependency versions

Symptoms:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
ERROR: foodspec requires numpy>=1.20, but you have numpy 1.19.5

Solution 1: Upgrade conflicting packages

pip install --upgrade numpy scipy scikit-learn
pip install foodspec

Solution 2: Use a clean environment

conda create -n foodspec-clean python=3.10
conda activate foodspec-clean
pip install foodspec

Solution 3: Use conda for dependency management

conda install -c conda-forge numpy scipy scikit-learn matplotlib
pip install foodspec


Shape/Axis Mismatch Errors

Problem: "Shapes do not match" during preprocessing

Symptoms:

>>> from foodspec.preprocessing import baseline_als
>>> X_corrected = baseline_als(X)
ValueError: operands could not be broadcast together with shapes (100, 1800) (1801,)

Diagnosis:

import numpy as np

print(f"X shape: {X.shape}")  # e.g., (100, 1800)
print(f"wavenumbers shape: {wavenumbers.shape}")  # e.g., (1801,)

# Problem: wavenumbers has 1801 elements, but X has 1800 columns

Causes & Solutions:

1. Wavenumber Grid Mismatch

Solution: Ensure wavenumber array matches spectral columns:

# Check alignment
assert X.shape[1] == len(wavenumbers), f"Mismatch: {X.shape[1]} vs {len(wavenumbers)}"

# If mismatch, trim or interpolate
if X.shape[1] != len(wavenumbers):
    # Option A: Trim wavenumbers to match X
    wavenumbers = wavenumbers[:X.shape[1]]

    # Option B: Trim X to match wavenumbers
    X = X[:, :len(wavenumbers)]

    # Option C: Interpolate to common grid (recommended)
    from foodspec.preprocessing import interpolate_to_grid
    X, wavenumbers = interpolate_to_grid(X, wavenumbers, new_grid=np.arange(4000, 650, -2))

2. Row vs. Column Confusion

Problem: Transpose needed

# Wrong: X is (n_wavenumbers, n_samples) instead of (n_samples, n_wavenumbers)
print(X.shape)  # (1800, 100) - WRONG

# Solution: Transpose
X = X.T
print(X.shape)  # (100, 1800) - CORRECT

FoodSpec convention: Rows = samples, Columns = wavenumbers

3. 1D vs. 2D Array

Problem: Single spectrum treated as 2D

# Wrong
single_spectrum = X[0]  # Shape: (1800,)
baseline_als(single_spectrum)  # Error: expects 2D

# Solution: Reshape to 2D
single_spectrum = X[0:1]  # Shape: (1, 1800)
# OR
single_spectrum = X[0].reshape(1, -1)
baseline_als(single_spectrum)  # Works


Problem: "Axis out of range" errors

Symptoms:

>>> from foodspec.ml import fit_pls
>>> model = fit_pls(X, y, n_components=10)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1

Diagnosis:

print(f"X ndim: {X.ndim}")  # Should be 2
print(f"X shape: {X.shape}")
print(f"y ndim: {y.ndim}")  # Should be 1 for labels

Solution:

# Ensure X is 2D
if X.ndim == 1:
    X = X.reshape(1, -1)  # Single sample

# Ensure y is 1D (for classification/regression)
if y.ndim > 1:
    y = y.ravel()  # Flatten (100, 1) → (100,)


NaNs After Preprocessing

Problem: NaNs appear after baseline correction

Symptoms:

>>> X_corrected = baseline_als(X, lam=1e6, p=0.01)
>>> np.isnan(X_corrected).sum()
1500  # Many NaNs!

Causes & Solutions:

1. Input Contains NaNs

Check input:

print(f"NaNs in input: {np.isnan(X).sum()}")

Solution: Remove or impute NaNs before preprocessing:

# Option A: Drop samples with NaNs
mask = ~np.isnan(X).any(axis=1)
X_clean = X[mask]

# Option B: Impute with linear interpolation (np.interp over NaN positions)
for i in range(X.shape[0]):
    if np.isnan(X[i]).any():
        nan_mask = np.isnan(X[i])
        not_nan = ~nan_mask
        X[i, nan_mask] = np.interp(
            np.where(nan_mask)[0],
            np.where(not_nan)[0],
            X[i, not_nan]
        )

2. Division by Zero in Normalization

Problem: Zero or near-zero standard deviation

# SNV normalization: (X - mean) / std
# If std ≈ 0 → division by zero → NaN

from foodspec.preprocessing import snv
X_norm = snv(X)

# Diagnosis
stds = X.std(axis=1, ddof=1)
print(f"Samples with std < 1e-6: {(stds < 1e-6).sum()}")

Solution: Add epsilon to denominator or filter flat spectra:

# Option A: Filter flat spectra
threshold = 1e-4
mask = X.std(axis=1, ddof=1) > threshold
X_filtered = X[mask]
X_norm = snv(X_filtered)

# Option B: Custom SNV with epsilon
def snv_safe(X, eps=1e-8):
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    std = np.where(std < eps, eps, std)  # Avoid division by zero
    return (X - mean) / std

X_norm = snv_safe(X)

3. Baseline Correction Failure

Problem: ALS baseline correction fails on saturated spectra

# Diagnosis: Check for saturation
print(f"Max intensity: {X.max()}")
print(f"Saturated pixels: {(X > 3.0).sum()}")  # Absorbance > 3.0

Solution: Clip intensities or skip saturated spectra:

# Clip absorbance to reasonable range
X_clipped = np.clip(X, -0.5, 3.0)
X_corrected = baseline_als(X_clipped, lam=1e6, p=0.01)


Problem: NaNs after derivative calculation

Symptoms:

>>> from foodspec.preprocessing import savgol_filter
>>> X_deriv = savgol_filter(X, window_length=11, polyorder=2, deriv=1)
>>> np.isnan(X_deriv).sum()
50  # NaNs at edges

Cause: Edge effects in the Savitzky–Golay filter

Solution: Use mode='interp' or trim edges:

from scipy.signal import savgol_filter

# Option A: Use mode='interp' (fits a polynomial over the edge windows; no NaNs)
X_deriv = savgol_filter(X, window_length=11, polyorder=2, deriv=1, mode='interp', axis=1)

# Option B: Trim edges
window = 11
edge = window // 2
X_deriv = savgol_filter(X, window_length=window, polyorder=2, deriv=1, axis=1)
X_deriv = X_deriv[:, edge:-edge]  # Remove edge columns
wavenumbers = wavenumbers[edge:-edge]  # Also trim wavenumbers


Model Overfitting / Too-Good Accuracy

Problem: Suspiciously high accuracy (>95%)

Symptoms:

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
>>> scores.mean()
0.989  # 98.9% accuracy - too good to be true!

Diagnosis Checklist:

# 1. Check for data leakage (replicate leakage)
print(f"Number of samples: {len(np.unique(sample_ids))}")
print(f"Number of spectra: {len(X)}")
print(f"Replicates per sample: {len(X) / len(np.unique(sample_ids))}")

# 2. Check train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

# 3. Check class balance
from collections import Counter
print(f"Class distribution: {Counter(y)}")

# 4. Check for preprocessing leakage
# Did you normalize BEFORE splitting? That's leakage!

Common Causes & Solutions:

1. Replicate Leakage

Problem: Technical replicates split across train/test

Diagnosis:

from sklearn.model_selection import GroupKFold, KFold

# Random CV (leaky)
cv_random = KFold(n_splits=5, shuffle=True, random_state=42)
scores_random = cross_val_score(model, X, y, cv=cv_random)

# Grouped CV (correct)
cv_grouped = GroupKFold(n_splits=5)
scores_grouped = cross_val_score(model, X, y, cv=cv_grouped, groups=sample_ids)

print(f"Random CV:  {scores_random.mean():.3f}")
print(f"Grouped CV: {scores_grouped.mean():.3f}")
print(f"Drop: {scores_random.mean() - scores_grouped.mean():.3f}")

# If drop > 0.10 → replicate leakage!

Solution: Always use grouped CV

from foodspec.ml.validation import grouped_cross_validation

results = grouped_cross_validation(
    X, y,
    groups=sample_ids,  # Critical!
    model=RandomForestClassifier(),
    n_splits=5,
    n_repeats=10
)
print(f"Realistic Accuracy: {results['accuracy_mean']:.3f} ± {results['accuracy_ci']:.3f}")

2. Preprocessing Leakage

Problem: Normalization fit on entire dataset before splitting

Wrong:

# ❌ WRONG: Preprocessing before splitting
X_norm = snv(X)  # Uses statistics from entire dataset
X_train, X_test, y_train, y_test = train_test_split(X_norm, y)
model.fit(X_train, y_train)

Correct:

# ✅ CORRECT: Preprocessing within CV folds
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),  # Fit only on train in each fold
    ('model', RandomForestClassifier())
])

scores = cross_val_score(pipe, X, y, cv=GroupKFold(5), groups=sample_ids)

3. Overfitting Small Datasets

Problem: More features than samples (p >> n)

Diagnosis:

n_samples, n_features = X.shape
print(f"Samples: {n_samples}, Features: {n_features}")
print(f"Feature-to-sample ratio: {n_features / n_samples:.1f}")

# If ratio > 10 → high overfitting risk

Solution: Reduce features or regularize

# Option A: Feature selection (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=min(50, n_samples // 2))
X_reduced = pca.fit_transform(X)

# Option B: Regularized models
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', solver='saga', C=0.1)  # L1 regularization

# Option C: Simpler models
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')  # Shrinkage stabilizes LDA when p >> n


Domain Shift Failure

Problem: Model fails on new batches/instruments

Symptoms:

# Trained on Batch 1-4, tested on Batch 5
>>> model.fit(X_train, y_train)  # Batches 1-4
>>> accuracy_train = model.score(X_train, y_train)
>>> accuracy_test = model.score(X_test, y_test)  # Batch 5
>>> print(f"Train: {accuracy_train:.3f}, Test: {accuracy_test:.3f}")
Train: 0.93, Test: 0.65  # 28-point drop!

Diagnosis:

# Visualize batch separation (PCA)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 6))
for batch in np.unique(batches):
    mask = batches == batch
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=f'Batch {batch}', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.title('Batch Effect Visualization')
plt.show()

# If batches form distinct clusters → domain shift

Causes & Solutions:

1. Instrument/Day Variability

Solution A: Batch Correction (ComBat)

from neuroCombat import neuroCombat
import pandas as pd

# Harmonize spectra to remove batch effects
# (covars must be a DataFrame; batch_col names the batch column)
covars = pd.DataFrame({'batch': batches})
X_harmonized = neuroCombat(
    dat=X.T,  # Features × samples
    covars=covars,
    batch_col='batch'
)['data'].T

# Re-train on harmonized data (split with the same train/test indices as before)
model.fit(X_harmonized_train, y_train)
accuracy_harmonized = model.score(X_harmonized_test, y_test)
print(f"Post-Harmonization: {accuracy_harmonized:.3f}")

Solution B: Transfer Learning

from foodspec.ml.harmonization import transfer_component_analysis

# Align source (old batches) to target (new batch)
X_aligned = transfer_component_analysis(
    X_source=X_train,
    X_target=X_test,
    n_components=10
)

# Re-train on the aligned source spectra
model.fit(X_aligned, y_train)

Solution C: Calibration Transfer (Piecewise Direct Standardization)

# Measure standard samples on both instruments
# Use piecewise direct standardization (PDS)
from foodspec.ml.calibration import piecewise_direct_standardization

X_test_corrected = piecewise_direct_standardization(
    X_source=X_train_standards,
    X_target=X_test_standards,
    X_to_correct=X_test,
    window_size=9
)

2. Temperature/Humidity Drift

Solution: Include environmental covariates or normalize by reference

# Option A: Reference normalization (MSC to reference spectrum)
from foodspec.preprocessing import msc

reference = X_train.mean(axis=0)  # Use training mean as reference
X_train_norm = msc(X_train, reference=reference)
X_test_norm = msc(X_test, reference=reference)

# Option B: Model environmental variables
import pandas as pd
metadata = pd.DataFrame({
    'temperature': [...],
    'humidity': [...],
    'spectrum': X.tolist()
})

# Include as features or stratify


Reproducibility Mismatch

Problem: Results differ across runs despite setting seed

Symptoms:

# Run 1
>>> np.random.seed(42)
>>> scores1 = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
>>> scores1.mean()
0.873

# Run 2 (same code)
>>> np.random.seed(42)
>>> scores2 = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
>>> scores2.mean()
0.879  # Different!

Causes & Solutions:

1. Missing Random State in CV

Problem: CV splitter not seeded

# Wrong
cv = KFold(n_splits=5, shuffle=True)  # No random_state!

# Correct
cv = KFold(n_splits=5, shuffle=True, random_state=42)

2. Library Version Mismatch

Check versions:

import sklearn, numpy, scipy, foodspec

print(f"scikit-learn: {sklearn.__version__}")
print(f"numpy: {numpy.__version__}")
print(f"scipy: {scipy.__version__}")
print(f"foodspec: {foodspec.__version__}")

Solution: Document and freeze versions

# Save environment
pip freeze > requirements.txt

# Or use conda
conda env export > environment.yml

# Share requirements.txt with collaborators

Known version-dependent behaviors:

  • NumPy <1.20 vs ≥1.20: default RNG behavior changed (use np.random.Generator for consistency)
  • scikit-learn 0.24 vs 1.0+: random_state behavior changed in some estimators
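
If draws need to be stable across runs and NumPy versions, a minimal sketch using the newer Generator API:

import numpy as np

rng = np.random.default_rng(42)                  # seeded Generator
indices = rng.permutation(100)                   # reproducible shuffle
noise = rng.normal(0.0, 0.01, size=(10, 1800))   # reproducible noise draw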

3. Parallelism Non-Determinism

Problem: n_jobs=-1 causes non-deterministic behavior

Solution: Set n_jobs=1 for reproducibility (slower but deterministic)

# For reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=1)

# For speed (may not be perfectly reproducible)
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

4. Floating-Point Precision

Problem: Different hardware (CPU vs GPU, Intel vs ARM) gives slightly different results

Solution: Accept small differences (<1e-6) or use lower precision

# Compare with tolerance
np.testing.assert_allclose(result1, result2, rtol=1e-5, atol=1e-6)

# Round for comparison
result1_rounded = np.round(result1, decimals=5)
result2_rounded = np.round(result2, decimals=5)
assert np.allclose(result1_rounded, result2_rounded)


Quick Diagnostic Script

Run this script to diagnose common issues:

import numpy as np
import sys

def diagnose_data(X, y=None, wavenumbers=None, sample_ids=None):
    """Comprehensive data diagnostics."""

    print("="*60)
    print("FOODSPEC DATA DIAGNOSTICS")
    print("="*60)

    # 1. Shape checks
    print(f"\n[1] SHAPE CHECKS")
    print(f"   X shape: {X.shape}")
    print(f"   X dtype: {X.dtype}")
    if y is not None:
        print(f"   y shape: {y.shape}")
        print(f"   y dtype: {y.dtype}")
    if wavenumbers is not None:
        print(f"   wavenumbers shape: {wavenumbers.shape}")
        if X.shape[1] != len(wavenumbers):
            print(f"   ⚠️  WARNING: Shape mismatch! {X.shape[1]} != {len(wavenumbers)}")

    # 2. Missing values
    print(f"\n[2] MISSING VALUES")
    n_nan = np.isnan(X).sum()
    n_inf = np.isinf(X).sum()
    print(f"   NaNs: {n_nan} ({100*n_nan/X.size:.2f}%)")
    print(f"   Infs: {n_inf} ({100*n_inf/X.size:.2f}%)")
    if n_nan > 0 or n_inf > 0:
        print(f"   ⚠️  WARNING: Missing/infinite values detected!")

    # 3. Intensity range
    print(f"\n[3] INTENSITY RANGE")
    print(f"   Min: {X.min():.4f}")
    print(f"   Max: {X.max():.4f}")
    print(f"   Mean: {X.mean():.4f}")
    print(f"   Std: {X.std():.4f}")
    if X.max() > 5.0:
        print(f"   ⚠️  WARNING: Unusually high absorbance (>5.0)")
    if X.min() < -1.0:
        print(f"   ⚠️  WARNING: Negative absorbance (<-1.0)")

    # 4. Flat spectra
    print(f"\n[4] FLAT SPECTRA CHECK")
    stds = X.std(axis=1, ddof=1)
    n_flat = (stds < 1e-4).sum()
    print(f"   Flat spectra (std < 1e-4): {n_flat} ({100*n_flat/len(X):.2f}%)")
    if n_flat > 0:
        print(f"   ⚠️  WARNING: Flat spectra may cause normalization issues")

    # 5. Class balance (if labels provided)
    if y is not None:
        print(f"\n[5] CLASS BALANCE")
        from collections import Counter
        counts = Counter(y)
        for label, count in sorted(counts.items()):
            print(f"   {label}: {count} ({100*count/len(y):.1f}%)")
        min_count = min(counts.values())
        max_count = max(counts.values())
        imbalance_ratio = max_count / min_count
        if imbalance_ratio > 3:
            print(f"   ⚠️  WARNING: Severe class imbalance (ratio: {imbalance_ratio:.1f})")

    # 6. Replicate structure (if sample_ids provided)
    if sample_ids is not None:
        print(f"\n[6] REPLICATE STRUCTURE")
        n_samples = len(np.unique(sample_ids))
        n_spectra = len(sample_ids)
        avg_reps = n_spectra / n_samples
        print(f"   Unique samples: {n_samples}")
        print(f"   Total spectra: {n_spectra}")
        print(f"   Avg replicates/sample: {avg_reps:.1f}")
        if avg_reps > 1:
            print(f"   ⚠️  IMPORTANT: Use grouped CV to prevent replicate leakage!")

    # 7. Feature-to-sample ratio
    print(f"\n[7] OVERFITTING RISK")
    n_samples, n_features = X.shape
    ratio = n_features / n_samples
    print(f"   Samples: {n_samples}")
    print(f"   Features: {n_features}")
    print(f"   Feature/Sample ratio: {ratio:.1f}")
    if ratio > 10:
        print(f"   ⚠️  WARNING: High overfitting risk (p >> n). Consider PCA/feature selection.")

    # 8. System info
    print(f"\n[8] SYSTEM INFO")
    print(f"   Python: {sys.version.split()[0]}")
    try:
        import sklearn, scipy, foodspec
        print(f"   NumPy: {np.__version__}")
        print(f"   SciPy: {scipy.__version__}")
        print(f"   scikit-learn: {sklearn.__version__}")
        print(f"   FoodSpec: {foodspec.__version__}")
    except ImportError as e:
        print(f"   ⚠️  Missing package: {e}")

    print("\n" + "="*60)
    print("DIAGNOSTICS COMPLETE")
    print("="*60)

# Usage:
# diagnose_data(X, y, wavenumbers, sample_ids)

Still Having Issues?

If your problem isn't covered here:

  1. Check the FAQ: common questions answered
  2. Search existing issues: GitHub Issues
  3. Ask for help: open a new issue that includes:
     • FoodSpec version (import foodspec; print(foodspec.__version__))
     • Python version (python --version)
     • A minimal reproducible example
     • The full error traceback
  4. Consult the documentation.