
Statistical Power and Study Design Limits

Who: Study planners, researchers designing FoodSpec experiments, statisticians and reviewers evaluating study robustness.

What: Practical guidance on sample size, replicates, batch effects, and statistical power specific to food spectroscopy applications.

When: Use before starting a new FoodSpec study to determine adequate sample size and replication strategy.

When NOT: This is not a substitute for formal statistical power analysis (consult a statistician for complex designs); nor is it a replacement for domain-specific method validation.

Key Assumptions:

  • Studies follow the Reference Protocol structure
  • Samples are representative of the intended population
  • You have reasonable estimates of effect size (from prior studies or pilot data)

What can go wrong:

  • Underpowered studies → false negatives, unreliable estimates
  • Overconfident sample size estimates → overfitting, poor generalization
  • Unmanaged batch effects → spurious results
  • Ignoring multiple testing → inflated Type I error


Quick Rules of Thumb for Food Spectroscopy Studies

Minimum Sample Sizes

| Study Type | Minimum n per Class | Rationale | Notes |
| --- | --- | --- | --- |
| Proof-of-concept / feasibility | 15 | Small pilot; hypothesis generation only | High risk of overfitting; don't use for claims |
| Classification (exploratory) | 30 | Standard ML minimum; allows nested CV | Recommended starting point |
| Classification (confirmatory) | 50–100 | Supports rigorous cross-validation | Larger test set; robust metrics |
| Regression (exploratory) | 30–50 | Depends on feature count; rule: n ≥ 5 × n_features | Increase if high-dimensional |
| Regression (confirmatory) | 100–200 | Higher power; validates on diverse test set | Gold standard |

Replication Strategy

Standard recommendation: replicate each sample at three nested levels of variation:

For each biological sample:

  Level 1: Technical replicates (same vial, immediate rescans)
    → n_technical = 3
    → Captures instrument noise

  Level 2: Intra-day replicates (same sample, re-mounted)
    → n_intra_day = 3
    → Captures sample/mounting variability

  Level 3: Inter-day replicates (same sample, different days)
    → n_inter_day = 2–3
    → Captures temporal drift

  Total spectra per sample = 3 × 3 × 2 = 18 (or 3 × 3 × 3 = 27)

For n = 30 samples: total spectra = 30 × 18 = 540 (manageable)
For n = 100 samples: total spectra = 100 × 18 = 1800 (feasible; time-intensive)

Minimal replication (if resources constrained):

For each sample: 3 technical replicates only
  → Captures instrument noise
  → Does NOT capture biological/mounting variability
  → Use only for very robust discrimination (e.g., oils vs. non-oils)
  → Risk: Model may overfit to instrument-specific features
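
Whichever scheme is used, replicates must be collapsed to one unit per sample (or modeled as groups) before counting n. A minimal sketch, assuming a hypothetical long-format table with a sample_id column, a replicate column, and one column per spectral band:

```python
import pandas as pd

# Hypothetical long-format data: one row per acquired spectrum
spectra = pd.DataFrame({
    "sample_id": ["S01", "S01", "S01", "S02", "S02", "S02"],
    "replicate": [1, 2, 3, 1, 2, 3],
    "band_1655": [0.82, 0.80, 0.81, 0.91, 0.93, 0.92],
    "band_1742": [1.10, 1.12, 1.11, 1.05, 1.04, 1.06],
})

# Collapse technical replicates to one mean spectrum per sample, so
# downstream models see n = number of samples, not n = number of scans
mean_spectra = (
    spectra.drop(columns=["replicate"])
           .groupby("sample_id", as_index=False)
           .mean()
)
print(mean_spectra)
```

Averaging is the simplest option; the mixed-effects models and grouped cross-validation discussed later preserve the replicate structure instead.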

Sample Size Calculation: Formulas

For Classification (Diagnostic Accuracy)

If targeting a specific sensitivity (true positive rate) or specificity (true negative rate):

n = (z_α/2 + z_β)² × p(1-p) / d²

where:
  z_α/2 = critical value for significance level α (α=0.05 → z=1.96)
  z_β = critical value for power (1-β); power=0.80 → z=0.84; power=0.90 → z=1.28
  p = expected sensitivity/specificity (0–1)
  d = acceptable margin of error (e.g., d=0.10 for ±10%)

Example: Achieve 90% sensitivity ± 10% with 80% power
  n = (1.96 + 0.84)² × 0.90 × 0.10 / 0.10²
  n = 7.84 × 0.09 / 0.01
  n ≈ 71 samples (per class)
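
The same calculation is easy to script. A minimal sketch using SciPy's normal quantiles (the helper name n_per_class is ours, not a FoodSpec API):

```python
import math
from scipy.stats import norm

def n_per_class(p, d, alpha=0.05, power=0.80):
    """Samples per class to estimate a proportion p (sensitivity or
    specificity) within a margin of error d at the given alpha and power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for power = 0.80
    return math.ceil((z_alpha + z_beta) ** 2 * p * (1 - p) / d ** 2)

print(n_per_class(p=0.90, d=0.10))  # -> 71, matching the worked example
```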

For Regression (Predicting Continuous Outcome)

If targeting a specific correlation coefficient r (or equivalently R² = r²), use the Fisher z transformation:

n = [(z_α/2 + z_β) / C]² + 3

where:
  r = expected correlation coefficient (0–1)
  C = ½ × ln((1 + r) / (1 − r)), the Fisher z-transform of r
  other terms as above

Example: Detect correlation r=0.5 with 80% power
  C = ½ × ln(1.5 / 0.5) ≈ 0.549
  n = (2.80 / 0.549)² + 3 ≈ 29 samples (round up; increase to ≥30–50 for robust modeling)

Practical note: Add 10–20% buffer for dropouts/quality control failures.
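
The correlation version is equally short; a sketch (the helper name n_for_correlation is illustrative):

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Sample size to detect correlation r (vs. zero) via Fisher's z."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    c = math.atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

print(n_for_correlation(0.5))  # -> 30 (29.01 rounded up; the two-decimal
                               # hand calculation above gives ~29)
```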


Batch Effects and Confounding

What Are Batch Effects?

Systematic variations in spectral data caused by:

| Source | Impact | Example |
| --- | --- | --- |
| Instrument calibration | Baseline shifts, intensity scaling | Raman laser power drifts 5% over a day |
| Temperature | Peak shifts (~0.1–0.5 cm⁻¹/°C) | Room temperature varies 20–24 °C |
| Operator | Sampling technique, vial mounting | Different mounting pressure affects baseline |
| Sample storage | Oxidation, phase changes, settling | Oil exposed to light for 1 month before analysis |
| Time | Detector aging, baseline drift | Spectra acquired 6 months apart |
| Instrument geometry | Laser focus, fiber coupling | Focus position changed by 0.5 mm |

Magnitude: Batch effects can exceed biological signal; e.g., a 2% intensity shift from temperature variation may obscure a 1% difference due to sample composition.

Managing Batch Effects

Strategy 1: Design Orthogonality (Preferred)

Randomize batch assignment:

  • Assign samples to acquisition batches (days/instruments) randomly
  • Avoid confounding: don't put all "authentic" samples on day 1 and all "adulterated" samples on day 2
  • Include batch controls (same reference material) in every batch

Example randomization:
  Batch 1 (Day 1): 5 authentic, 5 adulterated, 1 reference material
  Batch 2 (Day 2): 5 authentic, 5 adulterated, 1 reference material
  Batch 3 (Day 1, different operator): 5 authentic, 5 adulterated, 1 reference material
  ...
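
A minimal sketch of generating such a layout (sample IDs and counts are hypothetical; the fixed seed makes the design reproducible and auditable):

```python
import random

# Hypothetical sample IDs; class labels are known at design time
authentic = [f"A{i:02d}" for i in range(1, 16)]
adulterated = [f"X{i:02d}" for i in range(1, 16)]

rng = random.Random(42)
rng.shuffle(authentic)
rng.shuffle(adulterated)

# Deal 5 authentic + 5 adulterated into each batch, plus the reference
batches = []
for b in range(3):
    batch = authentic[5 * b : 5 * (b + 1)] + adulterated[5 * b : 5 * (b + 1)]
    rng.shuffle(batch)   # randomize run order within the batch
    batch.append("REF")  # same reference material in every batch
    batches.append(batch)
```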

Strategy 2: Batch-Aware Cross-Validation

Use grouped cross-validation that respects the batch structure:

from sklearn.model_selection import GroupKFold

# Each acquisition batch is a group
group_cv = GroupKFold(n_splits=5)

for train_idx, test_idx in group_cv.split(X, y, groups=batch_ids):
    # No spectra from the same batch appear in both train and test,
    # so fold scores estimate generalization to unseen batches
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... fit and score the model on this fold ...

Strategy 3: Batch Correction (Advanced)

If batch effects cannot be removed by design, apply statistical batch correction (a minimal sketch follows the caveats below):

ComBat or SVA (Surrogate Variable Analysis):
  1. Fit a linear model: spectrum ~ class + batch
  2. Estimate batch coefficients
  3. Remove batch contribution: spectrum_corrected = spectrum - batch_effect

Caveats:
  - Assumes batch effect is additive (true for most cases)
  - Requires ≥10 samples per batch for reliable estimation
  - May remove true biological variation if confounded with batch
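
A minimal, location-only sketch of step 3, assuming additive batch effects and classes balanced across batches (this is not full ComBat: no empirical-Bayes shrinkage, no scale adjustment):

```python
import numpy as np

def remove_batch_means(X, batch_ids):
    """Subtract each batch's mean spectrum and restore the grand mean.

    X: array of shape (n_spectra, n_channels); batch_ids: length n_spectra.
    Location-only correction; assumes the batch effect is additive.
    """
    X = np.asarray(X, dtype=float)
    batch_ids = np.asarray(batch_ids)
    corrected = X.copy()
    grand_mean = X.mean(axis=0)
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        corrected[mask] += grand_mean - X[mask].mean(axis=0)
    return corrected
```

Because this sketch omits the class term from step 1, it removes real signal whenever class composition differs between batches, which is exactly the confounding caveat above.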

Multiple Testing and Multiple Comparisons

The Problem

If testing k independent hypotheses with significance level α per test, the family-wise error rate (FWER), the probability of at least one false positive when all null hypotheses are true, is:

FWER = 1 - (1 - α)^k

Example: 10 independent tests, α=0.05 per test
  FWER = 1 - (1 - 0.05)^10 = 1 - 0.60 ≈ 0.40
  → ~40% chance of at least one false positive!

Solutions

| Method | When to Use | Conservativeness |
| --- | --- | --- |
| Bonferroni | Few tests (k < 10) | Very conservative; high false-negative rate |
| Holm–Bonferroni | Few tests (k < 10) | Less conservative than Bonferroni |
| FDR (Benjamini–Hochberg) | Many tests (k > 10); exploratory | Balanced; controls the expected proportion of false discoveries among significant results |
| Permutation test | No distributional assumptions | Flexible; computationally intensive |

Recommendation for FoodSpec:

  • If comparing k spectral bands or features for significance: use FDR correction (q < 0.05), as sketched after this list
  • If performing k cross-validation folds: no correction needed (folds not independent tests)
  • If fitting k models and reporting best: use nested CV to avoid selection bias (no correction needed)
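
A minimal sketch of the FDR route using statsmodels (the p-values here are hypothetical placeholders for k per-band tests):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from k univariate band-wise tests
p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.045, 0.300, 0.700])

# Benjamini-Hochberg correction, controlling FDR at q < 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, q, sig in zip(p_values, p_adjusted, reject):
    print(f"p = {p:.3f}  ->  q = {q:.3f}  significant: {sig}")
```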

Effect Size and Practical Significance

Distinguishing Statistical vs. Practical Significance

A large sample can detect tiny differences:

Example: Comparing oil authenticity scores

Method A (n=10 per group):
  Authentic: mean=90, SD=5
  Adulterated: mean=88, SD=5
  t-test: p ≈ 0.38 (not significant)

Method B (n=1000 per group):
  Authentic: mean=90, SD=5
  Adulterated: mean=88, SD=5
  t-test: p < 0.001 (significant!)

But the 2-point difference is tiny—does it matter?
  → Effect size (Cohen's d) = 2/5 = 0.4 (small–medium)
  → Decision: Statistically significant but may not be practically meaningful
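
A minimal sketch reproducing Method B in code (simulated data with the stated means and SDs; Cohen's d uses the pooled SD):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
authentic = rng.normal(90, 5, size=1000)    # simulated authenticity scores
adulterated = rng.normal(88, 5, size=1000)

t, p = ttest_ind(authentic, adulterated)

# Cohen's d: mean difference in pooled-SD units
pooled_sd = np.sqrt((authentic.var(ddof=1) + adulterated.var(ddof=1)) / 2)
d = (authentic.mean() - adulterated.mean()) / pooled_sd
print(f"p = {p:.1e}, Cohen's d = {d:.2f}")  # p is tiny, but d stays ~0.4
```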

Always report effect sizes:

| Measure | Interpretation |
| --- | --- |
| Cohen's d | 0.2 = small; 0.5 = medium; 0.8 = large |
| Cohen's f (ANOVA) | 0.1 = small; 0.25 = medium; 0.4 = large |
| Cramér's V (categorical) | 0.1 = small; 0.3 = medium; 0.5 = large |
| AUC (diagnostic) | 0.7–0.8 = fair; 0.8–0.9 = good; >0.9 = excellent |

Study Design Checklist

  • [ ] Sample size justified (power analysis or literature-based)
  • [ ] Replication strategy defined (technical, intra-day, inter-day; minimum 3 replicates)
  • [ ] Batch randomization (samples assigned randomly to batches, not confounded by class)
  • [ ] Batch controls (reference material in every batch)
  • [ ] Temporal separation (training and test from different dates/batches if possible)
  • [ ] Operator blinding (if feasible; prevents conscious/unconscious bias)
  • [ ] Exclusion criteria pre-defined (what spectra are unacceptable? why?)
  • [ ] Validation strategy selected (nested CV, hold-out test set, external cohort, or combination)
  • [ ] Multiple testing correction (if k > 5 independent tests, apply FDR or permutation test)
  • [ ] Effect sizes calculated (not just p-values)
  • [ ] Metadata logged (operator, date, temperature, instrument settings, storage conditions)

When Statistical Assumptions Are Violated

Non-Normality

Spectral intensity distributions are often non-normal (skewed, multimodal).

| Consequence | Solution |
| --- | --- |
| t-tests unreliable; confidence intervals wrong | Use non-parametric tests (Mann–Whitney U, Kruskal–Wallis) or permutation tests |
| Linear regression biased | Use robust regression (Huber, RANSAC) or quantile regression |
| PCA components distorted by outliers and heavy tails | Use ICA or other non-Gaussian component models (usually not necessary in practice) |
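
A minimal sketch of the non-parametric route, with simulated skewed (lognormal) intensities standing in for real spectral features:

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(1)
# Simulated right-skewed intensity data (lognormal, clearly non-normal)
group_a = rng.lognormal(mean=2.0, sigma=0.4, size=30)
group_b = rng.lognormal(mean=2.2, sigma=0.4, size=30)
group_c = rng.lognormal(mean=2.1, sigma=0.4, size=30)

# Two groups: Mann-Whitney U makes no normality assumption
u_stat, p_mw = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Three or more groups: Kruskal-Wallis
h_stat, p_kw = kruskal(group_a, group_b, group_c)
print(f"Mann-Whitney p = {p_mw:.3f}, Kruskal-Wallis p = {p_kw:.3f}")
```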

Heteroscedasticity (Unequal Variance)

Spectral data often have intensity-dependent noise (SNR improves with peak height).

| Consequence | Solution |
| --- | --- |
| t-tests and ANOVA unreliable | Use Welch's t-test (doesn't assume equal variance) or robust methods |
| Confidence intervals wrong | Use bootstrap or permutation CIs |
| Linear regression biased | Use weighted least squares or robust regression |
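
In SciPy, Welch's correction is a single argument; a minimal sketch with simulated unequal-variance groups:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
# Simulated groups with very different spread (heteroscedastic)
low_noise = rng.normal(100, 2, size=40)
high_noise = rng.normal(103, 8, size=40)

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t, p = ttest_ind(low_noise, high_noise, equal_var=False)
print(f"Welch's t = {t:.2f}, p = {p:.4f}")
```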

Autocorrelation (Within-Sample Dependence)

Replicates from the same sample are correlated.

| Consequence | Solution |
| --- | --- |
| Effective sample size overstated; false confidence | Use mixed-effects models (sample as random effect) |
| Standard errors understated | Compute clustered SEs (by sample) |
| Simple CV unreliable | Use grouped/leave-one-group-out CV (by sample) |
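
A minimal mixed-effects sketch using statsmodels, with simulated replicates in which rows from the same sample share an offset (all column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
# 20 samples x 3 replicates: replicates inherit a per-sample offset,
# so the 60 rows are not independent observations
sample_id = np.repeat(np.arange(20), 3)
sample_offset = rng.normal(0, 1.0, size=20)[sample_id]
x = rng.normal(0, 1, size=60)
y = 2.0 * x + sample_offset + rng.normal(0, 0.5, size=60)
df = pd.DataFrame({"y": y, "x": x, "sample_id": sample_id})

# Sample as a random intercept: the fixed effect of x gets standard
# errors that respect the within-sample correlation
result = smf.mixedlm("y ~ x", df, groups=df["sample_id"]).fit()
print(result.summary())
```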


Quick Reference: Sample Size Table (Classification)

Minimum n per class for 80% power, α=0.05, assuming expected sensitivity/specificity of 0.85 (values computed with the classification formula above):

| Margin of Error | Required n |
| --- | --- |
| ±20% (rough estimate) | 25 |
| ±15% (typical) | 45 |
| ±10% (careful study) | 100 |
| ±7% (rigorous study) | 204 |
| ±5% (gold standard) | 400 |

Rule of thumb: Aim for ≥30–50 per class as default; increase to ≥100 if claims are high-stakes or if effect size uncertain.