# Metric Interpretation & Significance Tables
**Context**

- **Purpose:** Comprehensive reference for interpreting classification, regression, and statistical metrics, with significance thresholds, context-dependent guidance, and quality criteria.
- **Audience:** Analysts, reviewers, quality control labs, regulatory auditors
- **Prerequisites:** Basic statistics knowledge (means, standard deviations, p-values)
- **Related:** Metrics & Evaluation | Statistical Power | Hypothesis Testing
## Overview
Metric values alone are insufficient: context determines significance. A p-value of 0.001 means nothing if the effect size is negligible. An AUC of 0.95 is excellent for balanced data but may hide poor minority-class performance. This page provides:
- Threshold tables for classification, regression, and effect sizes
- Context-dependent interpretation (sample size, class balance, domain)
- Significance tests to compare models or validate improvements
- Red flags indicating when metrics mislead
## Classification Metrics
### Performance Thresholds
| Metric | Range | Poor | Fair | Good | Excellent | Recommended Significance Test |
|---|---|---|---|---|---|---|
| Accuracy | 0–1 | <0.7 | 0.7–0.8 | 0.8–0.9 | >0.9 | McNemar's test (paired), permutation test |
| F1 Score | 0–1 | <0.6 | 0.6–0.75 | 0.75–0.85 | >0.85 | Bootstrap CI on F1 |
| AUC-ROC | 0–1 | <0.7 | 0.7–0.8 | 0.8–0.9 | >0.9 | DeLong's test (paired AUCs) |
| Precision | 0–1 | <0.7 | 0.7–0.85 | 0.85–0.95 | >0.95 | Binomial confidence interval |
| Recall (Sensitivity) | 0–1 | <0.6 | 0.6–0.8 | 0.8–0.9 | >0.9 | Binomial confidence interval |
| Specificity | 0–1 | <0.6 | 0.6–0.8 | 0.8–0.9 | >0.9 | Binomial confidence interval |
| Balanced Accuracy | 0–1 | <0.65 | 0.65–0.75 | 0.75–0.85 | >0.85 | Permutation test |
| Matthews Correlation (MCC) | −1 to 1 | <0.3 | 0.3–0.5 | 0.5–0.7 | >0.7 | Bootstrap CI on MCC |
| Cohen's Kappa | −1 to 1 | <0.4 | 0.4–0.6 | 0.6–0.8 | >0.8 | Bootstrap CI on Kappa |
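The tests in the last column can be run directly in Python. Below is a minimal sketch of a bootstrap 95% CI on F1 and McNemar's test for two paired classifiers; `y_true`, `pred_a`, and `pred_b` are synthetic placeholders, not outputs from a real model.

```python
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # placeholder labels
pred_a = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)  # ~85% correct
pred_b = np.where(rng.random(200) < 0.78, y_true, 1 - y_true)  # ~78% correct

# Bootstrap 95% CI on F1 for model A (resample cases with replacement)
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))
    boot.append(f1_score(y_true[idx], pred_a[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"F1 = {f1_score(y_true, pred_a):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")

# McNemar's test: 2x2 agreement/disagreement table for the paired classifiers
a_ok, b_ok = pred_a == y_true, pred_b == y_true
table = [[int(np.sum(a_ok & b_ok)),  int(np.sum(a_ok & ~b_ok))],
         [int(np.sum(~a_ok & b_ok)), int(np.sum(~a_ok & ~b_ok))]]
print(mcnemar(table, exact=True))   # small p-value -> the two accuracies differ
```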
### Context-Dependent Interpretation
#### Imbalanced Data (Class Ratio > 3:1)
- DO NOT use accuracy (misleading when minority class is important)
- Prefer: F1, Balanced Accuracy, AUC, MCC, Kappa
- Report: Per-class precision/recall + confusion matrix (see the sketch below)
- Example: 95% accuracy on adulteration detection at 1% prevalence can still mean 0% recall on adulterants
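A minimal sketch of the per-class reporting recommended above, using scikit-learn; the labels simulate a majority-class-only model and are illustrative only.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix, matthews_corrcoef)

y_true = np.array([0] * 95 + [1] * 5)   # 5% minority class (e.g., adulterated)
y_pred = np.zeros(100, dtype=int)       # naive model: always predicts "pure"

print(confusion_matrix(y_true, y_pred))                                # all positives missed
print(classification_report(y_true, y_pred, zero_division=0))         # per-class precision/recall
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.5 (chance level)
print("MCC:", matthews_corrcoef(y_true, y_pred))                      # 0.0
# Plain accuracy is 0.95 here, which hides the 0% recall on the minority class.
```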
#### Rare Event Detection (Prevalence < 5%)
- Prioritize recall over precision (minimize false negatives)
- Accept lower precision (tolerate false alarms)
- Use: Cost-sensitive learning (higher penalty for false negatives), sketched below
- Example: Food safety (miss no contaminated samples, even if many false positives)
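One common way to implement cost-sensitive learning is through class weights. The sketch below uses scikit-learn's `class_weight` with an illustrative 20:1 penalty on missed positives; the synthetic data and the exact weight are assumptions, not a validated setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic rare-event data: ~3% positives
X, y = make_classification(n_samples=2000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Penalize false negatives ~20x more than false positives
clf = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
print("Recall:   ", round(recall_score(y_te, y_hat), 2))     # prioritized
print("Precision:", round(precision_score(y_te, y_hat), 2))  # lower, but tolerated
```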
#### Screening vs. Confirmation
- Screening: High recall (>0.95), lower precision acceptable
- Confirmation: High precision (>0.95), lower recall acceptable
- Two-stage workflow: Screen broadly → confirm positives with reference method
#### Small Sample Size (n < 100)
- Wide confidence intervals (report 95% CI, not just point estimate)
- Beware overfitting: Cross-validated metrics < training metrics
- Use: Nested cross-validation or repeated CV (5×2 or 10×5); see the sketch below
- Red flag: Training accuracy = 1.0, test accuracy < 0.8
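A minimal sketch of repeated stratified CV for a small dataset, assuming a generic scikit-learn pipeline; the SVC model and synthetic data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=30, random_state=0)  # n < 100
model = make_pipeline(StandardScaler(), SVC())

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)   # 5-fold CV, repeated 10x
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
print(f"Balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
# Report the spread (or a 95% CI), not just the mean: with n < 100 the interval is wide.
```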
## Regression Metrics
### Performance Thresholds
| Metric | Range | Poor | Fair | Good | Excellent | Significance Test |
|---|---|---|---|---|---|---|
| R² (Coefficient of Determination) | 0–1 | <0.5 | 0.5–0.7 | 0.7–0.85 | >0.85 | F-test (nested models), permutation test |
| Adjusted R² | 0–1 | <0.4 | 0.4–0.6 | 0.6–0.8 | >0.8 | Compare with R² (overfitting check) |
| RMSE (Root Mean Squared Error) | 0–∞ | >2σ | 1–2σ | 0.5–1σ | <0.5σ | Bootstrap CI; compare to baseline RMSE |
| MAE (Mean Absolute Error) | 0–∞ | >1.5σ | 0.75–1.5σ | 0.4–0.75σ | <0.4σ | Bootstrap CI |
| MAPE (Mean Absolute % Error) | 0–∞ | >20% | 10–20% | 5–10% | <5% | Bootstrap CI |
| Q² (Cross-validated R²) | −∞ to 1 | <0.4 | 0.4–0.6 | 0.6–0.8 | >0.8 | Compare with R² (overfitting = R² >> Q²) |
σ = standard deviation of the target variable (y)
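A minimal sketch of these regression metrics, including a cross-validated Q², assuming a PLS-style model in scikit-learn; the synthetic data and the component count are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=120, n_features=50, noise=5.0, random_state=0)
model = PLSRegression(n_components=5)

y_fit = model.fit(X, y).predict(X).ravel()            # training-set fit
y_cv = cross_val_predict(model, X, y, cv=10).ravel()  # cross-validated predictions

r2 = r2_score(y, y_fit)                               # training R^2
q2 = r2_score(y, y_cv)                                # cross-validated Q^2
rmse_cv = np.sqrt(mean_squared_error(y, y_cv))
mae_cv = mean_absolute_error(y, y_cv)
sigma_y = y.std(ddof=1)                               # compare errors against sigma of y
print(f"R2={r2:.3f}  Q2={q2:.3f}  RMSECV={rmse_cv:.2f}  MAECV={mae_cv:.2f}  sigma_y={sigma_y:.2f}")
# An R^2 - Q^2 gap > 0.2 points to overfitting (see Overfitting Detection below).
```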
### Context-Dependent Interpretation
#### Small vs. Large Datasets
- Small (n < 50): R² overoptimistic → report cross-validated Q²
- Large (n > 1000): Tiny R² improvements statistically significant but not practically meaningful
- Check effect size: RMSE reduction > 10% of σ for practical significance
#### Heteroscedastic Data (Error Variance ≠ Constant)
- RMSE penalizes large errors (sensitive to outliers)
- MAE more robust to heteroscedasticity
- Use: Weighted least squares or quantile regression
#### Multicollinearity (VIF > 5)
- R² inflated (overfitting to correlated predictors)
- Report VIF (Variance Inflation Factor) for all predictors
- Use: Ridge regression (L2 penalty) or PLS (latent variables); see the sketch below
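A minimal sketch of the VIF check and a ridge fallback, using statsmodels and scikit-learn; the two nearly collinear predictors are synthetic.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import Ridge
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 2 * x1 + 0.5 * x3 + rng.normal(size=200)

Xc = sm.add_constant(X)                     # VIF is computed against an intercept model
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, "VIF =", round(variance_inflation_factor(Xc.values, i), 1))  # flag VIF > 5

ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty stabilizes correlated coefficients
print("Ridge coefficients:", ridge.coef_.round(2))
```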
### Overfitting Detection
| Symptom | Diagnosis | Remedy |
|---|---|---|
| R² >> Q² (gap > 0.2) | Overfitting to training data | Increase λ penalty, reduce features, add data |
| Adj. R² << R² (gap > 0.1) | Too many features for sample size | Remove low-importance features |
| Training R² = 1.0 | Perfect fit (memorization) | Check for data leakage or duplicate samples |
## Statistical Significance & p-values
### p-value Interpretation
| p-value | Interpretation | Caveat |
|---|---|---|
| p < 0.001 | Highly statistically significant | Check effect size (small effect + large n → low p) |
| p < 0.01 | Statistically significant | Not proof of causation; check assumptions |
| p < 0.05 | Marginally significant | 5% false positive rate (α); correct for multiple tests |
| p > 0.05 | Not statistically significant | ≠ proof of no effect (may be underpowered) |
### Critical Warnings
**When p-values Mislead**
- Small effect + large n → p < 0.001 but practically irrelevant (e.g., AUC difference = 0.51 vs. 0.50)
- Large effect + small n → p > 0.05 but important (e.g., 50% improvement, but n = 15)
- Multiple testing → p < 0.05 expected by chance (5% false positive rate)
- p-hacking → trying many tests until p < 0.05 (cherry-picking)
Solution: Always report the effect size and a confidence interval alongside the p-value.
## Effect Sizes
### Thresholds & Interpretation
| Effect Size | Cohen's d | η² (ANOVA) | Cramér's V | Interpretation | Practical Significance |
|---|---|---|---|---|---|
| Negligible | <0.2 | <0.01 | <0.1 | Trivial difference | Unlikely to matter in practice |
| Small | 0.2–0.5 | 0.01–0.06 | 0.1–0.3 | Detectable with instruments | May matter in high-precision tasks |
| Medium | 0.5–0.8 | 0.06–0.14 | 0.3–0.5 | Moderate difference | Relevant for quality control |
| Large | 0.8–1.2 | 0.14–0.20 | 0.5–0.7 | Substantial difference | Clear practical impact |
| Very Large | >1.2 | >0.20 | >0.7 | Dominant effect | Major effect; check for artifacts |
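A minimal sketch of Cohen's d with a pooled standard deviation, to pair with the thresholds above; the two groups are synthetic placeholders.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation (Cohen, 1988)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=40)   # e.g., reference samples
group_b = rng.normal(11.4, 2.0, size=40)   # e.g., treated samples, shifted ~0.7 SD
print(f"Cohen's d = {cohens_d(group_b, group_a):.2f}")   # ~0.5-0.8 -> 'medium' per the table
```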
### When to Use Which Effect Size
| Scenario | Recommended Effect Size | Reason |
|---|---|---|
| Two-group comparison (t-test) | Cohen's d | Standardized mean difference (unit-free) |
| ANOVA (3+ groups) | η² (eta-squared) | Proportion of variance explained |
| Categorical association (Chi-square) | Cramér's V | Normalized for table size |
| Regression | R² or f² (Cohen's f-squared) | Variance explained by predictor(s) |
| Non-parametric tests | rank-biserial r | Rank-based effect size |
### Effect Size + p-value Decision Matrix
| Effect Size | p-value | Interpretation | Action |
|---|---|---|---|
| Large | p < 0.05 | Significant + meaningful | Proceed with confidence |
| Large | p > 0.05 | Likely underpowered | Increase sample size; may still be real |
| Small | p < 0.05 | Significant but trivial | Caution: statistically significant ≠ important |
| Small | p > 0.05 | Not significant, not meaningful | No evidence of effect |
## Sample Size & Power
### Minimum Sample Size Guidelines
| Analysis Type | Minimum n | Preferred n | Rationale |
|---|---|---|---|
| t-test | 30 per group | 50 per group | Central Limit Theorem (CLT) applies at n ≈ 30 |
| ANOVA (3 groups) | 20 per group | 30 per group | Equal group sizes; robust to non-normality at n ≥ 20 |
| Linear regression | 10–15 per predictor | 20 per predictor | Prevent overfitting (Harrell's rule: n/10) |
| PLS-DA | 5–10 per latent variable | 10–15 per LV | Spectroscopy rule: 5–10 samples per LV |
| Cross-validation | 50 total | 100 total | Stratified 5-fold requires ≥ 10 per fold |
### Power Analysis (Detecting True Effects)
Statistical power = probability of detecting an effect when it exists (1 − β)
| Power | Interpretation | Consequence of Low Power |
|---|---|---|
| 0.80 (80%) | Conventional minimum | 20% chance of false negative (Type II error) |
| 0.90 (90%) | Recommended for critical decisions | 10% chance of false negative |
| <0.50 (50%) | Underpowered (coin flip) | High risk of missing real effects |
Factors affecting power:
- Sample size (n) ↑ → Power ↑
- Effect size ↑ → Power ↑
- Significance level (α) ↑ → Power ↑ (but more false positives)
- Variability (σ) ↓ → Power ↑
Power calculation tools:
- R: `pwr` package
- Python: `statsmodels.stats.power`
- Online: G*Power calculator
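A minimal sketch of a power calculation with `statsmodels.stats.power`, assuming a two-sample t-test design; the effect size and targets below are illustrative, not prescriptions.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group to detect d = 0.5 at alpha = 0.05 with 80% power
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {n_per_group:.0f}")    # ~64

# Achieved power with only 30 per group
achieved = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"Power at n = 30 per group: {achieved:.2f}")  # ~0.48 (underpowered)
```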
## Multiple Testing Correction
### When Multiple Tests Inflate False Positives
Testing 100 hypotheses at α = 0.05 → expect 5 false positives by chance.
Example: Testing 1000 wavenumbers for group differences → ~50 false positives at α = 0.05.
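Assuming independent tests, the family-wise error rate grows as 1 − (1 − α)^m; a quick check:

```python
# Family-wise error rate under m independent tests at alpha = 0.05
alpha, m = 0.05, 100
fwer = 1 - (1 - alpha) ** m
print(f"P(at least one false positive over {m} tests) = {fwer:.3f}")  # ~0.994
print(f"Expected number of false positives = {alpha * m:.0f}")        # 5
```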
### Correction Methods
| Method | Control | Description | Use When |
|---|---|---|---|
| Bonferroni | FWER | α_corrected = α / m (m = number of tests) | Few tests (<20); very conservative |
| Holm-Bonferroni | FWER | Step-down Bonferroni (less conservative) | Moderate number of tests (20–100) |
| Benjamini-Hochberg (BH) | FDR | Controls false discovery rate | Many tests (>100); exploratory analysis |
| Permutation tests | Exact p-value | Empirical null distribution | Non-parametric; computational cost OK |
FWER = Family-Wise Error Rate (probability of ≥ 1 false positive)
FDR = False Discovery Rate (expected proportion of false positives)
### Bonferroni Example
- α = 0.05, m = 20 tests
- α_corrected = 0.05 / 20 = 0.0025
- Reject H₀ if p < 0.0025 (not 0.05)
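A minimal sketch applying Bonferroni, Holm, and Benjamini-Hochberg corrections with `statsmodels`; the p-values are placeholders.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.020, 0.035, 0.049, 0.120, 0.310, 0.440])

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: reject={reject.astype(int)}  p_adj={np.round(p_adj, 3)}")
# 'bonferroni' and 'holm' control the FWER; 'fdr_bh' controls the FDR and rejects more.
```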
## Red Flags: When Metrics Mislead
### Classification Red Flags
| Red Flag | Problem | Solution |
|---|---|---|
| Accuracy = 99%, but 99% majority class | Predicting majority class only | Use balanced accuracy, F1, AUC |
| AUC = 0.95, but precision = 0.1 | Imbalanced data; low positive predictive value | Report precision-recall curve + F1 |
| Training accuracy = 1.0, test = 0.7 | Severe overfitting | Reduce model complexity, add regularization |
| All samples predicted as one class | Model collapsed | Check class weights, data balance, learning rate |
### Regression Red Flags
| Red Flag | Problem | Solution |
|---|---|---|
| R² = 0.95, but predictions = mean(y) ± noise | Overfitting to noise | Check cross-validated Q²; reduce features |
| RMSE < measurement noise | Perfect fit (suspicious) | Check for data leakage or duplicate samples |
| Predictions outside physical range | Model extrapolating | Add domain constraints (e.g., clip to [0, 100]%) |
| Residuals strongly patterned (not random) | Model misspecification | Add nonlinear terms, interaction terms |
### Statistical Red Flags
| Red Flag | Problem | Solution |
|---|---|---|
| p = 0.049 (just below 0.05) | p-hacking or cherry-picking | Report effect size + CI; pre-register analysis |
| Significant p-value but Cohen's d < 0.2 | Large sample, trivial effect | Focus on effect size, not p-value |
| Non-significant but Cohen's d > 0.8 | Underpowered study | Increase sample size; report CI on effect size |
| 50 tests, 3 "significant" at α = 0.05 | False positives by chance | Apply Bonferroni or FDR correction |
## Domain-Specific Thresholds
### Food Spectroscopy (FTIR/Raman/NIR)
| Application | Metric | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Oil Authentication | AUC | >0.85 | >0.90 | >0.95 |
| Adulteration Detection | Recall | >0.90 | >0.95 | >0.98 |
| Quantitative Analysis (e.g., fat%) | RMSE | <1% | <0.5% | <0.2% |
| Heating Quality | RΒ² | >0.70 | >0.80 | >0.90 |
| Batch QC | Specificity | >0.95 | >0.98 | >0.99 |
### Compliance & Regulatory Standards
| Regulation | Requirement | FoodSpec Metric | Threshold |
|---|---|---|---|
| FDA 21 CFR Part 11 | Validated methods | Cross-validated AUC | >0.90 |
| ISO 17025 | Measurement uncertainty | RMSE / σ_reference | <0.20 |
| AOAC Guidelines | Recovery | Recall (sensitivity) | >0.95 |
| EU Regulation 2017/625 | False positive rate | 1 - Specificity | <0.05 |
## Metric Selection Guide
### Choose Metrics Based on Task
| Task | Primary Metric | Secondary Metrics | Avoid |
|---|---|---|---|
| Binary Classification (balanced) | AUC-ROC, F1 | Accuracy, MCC | Precision or recall alone |
| Binary Classification (imbalanced) | AUC-PR, Balanced Accuracy | F1, MCC, Cohen's Kappa | Accuracy |
| Multiclass Classification | Macro F1, Cohen's Kappa | Per-class F1, confusion matrix | Micro F1 (same as accuracy) |
| Quantitative Regression | R², RMSE | MAE, Q² (CV) | MAPE (if y = 0 possible) |
| Mixture Analysis | RMSE, MAPE | RΒ², Bland-Altman | Single-point accuracy |
| Hypothesis Testing | p-value + Effect size | Confidence intervals | p-value alone |
## Reporting Checklist
When reporting metrics, ALWAYS include:
- [x] Primary metric with 95% CI (e.g., AUC = 0.92 [0.87, 0.96])
- [x] Sample size (n_train, n_test, n_folds if CV)
- [x] Class balance (for classification: ratio, counts per class)
- [x] Validation strategy (e.g., stratified 5-fold CV, holdout 20%)
- [x] Effect size (for hypothesis tests: Cohen's d, Ξ·Β², etc.)
- [x] Context interpretation ("Excellent for balanced data" or "Fair given small n")
- [x] Comparison to baseline (e.g., "20% improvement over random classifier")
## References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.
- Eilers, P. H., & Boelens, H. F. (2005). Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre Report.
- Japkowicz, N., & Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press.
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129-133.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
## See Also
- Metrics & Evaluation – Overview of all metrics
- Model Evaluation – Cross-validation strategies
- Hypothesis Testing – Statistical tests and p-values
- Statistical Power – Sample size planning
- T-tests & Effect Sizes – Detailed effect size guide