Basque LLM Evaluation
A public benchmark dashboard for comparing the performance of local LLMs on Basque-language tasks.
Table 1. Classification results
Eval family
Skill category
# | Model | Quantization | Overall | Evals (grouped)
Accuracy reported as mean ± std across random seeds.
Figures — Classification
Figure 1. Overall accuracy by model
Figure 2. Accuracy profile by classification eval
Skill cards — Best models by skill
Winner score and margin over the runner-up for each skill.
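The winner score and margin shown on each card can be derived from a per-skill ranking. The sketch below is illustrative only: the function name `skill_card` and the model names and scores are hypothetical, and it assumes at least two models have a score for the skill.

```python
def skill_card(scores):
    """Return (winner, winner_score, margin over runner-up) for one skill.

    `scores` maps model name -> skill score. Hypothetical helper; requires
    at least two models.
    """
    if len(scores) < 2:
        raise ValueError("need at least two models to compute a margin")
    # Sort models from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (winner, top), (_, runner_up) = ranked[0], ranked[1]
    return winner, top, top - runner_up


# Made-up scores for illustration, not dashboard results.
example = {"model_a": 0.82, "model_b": 0.79, "model_c": 0.75}
winner, score, margin = skill_card(example)
print(f"{winner}: {score:.2f} (+{margin:.2f} over runner-up)")
```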
Skill view — Ranking by selected skill
# | Model | Skill score | Benchmarks used
Figure 3. Translation heatmap (chrF / BLEU)
Cell color is scaled by score within each metric (chrF and BLEU independently). Higher is better.
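One way to scale cell colors independently per metric is min-max normalization within each metric, sketched below. The dashboard does not state which scaling it uses, so min-max is an assumption here, and the function name `scale_within_metric` and the sample scores are hypothetical.

```python
def scale_within_metric(cells):
    """Scale scores to [0, 1] separately for each metric.

    `cells` maps metric name -> {cell key: score}. Min-max scaling is an
    assumption; the source only says colors are scaled within each metric.
    """
    scaled = {}
    for metric, values in cells.items():
        lo, hi = min(values.values()), max(values.values())
        span = hi - lo
        # If every score is equal, map all cells to 0.0 to avoid dividing by zero.
        scaled[metric] = {
            key: (value - lo) / span if span else 0.0
            for key, value in values.items()
        }
    return scaled


# Made-up scores for illustration, not dashboard results.
heatmap = {
    "chrF": {"model_a": 40.0, "model_b": 60.0, "model_c": 50.0},
    "BLEU": {"model_a": 10.0, "model_b": 30.0, "model_c": 20.0},
}
colors = scale_within_metric(heatmap)
print(colors["chrF"], colors["BLEU"])
```

Because each metric is normalized on its own range, a mid-range chrF score and a mid-range BLEU score get the same color intensity even though their raw values differ.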
Figures — Timeline
Figure 4. Overall accuracy by release date
Evaluation protocol
Family | Benchmark | What is measured | Metric | Label space
Methodology
Aspect | Detail
Models evaluated | 11
Classification benchmarks | 12 (BasqueGLUE, LatxaEval, EusTrivia, XNLIeu, MMLU, BertaQA, MGSM)
Translation benchmarks | 4 (FLORES: EU↔EN, EU↔ES)
Items per benchmark | 80
Classification items per model | 960 (12 × 80)
Translation items per model | 320 (4 × 80)
Seeds | 42, 123, 777
Evaluations per model (classification) | 3 seeds × 80 items = 240 per benchmark
Classification metric | Accuracy
Translation metric | chrF / BLEU
Averaging | Equal-weight mean across benchmarks and seeds
DeepSeek V4 Pro | Additional temperature sweep (0.3 / 0.7) per seed, totalling 6 runs
Scores are computed as equal-weight averages across benchmarks within a family or skill. Error bars show ±1 standard deviation across seeds. Models are sorted by overall classification accuracy. Translation benchmarks are reported separately because they use different metrics (chrF/BLEU).
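The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: it first averages benchmarks with equal weight within each seed, then reports the mean and the sample standard deviation (n − 1 denominator) of those per-seed averages. The source does not say whether the sample or population standard deviation is used, and the function name, benchmark names, and accuracies are hypothetical.

```python
from statistics import mean, stdev


def overall_accuracy(per_seed_scores):
    """Return (mean, std) of the equal-weight benchmark average across seeds.

    `per_seed_scores` maps seed -> {benchmark: accuracy}. Uses the sample
    standard deviation, which is an assumption about the protocol.
    """
    # Equal-weight mean across benchmarks, computed separately for each seed.
    seed_means = [mean(bench.values()) for bench in per_seed_scores.values()]
    return mean(seed_means), stdev(seed_means)


# Made-up accuracies for the three seeds listed in the methodology.
scores = {
    42: {"bench_a": 0.50, "bench_b": 0.70},
    123: {"bench_a": 0.60, "bench_b": 0.80},
    777: {"bench_a": 0.70, "bench_b": 0.90},
}
avg, std = overall_accuracy(scores)
print(f"{avg:.3f} ± {std:.3f}")
```

Since every benchmark has the same number of items (80), averaging per seed first and then across seeds gives the same mean as pooling all benchmark-seed pairs; only the standard deviation depends on the order of aggregation.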