Basque LLM Evaluation

A public benchmark dashboard for comparing the performance of local LLMs on Basque-language tasks.

Table 1. Classification results

#	Model	Quantization	Overall	Evals (grouped)

Accuracy is reported as mean ± standard deviation across random seeds.

Figures — Classification

Figure 1. Overall accuracy by model

Figure 2. Accuracy profile by classification eval

Skill cards — Best models by skill

Winner score and margin over the runner-up for each skill.
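As a rough illustration of how a skill card could be derived, the sketch below picks the top-scoring model for one skill and computes its margin over the runner-up. The function name `skill_card`, the input shape (model name mapped to skill score), and the example values are all hypothetical; the dashboard does not publish its implementation.

```python
from typing import Optional


def skill_card(skill_scores: dict[str, float]) -> tuple[str, float, Optional[float]]:
    """Return (winner, winner_score, margin_over_runner_up) for one skill.

    `skill_scores` maps a model name to its score on the skill (hypothetical shape).
    The margin is None when only one model has a score.
    """
    if not skill_scores:
        raise ValueError("at least one model score is required")
    # Sort models from best to worst score.
    ranked = sorted(skill_scores.items(), key=lambda kv: kv[1], reverse=True)
    winner, best = ranked[0]
    margin = best - ranked[1][1] if len(ranked) > 1 else None
    return winner, best, margin


# Made-up scores, for illustration only.
example = {"model_a": 0.75, "model_b": 0.5, "model_c": 0.25}
card = skill_card(example)
```

A small margin suggests the top two models are hard to separate on that skill, so it is worth reading alongside the seed-to-seed standard deviation.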

Skill view — Ranking by selected skill

#	Model	Skill score	Benchmarks used

Figure 3. Translation heatmap (chrF / BLEU)

Cell color is scaled by score within each metric (chrF and BLEU independently). Higher is better.
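One plausible reading of "scaled within each metric" is min-max normalization applied to chrF and BLEU separately, so that a cell's color depends only on other cells of the same metric. The sketch below assumes that reading; the dashboard may use a different scaling, and the function name and data layout are hypothetical.

```python
def scale_within_metric(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Min-max scale each metric's cells to [0, 1] independently.

    `scores` maps a metric name ("chrF", "BLEU") to {cell label: raw score}.
    Min-max scaling is an assumption, not the dashboard's documented method.
    """
    scaled = {}
    for metric, cells in scores.items():
        lo, hi = min(cells.values()), max(cells.values())
        span = hi - lo
        # If every cell has the same score, use a neutral mid-scale color.
        scaled[metric] = {
            label: (value - lo) / span if span else 0.5
            for label, value in cells.items()
        }
    return scaled


# Made-up scores, for illustration only.
raw = {
    "chrF": {"m1": 40.0, "m2": 60.0, "m3": 50.0},
    "BLEU": {"m1": 10.0, "m2": 20.0, "m3": 15.0},
}
colors = scale_within_metric(raw)
```

Scaling each metric on its own range keeps BLEU cells from looking uniformly pale next to chrF, whose raw values are typically larger.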

Figures — Timeline

Figure 4. Overall accuracy by release date

Evaluation protocol

Family	Benchmark	What is measured	Metric	Label space

Methodology

Aspect	Detail
Models evaluated	11
Classification benchmarks	12 tasks drawn from BasqueGLUE, LatxaEval, EusTrivia, XNLIeu, MMLU, BertaQA, and MGSM
Translation benchmarks	4 (FLORES: EU↔EN, EU↔ES)
Items per benchmark	80
Classification items per model	960 (12 × 80)
Translation items per model	320 (4 × 80)
Seeds	42, 123, 777
Evaluations per model (classification)	3 seeds × 80 items = 240 per benchmark
Classification metric	Accuracy
Translation metric	chrF / BLEU
Averaging	Equal-weight mean across benchmarks and seeds
DeepSeek V4 Pro	Additional temperature sweep (0.3 and 0.7) for each seed: 3 seeds × 2 temperatures = 6 runs

Scores are equal-weight averages across the benchmarks within a family or skill. Error bars show ±1 standard deviation across seeds. Models are sorted by overall classification accuracy. Translation benchmarks are reported separately because they use different metrics (chrF and BLEU).
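The averaging described above can be sketched as follows, under two assumptions that the page does not confirm: the overall score is first averaged across benchmarks for each seed, and the reported spread is the sample standard deviation of those per-seed means. The function name, data layout, and example accuracies are hypothetical.

```python
from statistics import mean, stdev


def overall_accuracy(acc: dict[str, dict[int, float]]) -> tuple[float, float]:
    """Return (mean, std) of the equal-weight overall accuracy across seeds.

    `acc` maps benchmark name -> {seed: accuracy in [0, 1]}.
    Assumes every benchmark was run with the same set of seeds.
    """
    seeds = sorted(next(iter(acc.values())))
    # Equal-weight mean across benchmarks, computed separately for each seed.
    per_seed = [mean(acc[bench][seed] for bench in acc) for seed in seeds]
    # Sample standard deviation (n - 1) across seeds; this choice is an assumption.
    return mean(per_seed), stdev(per_seed)


# Made-up accuracies for two benchmarks and the three listed seeds.
example = {
    "bench_a": {42: 0.50, 123: 0.60, 777: 0.70},
    "bench_b": {42: 0.70, 123: 0.80, 777: 0.90},
}
overall_mean, overall_std = overall_accuracy(example)
```

With only three seeds the standard deviation is a coarse estimate, so small differences between neighboring models in the ranking should be read with care.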