Basque LLM Evaluation

A public benchmark dashboard for comparing the performance of local LLMs on Basque-language tasks.

Table 1. Classification results

# | Model | Quantization | Overall | Evals (grouped)
Accuracy is reported as mean ± standard deviation across random seeds.

Figures — Classification

Figure 1. Overall accuracy by model

Figure 2. Accuracy profile by classification eval

Skill cards — Best models by skill

Winner score and margin over the runner-up for each skill.
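
As a rough illustration of how a winner score and its margin over the runner-up could be derived from per-model skill scores, here is a minimal Python sketch. The function name `winner_and_margin` and the model names and scores are hypothetical and are not taken from the dashboard.

```python
from typing import Dict, Tuple


def winner_and_margin(scores: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (winner name, winner score, margin over the runner-up)."""
    if len(scores) < 2:
        raise ValueError("need at least two models to compute a margin")
    # Sort models from highest to lowest skill score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_name, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_name, best_score, best_score - second_score


# Hypothetical skill scores, for illustration only.
example_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
name, score, margin = winner_and_margin(example_scores)
print(f"{name}: {score:.2f} (+{margin:.2f} over runner-up)")
```

A small margin suggests the top two models are close on that skill, so the ranking may not be stable across seeds.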

Skill view — Ranking by selected skill

# | Model | Skill score | Benchmarks used

Figure 3. Translation heatmap (chrF / BLEU)

Cell color is scaled by score within each metric (chrF and BLEU independently). Higher is better.
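
One way to scale cell color within each metric independently is min-max normalization per metric. The sketch below assumes that approach; the dashboard may use a different scaling, and the function name `scale_within_metric` and the scores are hypothetical.

```python
from typing import Dict, List


def scale_within_metric(values: List[float]) -> List[float]:
    """Map scores of one metric to [0, 1]; higher score gives higher intensity."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores are equal: use a neutral mid intensity for every cell.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical scores for three models, one list per metric.
heatmap_scores: Dict[str, List[float]] = {
    "chrF": [40.0, 50.0, 60.0],
    "BLEU": [10.0, 15.0, 30.0],
}

# Each metric is scaled on its own range, so chrF and BLEU never share a scale.
intensity = {
    metric: scale_within_metric(vals) for metric, vals in heatmap_scores.items()
}
print(intensity)
```

Because the two metrics are scaled separately, a given color in the chrF column and the same color in the BLEU column do not correspond to the same raw score.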

Figures — Timeline

Figure 4. Overall accuracy by release date

Evaluation protocol

Family | Benchmark | What is measured | Metric | Label space

Methodology

Aspect | Detail
Models evaluated | 11
Classification benchmarks | 12 (BasqueGLUE, LatxaEval, EusTrivia, XNLIeu, MMLU, BertaQA, MGSM)
Translation benchmarks | 4 (FLORES: EU↔EN, EU↔ES)
Items per benchmark | 80
Classification items per model | 960 (12 × 80)
Translation items per model | 320 (4 × 80)
Seeds | 42, 123, 777
Evaluations per model (classification) | 3 seeds × 80 items = 240 per benchmark
Classification metric | Accuracy
Translation metric | chrF / BLEU
Averaging | Equal-weight mean across benchmarks and seeds
DeepSeek V4 Pro | Additional temperature sweep (0.3 / 0.7) per seed, totalling 6 runs
Scores are computed as equal-weight averages across benchmarks within a family or skill. Error bars show ±1 standard deviation across seeds. Models are sorted by overall classification accuracy. Translation benchmarks are reported separately because they use different metrics (chrF/BLEU).
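
The averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dashboard's actual code: it takes the equal-weight mean over benchmarks for each seed, then reports the mean and the sample standard deviation across seeds. Whether the dashboard uses the sample or the population standard deviation is not stated, and the benchmark names and accuracies are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def aggregate(acc_by_seed: Dict[int, Dict[str, float]]) -> Tuple[float, float]:
    """Return (overall mean, std across seeds) from per-seed, per-benchmark accuracy."""
    # Equal-weight mean across benchmarks, computed separately for each seed.
    per_seed = [mean(bench.values()) for bench in acc_by_seed.values()]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(per_seed), stdev(per_seed)


# Hypothetical accuracies keyed by seed, then by benchmark.
acc = {
    42: {"bench_a": 0.60, "bench_b": 0.80},
    123: {"bench_a": 0.70, "bench_b": 0.80},
    777: {"bench_a": 0.65, "bench_b": 0.75},
}
overall, spread = aggregate(acc)
print(f"{overall:.3f} ± {spread:.3f}")
```

When every seed covers the same benchmarks, the mean of the per-seed means equals the equal-weight mean over all benchmark and seed pairs.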