Benchmark Method
Each model receives the same prompt pack and task set.
- Models are expected to return code-only outputs.
- Normie-behavior indicators are counted per task.
- A task counts as based when its normie count is zero.
- The score reports both the based percentage and the equivalent x-rating.
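
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the function names `is_based` and `based_percentage` and the sample counts are hypothetical, and per-task normie counts are assumed to arrive as plain integers. The x-rating conversion is left out because its formula is not given here.

```python
from typing import Sequence


def is_based(normie_count: int) -> bool:
    """A task is based when no normie-behavior indicators were counted."""
    return normie_count == 0


def based_percentage(normie_counts: Sequence[int]) -> float:
    """Return the percentage of tasks whose normie count is zero."""
    if not normie_counts:
        raise ValueError("at least one task is required")
    based = sum(1 for count in normie_counts if is_based(count))
    return 100.0 * based / len(normie_counts)


# Hypothetical per-task normie counts for one model (made-up data).
example_counts = [0, 2, 0, 1]
print(f"{based_percentage(example_counts):.1f}% based")  # prints "50.0% based"
```

Because a single indicator is enough to disqualify a task, the percentage depends only on how many tasks have a count of zero, not on how large the non-zero counts are.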