papers/zen-code-benchmarks.tex at main · zenlm/papers · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{color}
\usepackage{booktabs}
\usepackage{float}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{geometry}
\geometry{margin=1in}
\definecolor{zenblue}{RGB}{41,121,255}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}

\title{\textbf{Comprehensive Code Intelligence Benchmarking for Zen}\\
\large Technical Report v2025.09}
\author{Zach Kelling \\ Zen LM Research Team\\
\texttt{research@zenlm.org}}
\date{September 2025}

\begin{document}
\maketitle

\begin{abstract}
We present a comprehensive evaluation of Zen models on code intelligence tasks,
spanning function-level synthesis (HumanEval, MBPP), repository-level engineering
(SWE-bench), cross-file completion (RepoBench), code review, test generation, and
security vulnerability scanning. Zen-32B achieves 87.4\% on HumanEval (pass@1),
81.2\% on MBPP, 34.8\% on SWE-bench Verified, and 79.3\% on RepoBench. We provide
detailed analysis by programming language, task type, and repository size, and
introduce a real-world coding agent evaluation protocol that extends beyond
function-level benchmarks. Security analysis shows 68.4\% detection rate on
known CWE vulnerabilities.
\end{abstract}

\section{Introduction}

Code intelligence evaluation has evolved from simple function synthesis to
repository-level tasks requiring understanding of project structure, dependencies,
and conventions. The Zen code benchmarking suite covers this full spectrum:

\begin{itemize}
  \item \textbf{Function-level}: Generate functions from docstrings (HumanEval, MBPP)
  \item \textbf{Repository-level}: Resolve GitHub issues requiring multi-file edits (SWE-bench)
  \item \textbf{Completion}: Cross-file code completion given repository context (RepoBench)
  \item \textbf{Code review}: Identify bugs and suggest improvements
  \item \textbf{Test generation}: Write unit tests achieving high coverage
  \item \textbf{Security}: Detect and explain vulnerabilities (CWE taxonomy)
\end{itemize}

Each dimension tests different capabilities. Function synthesis tests local
reasoning; repository-level tasks test project understanding and code navigation.

\section{Evaluation Infrastructure}

\subsection{Execution Environment}

All code evaluations use sandboxed Docker containers with:
\begin{itemize}
  \item Language runtimes: Python 3.11, Node.js 20, Go 1.22, Rust 1.78, Java 21
  \item Timeout: 10 seconds per test case, 5 minutes per problem
  \item No network access (prevents web API shortcuts)
  \item Deterministic test seeds for reproducibility
\end{itemize}

\subsection{Pass@k Estimation}

Following Chen et al.~\cite{chen2021humaneval}, we use the unbiased pass@k estimator:

\begin{equation}
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
\end{equation}

where $n$ is the number of samples per problem and $c$ is the number that pass all tests.
We use $n=200$ samples for pass@1 estimation and $n=20$ for pass@10.

\section{Function-Level Benchmarks}

\subsection{HumanEval}

\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Model} & \textbf{pass@1} & \textbf{pass@10} & \textbf{pass@100} \\
\midrule
Zen-7B & 78.4\% & 91.2\% & 97.8\% \\
Zen-32B & \textbf{87.4\%} & \textbf{95.1\%} & \textbf{98.9\%} \\
\bottomrule
\end{tabular}
\caption{HumanEval results. 164 Python function synthesis problems.}
\end{table}

\subsubsection{Error Analysis on HumanEval}

Among Zen-7B failures (21.6\% of problems at pass@1):

\begin{table}[H]
\centering
\begin{tabular}{lr}
\toprule
\textbf{Failure Mode} & \textbf{Fraction} \\
\midrule
Edge case handling (empty input, None, overflow) & 38\% \\
Off-by-one errors & 22\% \\
Wrong algorithm (correct on examples, wrong in general) & 18\% \\
Syntax/runtime error & 12\% \\
Incomplete implementation & 10\% \\
\bottomrule
\end{tabular}
\caption{HumanEval failure mode distribution (Zen-7B, pass@1 failures).}
\end{table}

\subsection{MBPP}

MBPP (Mostly Basic Programming Problems) contains 500 Python problems with an
average of 3 test cases per problem:

\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Model} & \textbf{pass@1} & \textbf{pass@3} & \textbf{Difficulty Level} \\
\midrule
Zen-7B & 74.8\% & 82.4\% & Easy: 94.1\%, Hard: 52.3\% \\
Zen-32B & \textbf{81.2\%} & \textbf{88.7\%} & Easy: 97.2\%, Hard: 61.8\% \\
\bottomrule
\end{tabular}
\caption{MBPP results split by difficulty.}
\end{table}

\section{Repository-Level: SWE-bench}

\subsection{Task Definition}

SWE-bench~\cite{jimenez2024swebench} presents real GitHub issues from popular Python
repositories. The model must generate a patch (unified diff) that resolves the issue
and passes the associated test suite. SWE-bench Verified contains 500 carefully
validated issue-patch pairs.

\subsection{Agentic Evaluation Protocol}

We evaluate Zen in an agentic setup where the model can:
\begin{enumerate}
  \item Read repository files
  \item Search for relevant code patterns
  \item Run tests to verify hypotheses
  \item Apply patch edits iteratively
\end{enumerate}

The agent is given a 30-step budget and access to repository via a file system tool.

\begin{algorithm}[H]
\caption{SWE-bench Agentic Resolution}
\begin{algorithmic}[1]
\REQUIRE Issue $I$, repository $R$, step budget $B=30$
\ENSURE Patch $P$
\STATE Read issue $I$ and relevant files from $R$
\FOR{step $= 1 \ldots B$}
  \STATE $a \leftarrow \pi(I, \text{history})$ \COMMENT{Generate next action}
  \STATE \textbf{if} $a$ is ReadFile: read specified file
  \STATE \textbf{if} $a$ is SearchCode: run ripgrep on $R$
  \STATE \textbf{if} $a$ is RunTests: execute test suite, observe output
  \STATE \textbf{if} $a$ is EditFile: apply diff to $R$
  \STATE \textbf{if} $a$ is Submit: generate patch $P$, \textbf{break}
\ENDFOR
\RETURN $P$
\end{algorithmic}
\end{algorithm}

\subsection{Results}

\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Model} & \textbf{Resolved (\%)} & \textbf{Avg. Steps} & \textbf{Token Budget} \\
\midrule
Zen-7B (agentic) & 28.4\% & 18.2 & 32K \\
Zen-32B (agentic) & \textbf{34.8\%} & 21.4 & 64K \\
Zen-7B (non-agentic) & 12.1\% & 1 & 16K \\
\bottomrule
\end{tabular}
\caption{SWE-bench Verified results. Agentic setup substantially outperforms one-shot.}
\end{table}

\subsubsection{Performance by Repository}

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Repository} & \textbf{Issues} & \textbf{Resolved (Zen-32B)} \\
\midrule
django & 106 & 38.7\% \\
scikit-learn & 72 & 36.1\% \\
matplotlib & 52 & 28.8\% \\
pytest & 40 & 42.5\% \\
sympy & 96 & 31.2\% \\
requests & 30 & 40.0\% \\
Flask & 21 & 38.1\% \\
Others & 83 & 27.7\% \\
\bottomrule
\end{tabular}
\caption{SWE-bench Verified resolution rate by repository (Zen-32B).}
\end{table}

\section{Cross-File Completion: RepoBench}

\subsection{Task Definition}

RepoBench evaluates code completion that requires retrieving and using context from
other files in the same repository. Each problem specifies a file, a cursor position,
and a ground-truth completion token that requires cross-file context.

\begin{table}[H]
\centering
\begin{tabular}{lccccc}
\toprule
\textbf{Model} & \textbf{Python} & \textbf{Java} & \textbf{TypeScript} & \textbf{Go} & \textbf{Avg.} \\
\midrule
Zen-7B & 76.4\% & 74.2\% & 72.8\% & 78.1\% & 75.4\% \\
Zen-32B & \textbf{81.8\%} & \textbf{79.4\%} & \textbf{77.3\%} & \textbf{82.6\%} & \textbf{80.3\%} \\
\bottomrule
\end{tabular}
\caption{RepoBench exact match accuracy by language.}
\end{table}

\section{Language-Specific Performance}

We evaluate HumanEval-style benchmarks in multiple programming languages:

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Language} & \textbf{Zen-7B pass@1} & \textbf{Zen-32B pass@1} & \textbf{Benchmark} & \textbf{Problems} \\
\midrule
Python & 78.4\% & 87.4\% & HumanEval & 164 \\
JavaScript & 74.1\% & 83.8\% & HumanEval-JS & 164 \\
TypeScript & 72.3\% & 82.1\% & MultiPL-E & 164 \\
Rust & 68.4\% & 78.3\% & MultiPL-E & 164 \\
Go & 71.2\% & 80.9\% & MultiPL-E & 164 \\
Java & 73.8\% & 83.2\% & MultiPL-E & 164 \\
C++ & 75.1\% & 84.3\% & MultiPL-E & 164 \\
\midrule
Average & 73.3\% & 82.9\% & -- & -- \\
\bottomrule
\end{tabular}
\caption{Cross-language code synthesis performance.}
\end{table}

Rust and Go show lower performance due to stricter memory safety requirements and
more complex type systems. Performance correlates with training data volume per language.

\section{Test Generation}

We evaluate test generation quality by:
\begin{enumerate}
  \item Giving the model a function signature and implementation
  \item Asking the model to write unit tests
  \item Measuring line coverage and mutation score
\end{enumerate}

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Line Coverage} & \textbf{Branch Coverage} & \textbf{Mutation Score} & \textbf{Tests/Function} \\
\midrule
Zen-7B & 78.3\% & 71.4\% & 62.1\% & 4.8 \\
Zen-32B & \textbf{84.1\%} & \textbf{77.8\%} & \textbf{68.4\%} & 6.2 \\
\bottomrule
\end{tabular}
\caption{Test generation quality metrics on 500 Python functions.}
\end{table}

\section{Security Vulnerability Detection}

\subsection{Benchmark Design}

We construct a security benchmark of 500 code snippets with known CWE vulnerabilities,
drawn from CVE database examples and deliberately introduced vulnerabilities:

\begin{table}[H]
\centering
\begin{tabular}{lrr}
\toprule
\textbf{CWE Category} & \textbf{Samples} & \textbf{Zen-32B Detection} \\
\midrule
CWE-89: SQL Injection & 80 & 91.2\% \\
CWE-79: XSS & 60 & 88.3\% \\
CWE-78: OS Command Injection & 50 & 84.0\% \\
CWE-22: Path Traversal & 50 & 82.0\% \\
CWE-125: Out-of-bounds Read & 60 & 61.7\% \\
CWE-476: NULL Pointer Deref & 40 & 57.5\% \\
CWE-416: Use After Free & 40 & 52.5\% \\
CWE-190: Integer Overflow & 60 & 48.3\% \\
CWE-798: Hardcoded Credentials & 60 & 83.3\% \\
\midrule
Overall & 500 & \textbf{68.4\%} \\
\bottomrule
\end{tabular}
\caption{Security vulnerability detection by CWE category.}
\end{table}

Zen models perform well on injection vulnerabilities (SQL, XSS, command injection)
which are pattern-recognizable from training data. Memory safety issues (use-after-free,
integer overflow) in C/C++ are harder, reflecting the lower volume of C security
examples in training data relative to Python web security content.

\section{Analysis}

\subsection{Scaling Laws for Code}

Code performance scales with model size following a similar law to natural language:

\begin{equation}
\text{HumanEval}(N) \approx 87.4 - 31.2 \cdot N^{-0.12}
\end{equation}

where $N$ is the number of active parameters. The exponent (0.12) is slightly larger
than for MMLU (0.095), suggesting code tasks benefit more from scale than general
knowledge retrieval.

\subsection{Context Length and Repository Tasks}

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Context Budget} & \textbf{RepoBench Accuracy} & \textbf{SWE-bench Resolved} \\
\midrule
8K & 71.2\% & 18.3\% \\
16K & 76.4\% & 26.1\% \\
32K & 79.3\% & 32.4\% \\
64K & 80.1\% & 34.8\% \\
128K & 80.4\% & 35.2\% \\
\bottomrule
\end{tabular}
\caption{Impact of context budget on repository-level tasks (Zen-32B).}
\end{table}

Repository tasks show strong gains up to 64K context, with diminishing returns beyond.
Most repositories contain under 100 relevant files; key context fits within 64K tokens.

\section{Conclusion}

Zen models achieve strong performance across the code intelligence spectrum, from
87.4\% HumanEval pass@1 to 34.8\% SWE-bench resolution. The gap between function-level
and repository-level performance reflects the additional challenges of project
navigation, multi-file reasoning, and test-driven development required for real-world
coding tasks. Security vulnerability detection shows the strongest performance on
injection-class vulnerabilities; memory safety in low-level languages remains a
frontier for improvement.

\bibliographystyle{plain}
\begin{thebibliography}{99}
\bibitem{chen2021humaneval} Chen et al. (2021). Evaluating Large Language Models Trained on Code. \textit{arXiv:2107.03374}.
\bibitem{jimenez2024swebench} Jimenez et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? \textit{ICLR}.
\bibitem{liu2023repobench} Liu et al. (2023). RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. \textit{arXiv:2306.03091}.
\bibitem{austin2021mbpp} Austin et al. (2021). Program Synthesis with Large Language Models. \textit{arXiv:2108.07732}.
\bibitem{cassano2022multipl} Cassano et al. (2022). MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. \textit{arXiv:2208.08227}.
\end{thebibliography}

\end{document}