# Web Codegen Scorer

**Web Codegen Scorer** is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).

You can use this tool to make evidence-based decisions relating to AI-generated code. For example:

* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
* ⚖️ Compare the quality of code produced by different models.
* 📈 Monitor generated code quality over time as models and agents evolve.

Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_ code and relies primarily on well-established measures of code quality.

## Features

* ⚙️ Configure your evaluations with different models, frameworks, and tools.
* ✍️ Specify system instructions and add MCP servers.
* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and coding best practices. (More built-in checks coming soon!)
* 🔧 Automatically attempt to repair issues detected during code generation.
* 📊 View and compare results with an intuitive report viewer UI.

## Setup

1. **Install the package:**

   ```bash
   npm install -g web-codegen-scorer
   ```

2. **Set up your API keys:**

   In order to run an eval, you have to specify API keys for the relevant providers as environment variables:

   ```bash
   export GEMINI_API_KEY="YOUR_API_KEY_HERE" # If you're using Gemini models
   export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
   export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
   ```
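
   If you don't want to re-export these variables in every terminal session, you can persist them in your shell profile. A minimal sketch, assuming a bash shell and `~/.bashrc` (adjust for zsh or other shells):

   ```bash
   # Append the key to your shell profile so new terminals pick it up,
   # then reload the profile in the current session.
   echo 'export GEMINI_API_KEY="YOUR_API_KEY_HERE"' >> ~/.bashrc
   source ~/.bashrc
   ```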

3. **Run an eval:**

   You can run your first eval using our Angular example with the following command:

   ```bash
   web-codegen-scorer eval --env=angular-example
   ```

4. (Optional) **Set up your own eval:**

   If you want to set up a custom eval, instead of using our built-in examples, you can run the following command, which will guide you through the process:

   ```bash
   web-codegen-scorer init
   ```

You can customize the `web-codegen-scorer eval` script with the following flags:

- `--env=<path>` (alias: `--environment`): (**Required**) Specifies the path from which to load the environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`

- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`

- `--runner=<name>`: Specifies the runner to use to execute the eval. Supported runners are `genkit` (default) or `gemini-cli`.

- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation. See the example after this list.
  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.web-codegen-scorer/llm-output`.
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.

- `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`

- `--output-directory=<name>` (alias: `--output-dir`): Specifies which directory to output the generated code under, which is useful for debugging. By default, the code will be generated in a temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`

- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`

- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`

- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`

- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default, a random sample (controlled by `--limit`) will be taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`

- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`

- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`

- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=<config path>`

- `--help`: Prints out usage information about the script.
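
For example, a typical run might combine several of these flags: generate code once while keeping the output in a known directory, then re-run the assessment in local mode while debugging. This is an illustrative sketch; the environment path and labels are the placeholder values from the examples above.

```bash
# First run: calls the LLM, keeps the generated code in ./test-output,
# and tags the run so it is easy to find later.
web-codegen-scorer eval \
  --env=foo/bar/my-env.js \
  --model=gemini-2.5-flash \
  --limit=10 \
  --output-dir=test-output \
  --labels my-label another-label

# Subsequent runs: reuse the previously generated initial code from
# .web-codegen-scorer/llm-output instead of calling the LLM again,
# which avoids LLM costs while iterating on checks.
web-codegen-scorer eval --local --env=foo/bar/my-env.js
```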

### Additional configuration options

- [Environment config reference](./docs/environment-reference.md)
- [How to set up a new model?](./docs/model-setup.md)

## Local development

If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.

Once they're installed, you can run the following commands:

* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.
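
For example, a local development loop might look like the following sketch. It assumes the from-source `eval` script accepts the same flags as the published CLI (with pnpm forwarding arguments after `--`), and reuses the built-in `angular-example` environment from the setup section.

```bash
# Install dependencies after cloning the repo
pnpm install

# Run an eval from source against the built-in Angular example
pnpm run eval -- --env=angular-example

# Inspect the results in the report viewer
pnpm run report
```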

## FAQ

### Who built this tool?

This tool is built by the Angular team at Google.

### Does this tool only work for Angular code or Google models?

No! You can use this tool with any web library or framework (or none at all), as well as any model.

### Why did you build this tool?

As more and more developers reach for LLM-based tools to create and modify code, we wanted to be able to empirically measure the effect of different factors on the quality of generated code. While many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the specific quality metrics we cared about.

In the absence of such a tool, we found that many developers based their judgements about codegen with different models, frameworks, and tools on loosely structured trial and error. In contrast, Web Codegen Scorer gives us a platform to measure codegen across different configurations with consistency and repeatability.

### Will you add more features over time?

Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.

Our roadmap includes:

* Including _interaction testing_ in the rating, to ensure the generated code performs any requested behaviors.
* Measuring Core Web Vitals.
* Measuring the effectiveness of LLM-driven edits on an existing codebase.