
Commit 5b59fdb

crisbeto, devversion and AndrewKushnir committed

feat: add web-codegen-scorer

Adds the initial implementation of the Web Codegen Scorer.

Co-authored-by: Paul Gschwendtner <[email protected]>
Co-authored-by: Andrew Kushnir <[email protected]>

1 parent 51f89df · commit 5b59fdb

File tree: 236 files changed (+50425, -0 lines)


.github/workflows/build.yml

Lines changed: 21 additions & 0 deletions
```yaml
on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
      - uses: actions/setup-node@d7a11313b581b306c961b506cfc8971208bb03f6
        with:
          node-version: 24
      - uses: pnpm/action-setup@f2b2b233b538f500472c7274c7012f57857d8ce0
        with:
          version: 9
      - run: pnpm i --frozen-lockfile
      - run: pnpm check-format
      - run: pnpm release-build
```

.gitignore

Lines changed: 15 additions & 0 deletions
```
dist
.tmp/
.report-migration-backups
.DS_Store
.vscode
safety-web.log
node_modules

report-app/node_modules
report-app/dist
report-app/.angular
report-app/.vscode
report-app/.reports

.web-codegen-scorer
```

.prettierrc.json

Lines changed: 15 additions & 0 deletions
```json
{
  "semi": true,
  "trailingComma": "es5",
  "singleQuote": true,
  "printWidth": 80,
  "tabWidth": 2,
  "overrides": [
    {
      "files": "*.html",
      "options": {
        "parser": "angular"
      }
    }
  ]
}
```

README.md

Lines changed: 93 additions & 0 deletions
# Web Codegen Scorer

This project is a tool designed to assess the quality of front-end code generated by Large Language Models (LLMs).

## Documentation directory

- [Environment config reference](./docs/environment-reference.md)
- [How to set up a new model?](./docs/model-setup.md)

## Setup

1. **Install the package:**

   ```bash
   npm install -g web-codegen-scorer
   ```

2. **Set up your API keys:**
   In order to run an eval, you have to specify API keys for the relevant providers as environment variables:

   ```bash
   export GEMINI_API_KEY="YOUR_API_KEY_HERE" # If you're using Gemini models
   export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
   export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
   ```

3. **Run an eval:**
   You can run your first eval using our Angular example with the following command:

   ```bash
   web-codegen-scorer eval --env=angular-example
   ```

4. (Optional) **Set up your own eval:**
   If you want to set up a custom eval, instead of using our built-in examples, you can run the following
   command, which will guide you through the process:

   ```bash
   web-codegen-scorer init
   ```

## Command-line flags

You can customize the `web-codegen-scorer eval` script with the following flags:

- `--env=<path>` (alias: `--environment`): (**Required**) Specifies the path from which to load the environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`

- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`

- `--runner=<name>`: Specifies the runner to use to execute the eval. Supported runners are `genkit` (default) and `gemini-cli`.

- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the `.llm-output` directory (e.g., `.llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation.
  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.llm-output`.
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.

- `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`

- `--output-directory=<name>` (alias: `--output-dir`): Specifies the directory under which to output the generated code, which is useful for debugging. By default the code is generated in a temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`

- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`

- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`

- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`

- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default a random sample (controlled by `--limit`) is taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`

- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`

- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`

- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=<config path>`

- `--help`: Prints out usage information about the script.
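Several of these flags can be combined in a single invocation. The command below is only a sketch; the environment path, report name and labels are placeholders:

```bash
web-codegen-scorer eval \
  --env=foo/bar/my-env.js \
  --model=gemini-2.5-flash \
  --limit=10 \
  --concurrency=3 \
  --report-name=my-custom-report \
  --labels my-label another-label
```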
## Local development

If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.
Once they're installed, you can run the following commands:

* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.

docs/environment-reference.md

Lines changed: 183 additions & 0 deletions
# Environment configuration reference

Environments are configured by creating a `config.js` that exposes an object that satisfies the
`EnvironmentConfig` interface. This document covers all the possible options in `EnvironmentConfig`
and what they do.
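As a point of reference, a minimal `config.js` might look like the sketch below, adapted from the Angular example included in this commit (the paths are placeholders for your own files):

```js
import { getBuiltInRatings } from 'web-codegen-scorer';

/** @type {import("web-codegen-scorer").EnvironmentConfig} */
export default {
  displayName: 'My environment',
  clientSideFramework: 'angular',
  sourceDirectory: './project',
  ratings: getBuiltInRatings(),
  generationSystemPrompt: './system-instructions.md',
  executablePrompts: ['./prompts/**/*.md'],
  packageManager: 'npm',
};
```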
## Required properties

These properties all have to be specified in order for the environment to function.

### `displayName`

Human-readable name that will be shown in eval reports about this environment.

### `id`

Unique ID for the environment. If omitted, one will be generated from the `displayName`.

### `clientSideFramework`

ID of the client-side framework that the environment will be running, for example `angular`.

### `ratings`

An array defining the ratings that will be executed as a part of the evaluation.
The ratings determine the score that will be assigned to the test run.
Currently we support the following types of ratings:

- `PerBuildRating` - assigns a score based on the build result of the generated code, e.g.
  "Does it build on the first run?" or "Does it build after X repair attempts?"
- `PerFileRating` - assigns a score based on the content of individual files generated by the LLM.
  Can be run either against all file types by setting the `filter` to
  `PerFileRatingContentType.UNKNOWN` or against specific files.
- `LLMBasedRating` - rates the generated code by asking an LLM to assign a score to it,
  e.g. "Does this app match the specified prompts?"

### `packageManager`

Name of the package manager to use to install dependencies for the evaluated code.
Supports `npm`, `pnpm` and `yarn`. Defaults to `npm`.

### `generationSystemPrompt`

Relative path to the system instructions that should be passed to the LLM when generating code.

### `repairSystemPrompt`

Relative path to the system instructions that should be passed to the LLM when repairing failures.

### `executablePrompts`

Configures the prompts that should be evaluated against the environment. Can contain either strings
which represent glob patterns pointing to text files with the prompt's text
(e.g. `./prompts/**/*.md`) or `MultiStepPrompt` objects ([see below](#multi-step-prompts)).
The prompts can be shared between environments
(e.g. `executablePrompts: ['../some-other-env/prompts/**/*.md']`).

### `classifyPrompts`

When enabled, the system prompts for this environment won't be included in the final report.
This is useful when evaluating confidential code.

### `skipInstall`

Whether to skip installing dependencies during the eval run. This can be useful if you've already
ensured that all dependencies are installed through something like pnpm workspaces.

### Prompt templating

Prompts are typically stored in `.md` files. We support the following template syntax inside of
these files in order to augment the prompt and reduce boilerplate:

- `{{> embed file='../path/to/file.md' }}` - embeds the content of the specified file in the
  current one.
- `{{> contextFiles '**/*.foo' }}` - specifies files that should be passed to the LLM as context
  when the prompt is executed. Should be a comma-separated string of glob patterns **within** the
  environment's project code. E.g. `{{> contextFiles '**/*.ts, **/*.html' }}` will pass all `.ts`
  and `.html` files as context.
- `{{CLIENT_SIDE_FRAMEWORK_NAME}}` - inserts the name of the client-side framework of the current
  environment.
- `{{FULL_STACK_FRAMEWORK_NAME}}` - inserts the name of the full-stack framework of the current
  environment.
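For example, a single prompt file could combine these directives as follows. This is only an illustrative sketch; the embedded file path and the prompt text are placeholders:

```
{{> embed file='../shared/requirements.md' }}
{{> contextFiles '**/*.ts, **/*.html' }}

Build a small to-do application using {{CLIENT_SIDE_FRAMEWORK_NAME}}.
```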
### Prompt-specific ratings
85+
86+
If you want to run a set of ratings against a specific prompt, you can set an object literal
87+
in the `executablePrompts` array, instead of a string:
88+
89+
```ts
90+
executablePrompts: [
91+
// Runs only with the environment-level ratings.
92+
'./prompts/foo/*.md',
93+
94+
// Runs the ratings specific to the `contact-form.md`, as well as the environment-level ones.
95+
{
96+
path: './prompts/bar/contact-form.md',
97+
ratings: contactFormSpecificRatings,
98+
},
99+
];
100+
```
101+
102+
### Multi-step prompts
103+
104+
Multi-step prompts are prompts meant to evaluate workflows made up of one or more stages.
105+
Steps execute one after another **inside the same directory**, but are rated individually and
106+
snapshots after each step are stored in the final report. You can create a multi-step prompt by
107+
passing an instrance of the `MultiStepPrompt` class into the `executablePrompts` array, for example:
108+
109+
```ts
110+
executablePrompts: [
111+
new MultiStepPrompt('./prompts/about-page', {
112+
'step-1': ratingsForFirstStep,
113+
'step-2': [...ratingsForFirstStep, ratingsForSecondStep],
114+
}),
115+
];
116+
```
117+
118+
The first parameter is the directory from which to resolve the individual step prompts.
119+
All files in the directory **have to be named `step-{number}.md`**, for example:
120+
121+
**my-env/prompts/about-page/step-1.md:**
122+
123+
```
124+
Create an "About us" page.
125+
```
126+
127+
**my-env/prompts/about-page/step-2.md:**
128+
129+
```
130+
Add a contact form to the "About us" page
131+
```
132+
133+
**my-env/prompts/about-page/step-3.md:**
134+
135+
```
136+
Make it so submitting the contact form redirects the user back to the homepage.
137+
```
138+
139+
The second parameter of `MultiStepPrompt` defines ratings that should be run only against specific
140+
steps. The key is the name of the step (e.g. `step-2`) while the value are the ratings that should
141+
run against it.
142+
143+
## Optional properties
144+
145+
These properties aren't required for the environment to run, but can be used to configure it further.
146+
147+
### `sourceDirectory`
148+
149+
Project into which the LLM-generated files will be placed, built, executed and evaluated.
150+
Can be an entire project or a handful of files that will be merged with the
151+
`projectTemplate` ([see below](#projecttemplate))
152+
153+
### `projectTemplate`
154+
155+
Used for reducing the boilerplate when setting up an environment, `projectTemplate` specifies the
156+
path of the project template that will be merged together with the files from `sourceDirectory` to
157+
create the final project structure that the evaluation will run against.
158+
159+
For example, if the config has `projectTemplate: './templates/angular', sourceDirectory: './project'`,
160+
the eval runner will copy the files from `./templates/angular` into the output directory
161+
and then apply the files from `./project` on top of them, merging directories and replacing
162+
overlapping files.
163+
164+
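As a rough illustration of that merge (the file names below are hypothetical):

```
# Inputs:
templates/angular/package.json
templates/angular/src/main.ts
project/package.json            # overlaps with the template copy
project/src/app/app.ts

# Merged output directory:
package.json                    # taken from ./project (overrides the template copy)
src/main.ts                     # taken from ./templates/angular
src/app/app.ts                  # taken from ./project
```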
### `fullStackFramework`

Name of the full-stack framework that is used in the evaluation, in addition to the
`clientSideFramework`. If omitted, the `fullStackFramework` will be set to the same value as
the `clientSideFramework`.

### `mcpServers`

IDs of Model Context Protocol servers that will be started and exposed to the LLM as a part of
the evaluation.

### `buildCommand`

Command used to build the generated code as a part of the evaluation.
Defaults to `<package manager> run build`.

### `serveCommand`

Command used to start a local dev server as a part of the evaluation.
Defaults to `<package manager> run start --port 0`.
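As an illustration, an environment's `config.js` could override some of these optional properties. The command strings below are placeholders rather than recommended values:

```js
/** @type {import("web-codegen-scorer").EnvironmentConfig} */
export default {
  // ...required properties as described above...
  projectTemplate: './templates/angular',    // merged with `sourceDirectory`
  buildCommand: 'npm run build',             // placeholder override
  serveCommand: 'npm run start -- --port 0', // placeholder override
};
```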

docs/model-setup.md

Lines changed: 9 additions & 0 deletions
# How to set up a new LLM?

If you want to test out a model that isn't yet available in the runner, you can add
support for it by following these steps:

1. Ensure that the provider of the model is supported by Genkit.
2. Find the provider for the model in `runner/codegen/genkit/providers`. If the provider hasn't been implemented yet, do so by creating a new `GenkitModelProvider` and adding it to the `MODEL_PROVIDERS` in `runner/genkit/models.ts`.
3. Add your model to the `GenkitModelProvider` configs.
4. Done! 🎉 You can now run your model by passing `--model=<your model ID>`.
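For example, once the model is registered, you could evaluate it against the built-in Angular example as follows (the model ID below is a placeholder):

```bash
web-codegen-scorer eval --model=my-provider/my-new-model --env=angular-example
```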
Lines changed: 12 additions & 0 deletions
```js
import { getBuiltInRatings } from 'web-codegen-scorer';

/** @type {import("web-codegen-scorer").EnvironmentConfig} */
export default {
  displayName: 'Angular (example)',
  clientSideFramework: 'angular',
  sourceDirectory: './project',
  ratings: getBuiltInRatings(),
  generationSystemPrompt: './system-instructions.md',
  executablePrompts: ['../../prompts/**/*.md'],
  packageManager: 'npm',
};
```
Lines changed: 42 additions & 0 deletions
```
# See https://docs.github.com/get-started/getting-started-with-git/ignoring-files for more about ignoring files.

# Compiled output
/dist
/tmp
/out-tsc
/bazel-out

# Node
/node_modules
npm-debug.log
yarn-error.log

# IDEs and editors
.idea/
.project
.classpath
.c9/
*.launch
.settings/
*.sublime-workspace

# Visual Studio Code
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
.history/*

# Miscellaneous
/.angular/cache
.sass-cache/
/connect.lock
/coverage
/libpeerconnection.log
testem.log
/typings

# System files
.DS_Store
Thumbs.db
```
