Initial benchmark for LLMs' ability to write queries #34

@alexdias

The input:

  • A GraphQL schema generated using ChatGPT, consisting of types, type unions and interfaces
  • 16 prompts, with 3 variants for each ("Specific", "Normal" and "User-like") totalling 48 prompts as input to the LLM
  • For each of the 16 prompts, an expected GraphQL query

(We'll publish the actual schema and prompts soon!)

The eval setup:

  • A "system prompt" stating: "read the schema file. Write a GraphQL query to answer the user's question. The user asks: {{ user_prompt }}"
  • user_prompt is each of the 48 prompts
  • Each of the 48 prompts is run against GPT-4.1, GPT-5 and Claude 4
  • The eval compares the LLM's output against the expected query using cosine similarity. If the score (between 0 and 1) is above 0.7, the eval is considered passing (see the sketch after this list)

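For reference, the pass/fail check boils down to something like the sketch below. The embedding backend isn't specified here, so the model name and helper functions are assumptions (sentence-transformers with `all-MiniLM-L6-v2`), not the eval's actual implementation.

```python
# Minimal sketch of the pass/fail check, assuming sentence-transformers
# embeddings; the model name and helpers are illustrative, not the eval's
# actual implementation.
from sentence_transformers import SentenceTransformer, util

PASS_THRESHOLD = 0.7  # similarity scores fall between 0 and 1

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def score_query(generated: str, expected: str) -> float:
    """Cosine similarity between the generated and expected GraphQL queries."""
    embeddings = model.encode([generated, expected], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()


def is_passing(generated: str, expected: str) -> bool:
    return score_query(generated, expected) > PASS_THRESHOLD
```
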
The broad results:

  • For the "Specific" prompts: all models passed, with Claude averaging less iterations. GPT-5 returned a lot more tokens compared to the others.
  • For the "Normal" prompts: only GPT-5 passed 100% of the prompts. GPT-4.1 had slightly less iterations than Claude 4 (but GPT-5 had more than both).
  • For the "User-like" prompts: no model passed 100%. GPT-4.1 did worse than Claude 4 and GPT-5 here. Claude again had the less iterations of all.

This is decent insight into how well LLMs can write queries! But comparing outputs with cosine similarity is a bit rough, and it would be better to inspect the returned GraphQL query itself: is it valid according to the schema? How many fields does it return (once fragments and so on are expanded) compared to the expected query?
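
As a rough starting point for that follow-up, something like the sketch below (using the graphql-core package) could validate the generated query against the schema and count its field selections. The SDL and query strings are made-up placeholders rather than the benchmark's actual schema, and the count doesn't expand fragment spreads, so it's only a first approximation of the "how many fields" comparison.

```python
# Sketch of a schema-aware check using graphql-core. The SDL and query below
# are made-up placeholders, not the benchmark's schema, and the field count
# does not expand fragment spreads.
from graphql import build_schema, parse, validate, visit, Visitor

schema = build_schema("""
type Query { search(term: String!): [Result!]! }
union Result = Book | Author
type Book { title: String! author: Author! }
type Author { name: String! }
""")


def check_query(query_text: str) -> tuple[bool, int]:
    """Return (valid against the schema, number of field selections)."""
    document = parse(query_text)
    errors = validate(schema, document)

    class FieldCounter(Visitor):
        count = 0

        def enter_field(self, node, *_):
            self.count += 1

    counter = FieldCounter()
    visit(document, counter)
    return not errors, counter.count


valid, field_count = check_query("""
query {
  search(term: "graphql") {
    ... on Book { title author { name } }
  }
}
""")
print(valid, field_count)  # True 4
```

Running the same check on both the generated and the expected query would give a validity flag plus comparable field counts for each prompt.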
