The input:
- A GraphQL schema generated using ChatGPT, consisting of types, type unions and interfaces
- 16 prompts, with 3 variants for each ("Specific", "Normal" and "User-like") totalling 48 prompts as input to the LLM
- For each of the 16 prompts, an expected GraphQL query
(We'll publish the actual schema and prompts soon!)
The eval setup:
- A "system prompt" stating: "read the schema file. Write a GraphQL query to answer the user's question. The user asks: {{ user_prompt }}"
- user_prompt is each of the 48 prompts
- Each of the 48 prompts is run against GPT-4.1, GPT-5 and Claude 4
- The eval compares the LLM's output against the expected query using cosine similarity. If the score (between 0 and 1) is above 0.7, the eval is considered passing (a minimal sketch of this scoring step follows this list)
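As a rough illustration, here's what one eval case might look like in Python. The `ask_model()` and `embed()` helpers are hypothetical stand-ins for the actual model and embedding calls, which aren't published yet; only the prompt template and the 0.7 threshold come from the setup above.

```python
import numpy as np

# The system prompt template from the setup above; {user_prompt} is filled in
# with each of the 48 prompts.
SYSTEM_PROMPT = (
    "read the schema file. Write a GraphQL query to answer the user's "
    "question. The user asks: {user_prompt}"
)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_case(user_prompt: str, expected_query: str, ask_model, embed) -> bool:
    """Run one prompt against one model and score it with cosine similarity."""
    prompt = SYSTEM_PROMPT.format(user_prompt=user_prompt)
    llm_query = ask_model(prompt)  # hypothetical call to GPT-4.1 / GPT-5 / Claude 4
    score = cosine_similarity(embed(llm_query), embed(expected_query))
    return score > 0.7  # pass threshold from the setup above
```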
The broad results:
- For the "Specific" prompts: all models passed, with Claude averaging less iterations. GPT-5 returned a lot more tokens compared to the others.
- For the "Normal" prompts: only GPT-5 passed 100% of the prompts. GPT-4.1 had slightly less iterations than Claude 4 (but GPT-5 had more than both).
- For the "User-like" prompts: no model passed 100%. GPT-4.1 did worse than Claude 4 and GPT-5 here. Claude again had the less iterations of all.
This is decent insight into how LLMs are able to write queries! But comparing outputs with cosine similarity is a bit rough, and it would be good to actually inspect the returned GraphQL query. Is it valid according to the schema? How many fields does it select (if we expand fragments and so on) compared to the expected query? A rough sketch of such a check is below.
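Here's a sketch of what that stricter check could look like using the graphql-core Python library: validate the generated query against the schema and count the selected fields. The schema below is a made-up stand-in (the real one isn't published yet), and this simple counter doesn't expand fragments.

```python
from graphql import build_schema, parse, validate, visit, Visitor

# Made-up stand-in schema; the real schema hasn't been published yet.
SCHEMA = build_schema("""
    type Query { orders(status: String): [Order!]! }
    type Order { id: ID! total: Float! customer: Customer! }
    type Customer { id: ID! name: String! }
""")

class FieldCounter(Visitor):
    """Counts Field nodes in a query AST (fragment spreads are not expanded)."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def enter_field(self, node, *_):
        self.count += 1

def check_query(query_text: str):
    """Return (is_valid_against_schema, number_of_selected_fields)."""
    ast = parse(query_text)
    errors = validate(SCHEMA, ast)
    counter = FieldCounter()
    visit(ast, counter)
    return (not errors, counter.count)

# Example: a valid query selecting three fields.
print(check_query('{ orders(status: "open") { id total } }'))  # (True, 3)
```

Expanding fragment spreads before counting would just mean resolving the fragment definitions from the same document first, so the expected and generated queries can be compared field-for-field.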