Initial benchmark for LLMs' ability to write queries #34

@alexdias

The input:

  • A GraphQL schema generated using ChatGPT, consisting of types, type unions and interfaces
  • 16 prompts, with 3 variants for each ("Specific", "Normal" and "User-like") totalling 48 prompts as input to the LLM
  • For each of the 16 prompts, an expected GraphQL query

(We'll publish the actual schema and prompts soon!)

The eval setup:

  • A "system prompt" stating: "read the schema file. Write a GraphQL query to answer the user's question. The user asks: {{ user_prompt }}"
  • user_prompt is each of the 48 prompts
  • Each of the 48 prompts is run against GPT-4.1, GPT-5 and Claude 4
  • The eval compares the LLM's output against the expected query using cosine similarity. If the score (between 0 and 1) is above 0.7, the eval is considered passing (see the sketch after this list)

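For reference, the pass/fail check boils down to something like the sketch below. The embedding backend isn't specified here, so the model name and helper functions are assumptions (sentence-transformers with `all-MiniLM-L6-v2`), not the eval's actual implementation.

```python
# Minimal sketch of the pass/fail check, assuming sentence-transformers
# embeddings; the model name and helpers are illustrative, not the eval's
# actual implementation.
from sentence_transformers import SentenceTransformer, util

PASS_THRESHOLD = 0.7  # similarity scores fall between 0 and 1

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def score_query(generated: str, expected: str) -> float:
    """Cosine similarity between the generated and expected GraphQL queries."""
    embeddings = model.encode([generated, expected], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()


def is_passing(generated: str, expected: str) -> bool:
    return score_query(generated, expected) > PASS_THRESHOLD
```
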
The broad results:

  • For the "Specific" prompts: all models passed, with Claude averaging less iterations. GPT-5 returned a lot more tokens compared to the others.
  • For the "Normal" prompts: only GPT-5 passed 100% of the prompts. GPT-4.1 had slightly less iterations than Claude 4 (but GPT-5 had more than both).
  • For the "User-like" prompts: no model passed 100%. GPT-4.1 did worse than Claude 4 and GPT-5 here. Claude again had the less iterations of all.

This is decent insight into how well LLMs can write queries! But comparing outputs with cosine similarity is a bit rough, and it would be better to inspect the returned GraphQL query itself: is it valid according to the schema? How many fields does it return (once fragments and so on are expanded) compared to the expected query?
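
As a rough starting point for that follow-up, something like the sketch below (using the graphql-core package) could validate the generated query against the schema and count its field selections. The SDL and query strings are made-up placeholders rather than the benchmark's actual schema, and the count doesn't expand fragment spreads, so it's only a first approximation of the "how many fields" comparison.

```python
# Sketch of a schema-aware check using graphql-core. The SDL and query below
# are made-up placeholders, not the benchmark's schema, and the field count
# does not expand fragment spreads.
from graphql import build_schema, parse, validate, visit, Visitor

schema = build_schema("""
type Query { search(term: String!): [Result!]! }
union Result = Book | Author
type Book { title: String! author: Author! }
type Author { name: String! }
""")


def check_query(query_text: str) -> tuple[bool, int]:
    """Return (valid against the schema, number of field selections)."""
    document = parse(query_text)
    errors = validate(schema, document)

    class FieldCounter(Visitor):
        count = 0

        def enter_field(self, node, *_):
            self.count += 1

    counter = FieldCounter()
    visit(document, counter)
    return not errors, counter.count


valid, field_count = check_query("""
query {
  search(term: "graphql") {
    ... on Book { title author { name } }
  }
}
""")
print(valid, field_count)  # True 4
```

Running the same check on both the generated and the expected query would give a validity flag plus comparable field counts for each prompt.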
