Benchmark shared UI schema with multiple popular models including non-Google ones

Let's make sure the schema we're working with performs well with multiple models from Google, Anthropic, OpenAI, etc.

The things to test are:
- The model accepts the schema in structured output mode, i.e. the schema doesn't use features the provider doesn't support
- The model generates valid UI responses for a range of sample UI use cases, where validity means:
   - The response only refers to widgets that exist in the catalog
   - The tree structure is valid, including ID references, etc.
   - The UI structure is reasonable given the use case. This is harder to evaluate objectively, but we can check that it includes key pieces of information and uses key widgets
- The model can do the above in larger-scale use cases, e.g. with a large number of custom widgets (say 50-100), or with longer, deeply nested outputs such as a full-screen view with a lot of nested components.
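
The two objective validity checks (catalog membership and tree structure) could be automated along these lines. This is only a rough sketch: it assumes the response is a flat list of components that reference their children by ID, and the field names `id`, `widget`, and `children`, as well as the function name `validate_ui_response`, are hypothetical and not taken from the actual schema.

```python
from typing import Any


def validate_ui_response(
    components: list[dict[str, Any]],
    catalog: set[str],
    root_id: str,
) -> list[str]:
    """Return a list of problems found; an empty list means the response passed."""
    errors: list[str] = []
    by_id: dict[str, dict[str, Any]] = {}

    # Pass 1: index components by ID and check each widget against the catalog.
    for comp in components:
        cid = comp.get("id")  # hypothetical field name
        if cid is None:
            errors.append("component without id")
            continue
        if cid in by_id:
            errors.append(f"duplicate id: {cid}")
        by_id[cid] = comp
        widget = comp.get("widget")  # hypothetical field name
        if widget not in catalog:
            errors.append(f"unknown widget {widget!r} on {cid}")

    if root_id not in by_id:
        errors.append(f"missing root: {root_id}")
        return errors

    # Pass 2: walk from the root, flagging dangling references and any
    # component reached more than once (a cycle or a shared child).
    visited: set[str] = set()
    stack = [root_id]
    while stack:
        cid = stack.pop()
        if cid in visited:
            errors.append(f"component reached more than once: {cid}")
            continue
        visited.add(cid)
        for child in by_id[cid].get("children", []):  # hypothetical field name
            if child not in by_id:
                errors.append(f"dangling reference {child!r} from {cid}")
            else:
                stack.append(child)

    # Anything never reached from the root is an orphan.
    for cid in by_id:
        if cid not in visited:
            errors.append(f"unreachable component: {cid}")
    return errors
```

Running the same check over every model's output for every sample use case would give a comparable pass rate per model. The "reasonable UI structure" criterion is not covered here and would need a separate check, e.g. asserting that specific key widgets appear.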

Benchmark shared UI schema with multiple popular models including non-Google ones #312