
Conversation

Michael-Santoro

…application for prompt engineering

Summary

This project is a powerful tool designed for the systematic evaluation and optimization of prompts for Large Language Models (LLMs). It provides a robust Python-based framework specifically for assessing prompt performance against a defined problem, enabling users to identify the most effective prompt variations.

The core purpose of this tool is to abstract away some of the complexities of Google Cloud's Vertex AI platform. By simplifying interactions with the underlying infrastructure, it empowers users to concentrate on the critical task of prompt engineering and performance analysis.

Ultimately, this project enables teams to rigorously evaluate and deploy the best possible prompts for their applications, ensuring higher quality and more reliable results from their LLMs.

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
  • Your account is listed in CODEOWNERS for the file(s).
  • Ensure the tests and linter pass (Run nox -s format from the repository root to format).

Fixes #<issue_number_goes_here> 🦕

@Michael-Santoro requested a review from a team as a code owner on September 19, 2025 at 18:59
Contributor

Summary of Changes

Hello @Michael-Santoro, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a powerful new tool, the LLM EvalKit, which streamlines the entire prompt engineering process. It provides a robust framework for developers to efficiently create, test, evaluate, and optimize LLM prompts, significantly simplifying interactions with Vertex AI and enabling a more focused approach to achieving high-quality, reliable results from large language models.

Highlights

  • Introduction of LLM EvalKit: This pull request introduces the LLM EvalKit, a comprehensive Python-based framework designed for the systematic evaluation and optimization of prompts for Large Language Models (LLMs), abstracting complexities of Google Cloud's Vertex AI platform.
  • Comprehensive Streamlit UI: A new Streamlit application provides a user-friendly interface for managing the entire prompt engineering lifecycle, including prompt creation, editing, testing, dataset management, evaluation, and optimization.
  • Prompt Management Features: Users can now create new prompts, load and edit existing ones, test them with sample inputs, and save changes as new versions, facilitating iterative development and A/B testing (a sketch of a simple test call follows this list).
  • Dataset Creation and Management: The kit includes functionality to upload and manage evaluation datasets (CSV, JSON, JSONL) directly to Google Cloud Storage, organizing them into logical 'datasets' for easy access (a sketch of the upload-and-evaluate flow also follows this list).
  • Evaluation Workbench: A dedicated evaluation workbench supports both human-in-the-loop rating and automated model-based evaluations, allowing for comparison of model responses against ground truth using various metrics.
  • Prompt Optimization Integration: The framework integrates with Vertex AI's prompt optimization capabilities, enabling users to configure and launch automated jobs to refine and improve prompt performance based on specified datasets.
  • Results and Records Browsing: Users can review the results of prompt optimization jobs, compare different prompt versions, and track evaluation outcomes over time through a 'Prompt Records Leaderboard'.
  • Detailed Tutorial Notebook: A new Jupyter notebook provides a step-by-step tutorial for setting up and using the LLM EvalKit, guiding users through prompt management, dataset creation, evaluation, and optimization workflows.
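As a rough illustration of the prompt-testing highlight above, the sketch below sends one sample input to a prompt variation on Vertex AI using the google-genai SDK. The project, location, model name, template, and sample input are assumptions for the example, not values taken from the EvalKit code.

```python
# Minimal sketch (assumed names, not from the EvalKit source): test a prompt
# variation against one sample input via the google-genai SDK on Vertex AI.
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

prompt_template = "You are a support agent. Answer briefly: {question}"
sample_input = {"question": "How do I reset my password?"}

response = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative model name
    contents=prompt_template.format(**sample_input),
)
print(response.text)
```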
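And for the dataset and evaluation highlights, a companion sketch that uploads a JSONL evaluation set to Cloud Storage and scores responses against ground truth with the Vertex AI Gen AI evaluation SDK. The bucket, dataset path, column names, and experiment name are likewise illustrative assumptions; older SDK releases expose the same classes under vertexai.preview.evaluation.

```python
# Minimal sketch (assumed names, not from the EvalKit source): upload an
# evaluation dataset to Cloud Storage, then run a model-based evaluation.
import pandas as pd
import vertexai
from google.cloud import storage
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

PROJECT_ID = "my-project"          # assumption: replace with your project
BUCKET_NAME = "my-eval-datasets"   # assumption: replace with your bucket

# 1. Upload a JSONL dataset into a logical "dataset" folder on GCS.
storage_client = storage.Client(project=PROJECT_ID)
blob = storage_client.bucket(BUCKET_NAME).blob("datasets/support-faq/eval.jsonl")
blob.upload_from_filename("eval.jsonl")

# 2. Score responses against ground truth with automated metrics.
vertexai.init(project=PROJECT_ID, location="us-central1")
eval_df = pd.DataFrame(
    {
        "prompt": ["Summarize the refund policy in one sentence."],
        "response": ["Refunds are issued within 30 days of purchase."],
        "reference": ["Customers may request a refund up to 30 days after purchase."],
    }
)
eval_task = EvalTask(
    dataset=eval_df,
    metrics=[
        "exact_match",
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY,
    ],
    experiment="prompt-eval-demo",
)
result = eval_task.evaluate()
print(result.summary_metrics)
```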

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the LLM EvalKit, a comprehensive framework for managing, evaluating, and optimizing LLM prompts. The contribution is substantial, adding a full Streamlit application with multiple pages, backend logic for interacting with Google Cloud services, and a tutorial notebook. My review focuses on ensuring adherence to the provided style guide, correctness of the implementation, and the usability of the tutorial. I've identified some critical issues, including a missing dependency and a bug in response generation, that need to be addressed. There are also several violations of the style guide regarding SDK usage and recommended model versions, along with some issues in the tutorial notebook that would prevent users from running the application correctly. Overall, this is a great addition, and with these fixes, it will be a very powerful tool.

@Michael-Santoro
Author

Line 74 of llmevalkit/pages/1_Prompt_Management.py includes json_string.replace("’", "'"). This failed the spelling check, but we feel it is important to keep in the code.
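For readers skimming the thread, a minimal illustration of what that call does; the sample string is hypothetical and the surrounding parsing logic is assumed, not copied from the PR.

```python
# Hypothetical input: a prompt pasted from a word processor, where the
# apostrophe arrives as the typographic character U+2019 ("’").
json_string = '{"prompt": "What’s the refund policy?"}'

# Same normalization as the line discussed above: map the curly apostrophe
# to a plain ASCII apostrophe before further processing.
normalized = json_string.replace("’", "'")
print(normalized)  # {"prompt": "What's the refund policy?"}
```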

Collaborator

@holtskinner left a comment


Is there any way this could be condensed down/made easier to follow? Since you already have a Jupyter Notebook for the main flow, could the other utility functions be added into it?

@holtskinner
Collaborator

> Line 74 of llmevalkit/pages/1_Prompt_Management.py includes json_string.replace("’", "'"). This failed the spelling check, but we feel it is important to keep in the code.

Yes, that's alright.

@gericdong
Contributor

@Michael-Santoro please fix the spelling. Thanks.

@gericdong
Contributor

@Michael-Santoro: can you please 1) move the code under /tools, and 2) add a brief summary to the README file describing what it is and what it does for developers? Thanks.

@holtskinner
Collaborator

> @Michael-Santoro please fix the spelling. Thanks.

Don't worry about the current spelling test failure. I'll add an exception for this because the smart quote is required.

Collaborator

@holtskinner left a comment


General question about this sample app: most, if not all, of the steps in this tutorial can be accomplished using the Cloud Console Vertex AI Evaluation page instead of this Streamlit app.

https://console.cloud.google.com/vertex-ai/evaluation/create

I'm not sure this extra UI wrapper for the APIs is needed. I wonder if it would make sense to either restructure the Notebook to show how the API calls for this example would work, or create a tutorial showing how to do this in the cloud console.
Maybe look at updating this tutorial in the docs to follow what you're doing here: https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-genai-console
