
Conversation

Michael-Santoro

…application for prompt engineering

Summary

This project is a powerful tool designed for the systematic evaluation and optimization of prompts for Large Language Models (LLMs). It provides a robust Python-based framework specifically for assessing prompt performance against a defined problem, enabling users to identify the most effective prompt variations.

The core purpose of this tool is to abstract away some of the complexities of Google Cloud's Vertex AI platform. By simplifying interactions with the underlying infrastructure, it empowers users to concentrate on the critical task of prompt engineering and performance analysis.

Ultimately, this project enables teams to rigorously evaluate and deploy the best possible prompts for their applications, ensuring higher quality and more reliable results from their LLMs.

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
  • Your account is listed in CODEOWNERS for the file(s).
  • Ensure the tests and linter pass (Run nox -s format from the repository root to format).

Fixes #<issue_number_goes_here> 🦕

@Michael-Santoro requested a review from a team as a code owner on September 19, 2025 at 18:59
Contributor

Summary of Changes

Hello @Michael-Santoro, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a powerful new tool, the LLM EvalKit, which streamlines the entire prompt engineering process. It provides a robust framework for developers to efficiently create, test, evaluate, and optimize LLM prompts, significantly simplifying interactions with Vertex AI and enabling a more focused approach to achieving high-quality, reliable results from large language models.

Highlights

  • Introduction of LLM EvalKit: This pull request introduces the LLM EvalKit, a comprehensive Python-based framework designed for the systematic evaluation and optimization of prompts for Large Language Models (LLMs), abstracting complexities of Google Cloud's Vertex AI platform.
  • Comprehensive Streamlit UI: A new Streamlit application provides a user-friendly interface for managing the entire prompt engineering lifecycle, including prompt creation, editing, testing, dataset management, evaluation, and optimization.
  • Prompt Management Features: Users can now create new prompts, load and edit existing ones, test them with sample inputs, and save changes as new versions, facilitating iterative development and A/B testing (a sketch of a simple test call follows this list).
  • Dataset Creation and Management: The kit includes functionality to upload and manage evaluation datasets (CSV, JSON, JSONL) directly to Google Cloud Storage, organizing them into logical 'datasets' for easy access (a sketch of the upload-and-evaluate flow also follows this list).
  • Evaluation Workbench: A dedicated evaluation workbench supports both human-in-the-loop rating and automated model-based evaluations, allowing for comparison of model responses against ground truth using various metrics.
  • Prompt Optimization Integration: The framework integrates with Vertex AI's prompt optimization capabilities, enabling users to configure and launch automated jobs to refine and improve prompt performance based on specified datasets.
  • Results and Records Browsing: Users can review the results of prompt optimization jobs, compare different prompt versions, and track evaluation outcomes over time through a 'Prompt Records Leaderboard'.
  • Detailed Tutorial Notebook: A new Jupyter notebook provides a step-by-step tutorial for setting up and using the LLM EvalKit, guiding users through prompt management, dataset creation, evaluation, and optimization workflows.
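As a rough illustration of the prompt-testing highlight above, the sketch below sends one sample input to a prompt variation on Vertex AI using the google-genai SDK. The project, location, model name, template, and sample input are assumptions for the example, not values taken from the EvalKit code.

```python
# Minimal sketch (assumed names, not from the EvalKit source): test a prompt
# variation against one sample input via the google-genai SDK on Vertex AI.
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

prompt_template = "You are a support agent. Answer briefly: {question}"
sample_input = {"question": "How do I reset my password?"}

response = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative model name
    contents=prompt_template.format(**sample_input),
)
print(response.text)
```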
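And for the dataset and evaluation highlights, a companion sketch that uploads a JSONL evaluation set to Cloud Storage and scores responses against ground truth with the Vertex AI Gen AI evaluation SDK. The bucket, dataset path, column names, and experiment name are likewise illustrative assumptions; older SDK releases expose the same classes under vertexai.preview.evaluation.

```python
# Minimal sketch (assumed names, not from the EvalKit source): upload an
# evaluation dataset to Cloud Storage, then run a model-based evaluation.
import pandas as pd
import vertexai
from google.cloud import storage
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

PROJECT_ID = "my-project"          # assumption: replace with your project
BUCKET_NAME = "my-eval-datasets"   # assumption: replace with your bucket

# 1. Upload a JSONL dataset into a logical "dataset" folder on GCS.
storage_client = storage.Client(project=PROJECT_ID)
blob = storage_client.bucket(BUCKET_NAME).blob("datasets/support-faq/eval.jsonl")
blob.upload_from_filename("eval.jsonl")

# 2. Score responses against ground truth with automated metrics.
vertexai.init(project=PROJECT_ID, location="us-central1")
eval_df = pd.DataFrame(
    {
        "prompt": ["Summarize the refund policy in one sentence."],
        "response": ["Refunds are issued within 30 days of purchase."],
        "reference": ["Customers may request a refund up to 30 days after purchase."],
    }
)
eval_task = EvalTask(
    dataset=eval_df,
    metrics=[
        "exact_match",
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY,
    ],
    experiment="prompt-eval-demo",
)
result = eval_task.evaluate()
print(result.summary_metrics)
```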

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the LLM EvalKit, a comprehensive framework for managing, evaluating, and optimizing LLM prompts. The contribution is substantial, adding a full Streamlit application with multiple pages, backend logic for interacting with Google Cloud services, and a tutorial notebook. My review focuses on ensuring adherence to the provided style guide, correctness of the implementation, and the usability of the tutorial. I've identified some critical issues, including a missing dependency and a bug in response generation, that need to be addressed. There are also several violations of the style guide regarding SDK usage and recommended model versions, along with some issues in the tutorial notebook that would prevent users from running the application correctly. Overall, this is a great addition, and with these fixes, it will be a very powerful tool.

@Michael-Santoro
Author

Line 74 of llmevalkit/pages/1_Prompt_Management.py includes json_string.replace("’", "'"). This failed the spelling check, but we feel it is important to keep in the code.
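For readers skimming the thread, a minimal illustration of what that call does; the sample string is hypothetical and the surrounding parsing logic is assumed, not copied from the PR.

```python
# Hypothetical input: a prompt pasted from a word processor, where the
# apostrophe arrives as the typographic character U+2019 ("’").
json_string = '{"prompt": "What’s the refund policy?"}'

# Same normalization as the line discussed above: map the curly apostrophe
# to a plain ASCII apostrophe before further processing.
normalized = json_string.replace("’", "'")
print(normalized)  # {"prompt": "What's the refund policy?"}
```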

Collaborator

@holtskinner left a comment


Is there any way this could be condensed down/made easier to follow? Since you already have a Jupyter Notebook for the main flow, could the other utility functions be added into it?

@holtskinner
Collaborator

> Line 74 of llmevalkit/pages/1_Prompt_Management.py includes json_string.replace("’", "'"). This failed the spelling check, but we feel it is important to keep in the code.

Yes, that's alright.

@gericdong
Contributor

@Michael-Santoro please fix the spelling. Thanks.

@gericdong
Contributor

@Michael-Santoro: can you please 1) move the code under /tools, and 2) add a brief summary to the README file describing what it is and what it does for developers? Thanks.

@holtskinner
Collaborator

> @Michael-Santoro please fix the spelling. Thanks.

Don't worry about the current spelling test failure. I'll add an exception for this because the smart quote is required.

Collaborator

@holtskinner left a comment


General question about this sample app: most, if not all, of the steps in this tutorial can be accomplished using the Cloud Console Vertex AI Evaluation page instead of this Streamlit app.

https://console.cloud.google.com/vertex-ai/evaluation/create

I'm not sure this extra UI wrapper for the APIs is needed. I wonder if it would make sense to either restructure the Notebook to show how the API calls for this example would work, or create a tutorial showing how to do this in the cloud console.
Maybe look at updating this tutorial in the docs to follow what you're doing here: https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-genai-console
