
Commit b0697bf

Merge pull request #106 from Azure-Samples/update-package
Update to azure-ai-evaluations
2 parents e653bd4 + ad3cbbe commit b0697bf

15 files changed: 114 additions, 110 deletions

.env.sample

Lines changed: 4 additions & 2 deletions
```diff
@@ -2,12 +2,14 @@ OPENAI_HOST="azure"
 OPENAI_GPT_MODEL="gpt-4"
 # For Azure OpenAI only:
 AZURE_OPENAI_EVAL_DEPLOYMENT="<deployment-name>"
-AZURE_OPENAI_SERVICE="<service-name>"
+AZURE_OPENAI_ENDPOINT="https://<service-name>.openai.azure.com"
 AZURE_OPENAI_KEY=""
+AZURE_OPENAI_TENANT_ID=""
 # For openai.com only:
 OPENAICOM_KEY=""
 OPENAICOM_ORGANIZATION=""
 # For generating QA based on search index:
-AZURE_SEARCH_SERVICE="<service-name>"
+AZURE_SEARCH_ENDPOINT="https://<service-name>.search.windows.net"
 AZURE_SEARCH_INDEX="<index-name>"
 AZURE_SEARCH_KEY=""
+AZURE_SEARCH_TENANT_ID=""
```
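
The `.env` now points at full endpoints rather than bare service names and adds optional tenant IDs for keyless auth. A minimal sketch of consuming these settings, assuming the scripts load `.env` with python-dotenv and build the client roughly like this (illustrative only, not the repo's actual `service_setup` code):

```python
# Illustrative sketch: reading the new endpoint-style settings and building an
# Azure OpenAI client, with keyless auth as the default and a key as fallback.
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv  # assumes python-dotenv is installed
from openai import AzureOpenAI

load_dotenv()

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]  # e.g. https://<service-name>.openai.azure.com
api_key = os.environ.get("AZURE_OPENAI_KEY") or None

if api_key:
    # Key-based access when AZURE_OPENAI_KEY is set in .env.
    client = AzureOpenAI(azure_endpoint=endpoint, api_key=api_key, api_version="2024-06-01")
else:
    # Keyless access via Entra ID (AZURE_OPENAI_TENANT_ID can scope a tenant-aware credential).
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version="2024-06-01",
    )
```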

.github/workflows/azure-dev.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -28,7 +28,7 @@ jobs:
       AZURE_CREDENTIALS: ${{ secrets.AZURE_CREDENTIALS }}
       # project specific
       OPENAI_HOST: ${{ vars.OPENAI_HOST }}
-      AZURE_OPENAI_SERVICE: ${{ vars.AZURE_OPENAI_SERVICE }}
+      AZURE_OPENAI_ENDPOINT: ${{ vars.AZURE_OPENAI_ENDPOINT }}
       AZURE_OPENAI_RESOURCE_GROUP: ${{ vars.AZURE_OPENAI_RESOURCE_GROUP }}
       OPENAI_ORGANIZATION: ${{ vars.OPENAI_ORGANIZATION }}
       OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

README.md

Lines changed: 5 additions & 12 deletions
````diff
@@ -57,18 +57,11 @@ We've made that easy to deploy with the `azd` CLI tool.
 1. Install the [Azure Developer CLI](https://aka.ms/azure-dev/install)
 2. Run `azd auth login` to log in to your Azure account
 3. Run `azd up` to deploy a new GPT-4 instance
-4. Create a `.env` file based on the provisioned resources by running one of the following commands.
-
-    Bash:
+4. Create a `.env` file based on the provisioned resources by copying `.env.sample` and filling in the required values.
+    You can run this command to see the deployed values:
 
     ```shell
-    azd env get-values > .env
-    ```
-
-    PowerShell:
-
-    ```powershell
-    $output = azd env get-values; Add-Content -Path .env -Value $output;
+    azd env get-values
     ```
 
 ### Using an existing Azure OpenAI instance
@@ -80,7 +73,7 @@ If you already have an Azure OpenAI instance, you can use that instead of creati
 
     ```shell
     AZURE_OPENAI_EVAL_DEPLOYMENT="<deployment-name>"
-    AZURE_OPENAI_SERVICE="<service-name>"
+    AZURE_OPENAI_ENDPOINT="https://<service-name>.openai.azure.com"
     ```
 
 3. The scripts default to keyless access (via `AzureDefaultCredential`), but you can optionally use a key by setting `AZURE_OPENAI_KEY` in `.env`.
@@ -129,7 +122,7 @@ This repo includes a script for generating questions and answers from documents
 2. Fill in the values for your Azure AI Search instance:
 
     ```shell
-    AZURE_SEARCH_SERVICE="<service-name>"
+    AZURE_SEARCH_ENDPOINT="https://<service-name>.search.windows.net"
     AZURE_SEARCH_INDEX="<index-name>"
     AZURE_SEARCH_KEY=""
    ```
````
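
The last hunk switches the QA-generation settings to an explicit search endpoint. A hedged sketch of connecting to that index with the new variable (placeholder values; the repo's generator script may wire this up differently):

```python
# Illustrative sketch: connecting to the Azure AI Search index named in .env,
# using the new AZURE_SEARCH_ENDPOINT variable. Not the repo's actual script.
import os

from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]  # https://<service-name>.search.windows.net
index_name = os.environ["AZURE_SEARCH_INDEX"]
key = os.environ.get("AZURE_SEARCH_KEY")

# Use the key if one is provided, otherwise fall back to keyless (Entra ID) auth.
credential = AzureKeyCredential(key) if key else DefaultAzureCredential()
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)

# Pull a few documents to confirm the index is reachable.
for doc in search_client.search(search_text="*", top=3):
    print(doc)
```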

azure.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ metadata:
 pipeline:
   variables:
     - OPENAI_HOST
-    - AZURE_OPENAI_SERVICE
+    - AZURE_OPENAI_ENDPOINT
     - AZURE_OPENAI_RESOURCE_GROUP
     - OPENAI_ORGANIZATION
   secrets:
```

pyproject.toml

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,12 +1,12 @@
 [tool.ruff]
 line-length = 120
-target-version = "py311"
+target-version = "py39"
 lint.select = ["E", "F", "I", "UP"]
 lint.ignore = ["D203"]
 
 [tool.black]
 line-length = 120
-target-version = ["py311"]
+target-version = ["py39"]
 
 [tool.pytest.ini_options]
 addopts = "-ra"
```

scripts/evaluate.py

Lines changed: 9 additions & 13 deletions
```diff
@@ -114,16 +114,12 @@ def run_evaluation(
         return False
 
     logger.info("Sending a test chat completion to the GPT deployment to ensure it is running...")
-    try:
-        gpt_response = service_setup.get_openai_client(openai_config).chat.completions.create(
-            model=openai_config.model,
-            messages=[{"role": "user", "content": "Hello!"}],
-            n=1,
-        )
-        logger.info('Successfully received response from GPT: "%s"', gpt_response.choices[0].message.content)
-    except Exception as e:
-        logger.error("Failed to send a test chat completion to the GPT deployment due to error: \n%s", e)
-        return False
+    gpt_response = service_setup.get_openai_client(openai_config).chat.completions.create(
+        model=openai_config["model"],
+        messages=[{"role": "user", "content": "Hello!"}],
+        n=1,
+    )
+    logger.info('Successfully received response from GPT: "%s"', gpt_response.choices[0].message.content)
 
     logger.info("Starting evaluation...")
     for metric in requested_metrics:
@@ -149,8 +145,8 @@ def evaluate_row(row):
         output.update(target_response)
         for metric in requested_metrics:
             result = metric.evaluator_fn(openai_config=openai_config)(
-                question=row["question"],
-                answer=output["answer"],
+                query=row["question"],
+                response=output["answer"],
                 context=output["context"],
                 ground_truth=row["truth"],
             )
@@ -183,7 +179,7 @@ def evaluate_row(row):
 
     with open(results_dir / "evaluate_parameters.json", "w", encoding="utf-8") as parameters_file:
         parameters = {
-            "evaluation_gpt_model": openai_config.model,
+            "evaluation_gpt_model": openai_config["model"],
             "evaluation_timestamp": int(time.time()),
             "testdata_path": str(testdata_path),
             "target_url": target_url,
```

scripts/evaluate_metrics/builtin_metrics.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
-from promptflow.evals.evaluators import (
+from azure.ai.evaluation import (
     CoherenceEvaluator,
     F1ScoreEvaluator,
     FluencyEvaluator,
```
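
With the import moved from `promptflow.evals.evaluators` to `azure.ai.evaluation`, the built-in evaluators are constructed from a model configuration and called with `query`/`response`. A hedged sketch (placeholder endpoint, deployment, and key; exact result keys can vary by package version):

```python
# Hedged sketch of calling a built-in evaluator from azure.ai.evaluation.
from azure.ai.evaluation import CoherenceEvaluator

# Azure OpenAI configuration for the judge model (placeholder values).
model_config = {
    "azure_endpoint": "https://<service-name>.openai.azure.com",
    "azure_deployment": "<deployment-name>",
    "api_key": "<key>",
}

coherence = CoherenceEvaluator(model_config)

# The query/response keywords match the rename in evaluate.py above.
result = coherence(
    query="What are the main goals of the Perseverance Mars rover mission?",
    response="To search for signs of ancient life and collect samples for possible return to Earth.",
)
print(result)  # e.g. {"coherence": 5.0, ...}
```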

scripts/evaluate_metrics/code_metrics.py

Lines changed: 14 additions & 14 deletions
```diff
@@ -12,11 +12,11 @@ class AnswerLengthMetric(BaseMetric):
 
     @classmethod
     def evaluator_fn(cls, **kwargs):
-        def answer_length(*, answer, **kwargs):
-            if answer is None:
-                logger.warning("Received answer of None, can't compute answer_length metric. Setting to -1.")
+        def answer_length(*, response, **kwargs):
+            if response is None:
+                logger.warning("Received response of None, can't compute answer_length metric. Setting to -1.")
                 return {cls.METRIC_NAME: -1}
-            return {cls.METRIC_NAME: len(answer)}
+            return {cls.METRIC_NAME: len(response)}
 
         return answer_length
 
@@ -37,11 +37,11 @@ class HasCitationMetric(BaseMetric):
 
     @classmethod
     def evaluator_fn(cls, **kwargs):
-        def has_citation(*, answer, **kwargs):
-            if answer is None:
-                logger.warning("Received answer of None, can't compute has_citation metric. Setting to -1.")
+        def has_citation(*, response, **kwargs):
+            if response is None:
+                logger.warning("Received response of None, can't compute has_citation metric. Setting to -1.")
                 return {cls.METRIC_NAME: -1}
-            return {cls.METRIC_NAME: bool(re.search(r"\[[^\]]+\]", answer))}
+            return {cls.METRIC_NAME: bool(re.search(r"\[[^\]]+\]", response))}
 
         return has_citation
 
@@ -60,14 +60,14 @@ class CitationMatchMetric(BaseMetric):
 
     @classmethod
     def evaluator_fn(cls, **kwargs):
-        def citation_match(*, answer, ground_truth, **kwargs):
-            if answer is None:
-                logger.warning("Received answer of None, can't compute citation_match metric. Setting to -1.")
+        def citation_match(*, response, ground_truth, **kwargs):
+            if response is None:
+                logger.warning("Received response of None, can't compute citation_match metric. Setting to -1.")
                 return {cls.METRIC_NAME: -1}
-            # Return true if all citations in the truth are present in the answer
+            # Return true if all citations in the truth are present in the response
             truth_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", ground_truth))
-            answer_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", answer))
-            citation_match = truth_citations.issubset(answer_citations)
+            response_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))
+            citation_match = truth_citations.issubset(response_citations)
             return {cls.METRIC_NAME: citation_match}
 
         return citation_match
```
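
Custom code metrics now receive the model output under the `response` keyword. A hypothetical metric in the same shape, shown standalone without the repo's `BaseMetric` base class (the `word_count` name and logic are invented for illustration):

```python
# Hypothetical custom metric following the new response-keyword contract.
import logging

logger = logging.getLogger(__name__)

METRIC_NAME = "word_count"


def evaluator_fn(**kwargs):
    def word_count(*, response, **kwargs):
        if response is None:
            logger.warning("Received response of None, can't compute word_count metric. Setting to -1.")
            return {METRIC_NAME: -1}
        return {METRIC_NAME: len(response.split())}

    return word_count


if __name__ == "__main__":
    evaluator = evaluator_fn()
    print(evaluator(response="The rover collects rock and soil samples."))  # {'word_count': 7}
```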

scripts/evaluate_metrics/prompts/dontknowness.prompty

Lines changed: 3 additions & 3 deletions
```diff
@@ -27,7 +27,7 @@ sample:
   answer: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
 ---
 system:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
 
 user:
 The "I don't know"-ness metric is a measure of how much an answer conveys the lack of knowledge or uncertainty, which is useful for making sure a chatbot for a particular domain doesn't answer outside that domain. Score the I-dont-know-ness of the answer between one to five stars using the following rating scale:
@@ -59,6 +59,6 @@ question: Where were The Beatles formed?
 answer: I'm sorry, I don't know, that answer is not in my sources.
 stars: 5
 
-question: {{question}}
-answer: {{answer}}
+question: {{query}}
+answer: {{response}}
 stars:
```

scripts/evaluate_metrics/prompts/mycoherence.prompty

Lines changed: 7 additions & 5 deletions
```diff
@@ -6,7 +6,6 @@ model:
   configuration:
     type: azure_openai
     azure_deployment: ${env:AZURE_DEPLOYMENT}
-    api_key: ${env:AZURE_OPENAI_API_KEY}
     azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
   parameters:
     temperature: 0.0
@@ -18,11 +17,14 @@ model:
     type: text
 
 inputs:
-  question:
+  query:
     type: string
-  answer:
+  response:
     type: string
 
+sample:
+  query: What are the main goals of Perseverance Mars rover mission?
+  response: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
 ---
 system:
 You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
@@ -57,6 +59,6 @@ question: What can you tell me about climate change and its effects on the envir
 answer: Climate change has far-reaching effects on the environment. Rising temperatures result in the melting of polar ice caps, contributing to sea-level rise. Additionally, more frequent and severe weather events, such as hurricanes and heatwaves, can cause disruption to ecosystems and human societies alike.
 stars: 5
 
-question: {{question}}
-answer: {{answer}}
+question: {{query}}
+answer: {{response}}
 stars:
```
