Question Answering Rewards¶
QA reward functions evaluate agent performance on question answering tasks using F1 score and exact match metrics.
qa_f1_reward¶
qa_f1_reward(final_response: str, answer: str, trajectory: List[str]) -> float
Calculate the reward for the agent's response based on the F1 score.
Parameters:
- final_response (str) – The agent's predicted answer
- answer (str) – The correct answer
- trajectory (List[str]) – The agent's conversation trajectory
Returns:
- dict (float) – A dictionary containing the reward, F1 score, EM score, precision, and recall:
  - reward (float): The reward value
  - f1 (float): The F1 score
  - em (float): The EM score
  - precision (float): The precision score
  - recall (float): The recall score
Source code in src/agentfly/rewards/qa_reward.py
Returns a dictionary with reward (F1 score), f1, em, precision, and recall.
qa_f1_reward_tool¶
qa_f1_reward_tool(final_response: str, answer: str, trajectory: List[str]) -> float
Calculate the reward for the agent's response based on the F1 score and EM score:
- 0.0 if no tool used
- 0.1 if tool used but answer incorrect
- 1.0 if tool used and answer correct
Parameters:
- final_response (str) – The agent's predicted answer
- answer (str) – The correct answer
- trajectory (List[str]) – The agent's conversation trajectory
Returns:
- dict (float) – A dictionary containing the reward, F1 score, EM score, precision, and recall:
  - reward (float): The reward value
  - f1 (float): The F1 score
  - em (float): The EM score
  - precision (float): The precision score
  - recall (float): The recall score
Source code in src/agentfly/rewards/qa_reward.py
Same metrics as qa_f1_reward but the reward is gated on the agent having made at least one tool call (otherwise reward is 0.0).
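The three reward tiers listed for qa_f1_reward_tool can be sketched as below. This is a hypothetical illustration, not the library's implementation: the helper name `tiered_tool_reward` is invented here, and it assumes that "answer correct" means an EM score of 1.0, which the documentation does not state.

```python
def tiered_tool_reward(num_tool_calls: int, em: float) -> float:
    """Hypothetical helper: map tool usage and correctness to a reward tier.

    Assumption: an answer counts as "correct" when its EM score is 1.0.
    """
    if num_tool_calls == 0:
        # No tool call at all: the reward is gated to zero.
        return 0.0
    # A tool was used: full reward if correct, small reward otherwise.
    return 1.0 if em == 1.0 else 0.1


print(tiered_tool_reward(0, 1.0))  # 0.0
print(tiered_tool_reward(2, 0.0))  # 0.1
print(tiered_tool_reward(1, 1.0))  # 1.0
```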
Technical Details¶
Text Normalization:
- Removes articles (a, an, the)
- Normalizes whitespace
- Removes punctuation
- Converts to lowercase
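A minimal sketch of these normalization steps follows. The function name `normalize_answer` and the order in which the steps are applied are assumptions made for illustration; the actual code in qa_reward.py may differ.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Replace standalone articles with a space so neighbours do not merge.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


print(normalize_answer("The  Eiffel Tower, in Paris!"))  # eiffel tower in paris
```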
F1 Score Calculation:
- Token-based overlap between prediction and ground truth
- Precision = common_tokens / prediction_tokens
- Recall = common_tokens / ground_truth_tokens
- F1 = 2 * (precision * recall) / (precision + recall)
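The formulas above can be sketched with a multiset intersection over tokens. The helper `token_f1` is a hypothetical name, and it assumes both inputs have already been normalized and split on whitespace.

```python
from collections import Counter
from typing import List, Tuple


def token_f1(pred_tokens: List[str], truth_tokens: List[str]) -> Tuple[float, float, float]:
    """Return (f1, precision, recall) for two token lists."""
    # Counter intersection keeps each shared token min(count) times.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        # No overlap (this also covers empty inputs): all metrics are zero.
        return 0.0, 0.0, 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall


f1, precision, recall = token_f1("paris is capital".split(), "paris".split())
print(f1, precision, recall)
```

Here the prediction has three tokens and one of them matches the single ground-truth token, so precision is 1/3, recall is 1, and F1 comes out at 0.5 (up to floating-point rounding).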
Exact Match (EM):
- Binary score: 1.0 if normalized answers are identical, 0.0 otherwise
- Special handling for yes/no/noanswer responses
Tool Usage Detection:
- Counts messages with "tool" role in trajectory
- qa_f1_reward_tool requires at least one tool call
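Counting tool messages might look like the sketch below. It assumes the trajectory is a list of message dictionaries with a "role" key, as in the usage example on this page; `count_tool_messages` is a hypothetical helper, not a function exported by agentfly.

```python
from typing import Any, Dict, List


def count_tool_messages(trajectory: List[Dict[str, Any]]) -> int:
    """Count the messages whose role is "tool"."""
    return sum(
        1
        for message in trajectory
        # Skip anything that is not a dict so malformed entries do not raise.
        if isinstance(message, dict) and message.get("role") == "tool"
    )


trajectory = [
    {"role": "assistant", "content": "I need to search for information"},
    {"role": "tool", "content": "search results"},
    {"role": "assistant", "content": "Based on my search, the answer is Paris"},
]
print(count_tool_messages(trajectory))  # 1
```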
Example Usage:
from agentfly.rewards import qa_f1_reward, qa_f1_reward_tool

# Basic F1 evaluation
result = qa_f1_reward(
    final_response="Paris is the capital",
    answer="Paris",
    trajectory=[],
)
print(result)
# {"reward": <f1>, "f1": <f1>, "em": <em>, "precision": ..., "recall": ...}

# With tool usage requirement
trajectory = [
    {"role": "assistant", "content": "I need to search for information"},
    {"role": "tool", "content": "search results"},
    {"role": "assistant", "content": "Based on my search, the answer is Paris"},
]
result = qa_f1_reward_tool(
    final_response="Paris",
    answer="Paris",
    trajectory=trajectory,
)
print(result)
Special Cases:
- Yes/No questions: Must match exactly or return 0.0
- Empty predictions: Return 0.0 for all metrics
- No token overlap: Return 0.0 for all metrics
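The first two special cases can be expressed as a guard that runs before any token comparison. This is a sketch under assumptions: the helper name `short_circuits_to_zero` is hypothetical, and applying the yes/no/noanswer check to both the prediction and the ground truth is a guess at the behavior, not something the documentation confirms. The no-overlap case needs no guard, since a token overlap of zero already gives zero precision and recall.

```python
# Answers that must match exactly rather than by token overlap.
SPECIAL_ANSWERS = {"yes", "no", "noanswer"}


def short_circuits_to_zero(normalized_prediction: str, normalized_answer: str) -> bool:
    """Return True when all metrics should be 0.0 without computing overlap."""
    if not normalized_prediction:
        # Empty prediction: nothing to score.
        return True
    if normalized_prediction == normalized_answer:
        # Identical strings are never zeroed out.
        return False
    # A mismatch involving a special answer on either side scores zero.
    return (
        normalized_prediction in SPECIAL_ANSWERS
        or normalized_answer in SPECIAL_ANSWERS
    )


print(short_circuits_to_zero("yes", "no"))             # True
print(short_circuits_to_zero("yes", "yes"))            # False
print(short_circuits_to_zero("paris", "paris france"))  # False
```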
Use Cases:
- Reading comprehension evaluation
- Information retrieval task assessment
- Knowledge-based question answering
- Tool-augmented QA system evaluation