Question Answering Rewards

QA reward functions evaluate agent performance on question answering tasks using F1 score and exact match metrics.

qa_f1_reward

qa_f1_reward(final_response: str, answer: str, trajectory: List[str]) -> float

Calculate the reward for the agent's response based on the F1 score.

Parameters:

  • final_response (str) –

    The agent's predicted answer

  • answer (str) –

    The correct answer

  • trajectory (List[str]) –

    The agent's conversation trajectory

Returns:

  • dict –

    A dictionary of float values (the signature is annotated -> float, but the function returns a dict):

      • reward (float): The reward value (equal to the F1 score)

      • f1 (float): The F1 score

      • em (float): The EM score

      • precision (float): The precision score

      • recall (float): The recall score

Source code in src/agentfly/rewards/qa_reward.py
@reward(name="qa_f1_reward")
def qa_f1_reward(final_response: str, answer: str, trajectory: List[str]) -> float:
    """
    Calculate the reward for the agent's response based on the F1 score.

    Args:
        prediction (str): The agent's predicted answer
        answer (str): The correct answer
        trajectory (List[str]): The agent's conversation trajectory

    Returns:
        dict: A dictionary containing the reward, F1 score, EM score, precision, and recall
            - reward (float): The reward value
            - f1 (float): The F1 score
            - em (float): The EM score
            - precision (float): The precision score
            - recall (float): The recall score
    """
    response = final_response
    f1, precision, recall = f1_score(response, answer)
    em = em_score(response, answer)

    return {
        "reward": f1,
        "f1": f1,
        "em": em,
        "precision": precision,
        "recall": recall,
    }

Returns a dictionary with reward (F1 score), f1, em, precision, and recall.

qa_f1_reward_tool

qa_f1_reward_tool(final_response: str, answer: str, trajectory: List[str]) -> float

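Calculate the reward for the agent's response based on the F1 score, gated on tool usage. As implemented in the source below:

  • All metrics are 0.0 if the trajectory contains at most one message with the "tool" role

  • Otherwise the reward is the F1 score, and f1, em, precision, and recall are computed as in qa_f1_reward

The docstring's 0.0 / 0.1 / 1.0 tiers do not match this implementation.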

Parameters:

  • final_response (str) –

    The agent's predicted answer

  • answer (str) –

    The correct answer

  • trajectory (List[str]) –

    The agent's conversation trajectory. Despite the List[str] annotation, each message is read as a dict with a "role" key (msg["role"]).

Returns:

  • dict –

    A dictionary of float values (the signature is annotated -> float, but the function returns a dict):

      • reward (float): The reward value (the F1 score, or 0.0 when the tool-usage gate fails)

      • f1 (float): The F1 score

      • em (float): The EM score

      • precision (float): The precision score

      • recall (float): The recall score

Source code in src/agentfly/rewards/qa_reward.py
@reward(name="qa_f1_reward_tool")
def qa_f1_reward_tool(final_response: str, answer: str, trajectory: List[str]) -> float:
    """
    Calculate the reward for the agent's response based on the F1 score and EM score.
    - 0.0 if no tool used
    - 0.1 if tool used but answer incorrect
    - 1.0 if tool used and answer correct

    Args:
        prediction (str): The agent's predicted answer
        answer (str): The correct answer
        trajectory (List[str]): The agent's conversation trajectory

    Returns:
        dict: A dictionary containing the reward, F1 score, EM score, precision, and recall
            - reward (float): The reward value
            - f1 (float): The F1 score
            - em (float): The EM score
            - precision (float): The precision score
            - recall (float): The recall score
    """
    # has_called_tool = False
    call_tool_count = 0
    for msg in trajectory:
        if msg["role"] == "tool":
            # has_called_tool = True
            call_tool_count += 1

    rewards_dict = {}
    # Require at least two tool calls (since the last tool call is the answer)
    if call_tool_count <= 1:
        rewards_dict.update(
            {
                "reward": 0.0,
                "f1": 0.0,
                "em": 0.0,
                "precision": 0.0,
                "recall": 0.0,
            }
        )
    elif call_tool_count > 1:
        f1, precision, recall = f1_score(final_response, answer)
        em = em_score(final_response, answer)
        rewards_dict.update(
            {
                "reward": f1,
                "f1": f1,
                "em": em,
                "precision": precision,
                "recall": recall,
            }
        )
    else:
        raise ValueError(
            f"Invalid prediction or trajectory for qa reward with format: Trajectory: {trajectory}"
        )

    return rewards_dict

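Same metrics as qa_f1_reward, but every value is gated on tool usage: per the source above, the trajectory must contain more than one message with the "tool" role (the last tool call is treated as the answer), otherwise all metrics, including the reward, are 0.0.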

Technical Details

Text Normalization:

  • Removes articles (a, an, the)

  • Normalizes whitespace

  • Removes punctuation

  • Converts to lowercase
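The normalization steps listed above can be sketched as follows. This is a minimal illustration in the style of the SQuAD evaluation script, not the library's own helper: the name normalize_answer and the order of the steps are assumptions.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Hypothetical helper: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    # Replace standalone articles with a space so neighbouring words stay separated.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(text.split())


print(normalize_answer("The  Eiffel Tower!"))
```

Lowercasing before article removal means "The" and "the" are treated alike; a different step order could give different results for edge cases such as punctuation attached to an article.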

F1 Score Calculation:

  • Token-based overlap between prediction and ground truth

  • Precision = common_tokens / prediction_tokens

  • Recall = common_tokens / ground_truth_tokens

  • F1 = 2 * (precision * recall) / (precision + recall)
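The formulas above can be illustrated with a short token-overlap sketch. The function name token_f1 is hypothetical, and it assumes both inputs are already normalized and split into tokens; the library's f1_score may differ in detail, although the (f1, precision, recall) return order mirrors how it is unpacked in the source listing.

```python
from collections import Counter
from typing import List, Tuple


def token_f1(pred_tokens: List[str], truth_tokens: List[str]) -> Tuple[float, float, float]:
    """Hypothetical helper returning (f1, precision, recall) for two token lists."""
    # Multiset intersection: each token counts as many times as it appears in both lists.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        # No overlap (or an empty input): every metric is zero, and we avoid dividing by zero.
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall


# "Paris is the capital" vs. "Paris", after normalization drops the article "the".
f1, precision, recall = token_f1(["paris", "is", "capital"], ["paris"])
print(f1, precision, recall)
```

In this example one of three predicted tokens matches the single ground-truth token, so precision is 1/3, recall is 1, and F1 is approximately 0.5.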

Exact Match (EM):

  • Binary score: 1.0 if normalized answers are identical, 0.0 otherwise

  • Special handling for yes/no/noanswer responses

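Tool Usage Detection:

  • Counts messages with the "tool" role in the trajectory

  • qa_f1_reward_tool requires more than one such message (the source treats the last tool call as the answer); with zero or one, all metrics are 0.0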
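The gating logic in the qa_f1_reward_tool source listing can be sketched in isolation. The helper names and the min_tool_messages parameter below are hypothetical; the default of 2 reflects the `call_tool_count <= 1` check in that listing.

```python
from typing import Dict, List


def count_tool_messages(trajectory: List[Dict[str, str]]) -> int:
    """Hypothetical helper: count messages whose role is "tool"."""
    return sum(1 for msg in trajectory if msg.get("role") == "tool")


def gate_reward(f1: float, trajectory: List[Dict[str, str]], min_tool_messages: int = 2) -> float:
    """Return f1 only if the trajectory has enough tool messages, otherwise 0.0."""
    if count_tool_messages(trajectory) >= min_tool_messages:
        return f1
    return 0.0


one_tool = [
    {"role": "assistant", "content": "I need to search"},
    {"role": "tool", "content": "search results"},
]
two_tools = one_tool + [{"role": "tool", "content": "answer submitted"}]

print(gate_reward(0.8, one_tool))   # gate fails: a single tool message is not enough
print(gate_reward(0.8, two_tools))  # gate passes: the F1 value is returned unchanged
```

Using msg.get("role") rather than msg["role"] is a choice made for this sketch so that messages without a role are skipped instead of raising KeyError; the source listing indexes the key directly.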

Example Usage:

from agentfly.rewards import qa_f1_reward, qa_f1_reward_tool

# Basic F1 evaluation
result = qa_f1_reward(
    final_response="Paris is the capital",
    answer="Paris",
    trajectory=[],
)
print(result)
# {"reward": <f1>, "f1": <f1>, "em": <em>, "precision": ..., "recall": ...}

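# With tool usage requirement: per the source above, the reward is 0.0
# unless the trajectory contains more than one "tool" message
trajectory = [
    {"role": "assistant", "content": "I need to search for information"},
    {"role": "tool", "content": "search results"},
    {"role": "assistant", "content": "Based on my search, the answer is Paris"},
    {"role": "tool", "content": "answer submitted"},
]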
result = qa_f1_reward_tool(
    final_response="Paris",
    answer="Paris",
    trajectory=trajectory,
)
print(result)

Special Cases:

  • Yes/No questions: Must match exactly or return 0.0

  • Empty predictions: Return 0.0 for all metrics

  • No token overlap: Return 0.0 for all metrics

Use Cases:

  • Reading comprehension evaluation

  • Information retrieval task assessment

  • Knowledge-based question answering

  • Tool-augmented QA system evaluation