Training Config Reference (Hydra)

AgentFly trains via verl's PPO trainer (python3 -m agentfly.cli train ...), which is Hydra-driven. Every key in your training script of the form namespace.path=value overrides a node in the underlying Hydra config tree.

The full schema lives upstream in verl: verl/verl/trainer/config/ppo_trainer.yaml. This page lists the keys you actually need to set in practice, organized by namespace, with notes on what each one controls. For the canonical end-to-end example, see First Training.
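
As a minimal sketch of the override form, the snippet below collects a few namespace.path=value arguments and prints the command it would run instead of launching training. The values are borrowed from the example script later on this page; the dry-run echo and the key/value split are illustrations only, not something AgentFly provides.

```shell
# Each argument is one Hydra override: a dotted path into the config tree, "=", and a value.
set -- \
    data.train_batch_size=64 \
    agent.num_chains=8 \
    algorithm.adv_estimator=grpo

# Dry run: build and print the command instead of launching it.
cmd="python3 -m agentfly.cli train $*"
echo "$cmd"

# Split one override into its config path and its value with parameter expansion.
first=$1
key=${first%%=*}     # everything before the first "="
value=${first#*=}    # everything after the first "="
echo "path=$key value=$value"
```

Reading the path left to right gives the route through the tree: data.train_batch_size sets the train_batch_size node under data.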

Namespaces at a Glance

| Namespace | Owns |
| --- | --- |
| data.* | Training/validation file paths and batch sizes |
| agent.* | Agent type, tools, reward, generation, rollout shape |
| algorithm.* | RL algorithm (advantage estimator, KL coefficient) |
| actor_rollout_ref.* | Actor model, rollout engine, reference model (FSDP, vLLM, sequence parallel) |
| critic.* | Critic model (only used when adv_estimator=gae) |
| trainer.* | Logging, checkpointing, GPU/node topology, total steps |

data.* — Data Loading

| Key | Description |
| --- | --- |
| data.train_files | Path to the training JSON file. See Data Preparation. |
| data.val_files | Path to the validation JSON file. |
| data.train_batch_size | Number of prompts per training step. |
| data.val_batch_size | Number of prompts per validation step. |
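
A launch script can check both paths before starting a run. The sketch below assumes the WebShop paths from the example script on this page; check_dataset is a hypothetical helper, not an AgentFly command.

```shell
# Hypothetical helper (not part of AgentFly): succeed only if the file exists and is non-empty.
check_dataset() {
    if [ ! -s "$1" ]; then
        echo "missing or empty dataset: $1" >&2
        return 1
    fi
}

train_dataset="./data/rlhf/webshop/webshop_goals_train.json"
eval_dataset="./data/rlhf/webshop/webshop_goals_val.json"

if check_dataset "$train_dataset" && check_dataset "$eval_dataset"; then
    echo "datasets found"
else
    echo "fix data.train_files / data.val_files before launching" >&2
fi
```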

agent.* — Agent / Rollout

These are the keys defined by AgentFly itself (the rest are inherited from verl).

| Key | Description |
| --- | --- |
| agent.use_agent | True to enable the agent rollout path. Always True for AgentFly training. |
| agent.init_config.agent_type | One of hf, react, code, gui, action, etc. Picks the agent class via AutoAgent.from_config. |
| agent.init_config.model_name_or_path | HF model name or local path. |
| agent.init_config.template | Chat template name (e.g. qwen2.5, qwen2.5-vl, qwen2.5-no-system-tool). |
| agent.init_config.tools | List of tool names available to the agent, e.g. [calculator] or [code_interpreter]. |
| agent.init_config.reward_name | Name of the reward function registered via @reward(name=...). |
| agent.init_config.backend | Generation backend during training (typically async_verl). |
| agent.init_config.tool_parser_name | Optional tool-call parser identifier (e.g. hermes). |
| agent.init_config.max_model_len | Max sequence length the model engine should support. |
| agent.max_turns | Max number of generation/tool turns per chain. |
| agent.num_chains | Number of parallel chains (trajectories) per prompt. Higher → more samples for GRPO/RLOO. |
| agent.train_on_last_turn | If True, only compute loss on the last assistant turn. Useful for stability on long rollouts. |
| agent.generation_config.max_tokens | Max tokens generated per turn. |
| agent.run_config.context_config.resource_backend | Where containers are allocated: local or a cluster backend. |
| agent.run_config.max_concurrent_chains | Cap on chains running concurrently (memory protection). null for no extra cap. |

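Some scripts, including the canonical example on this page, use agent.run_config.max_turns and agent.run_config.num_chains. These are equivalent to agent.max_turns / agent.num_chains and reach the same Hydra node. The canonical example also sets the per-turn token limit as agent.run_config.generation_config.max_tokens.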
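
The turn, token, and length limits interact. The arithmetic below is a rough sanity check using the values from the example script, under the assumption (not stated by AgentFly) that max_model_len has to cover a whole chain: prompt, tool observations, and every generated turn.

```shell
max_model_len=16384            # agent.init_config.max_model_len
max_turns=15                   # agent.max_turns
max_new_tokens_per_turn=384    # agent.generation_config.max_tokens

# Upper bound on the tokens the model itself generates across one chain.
generated_budget=$((max_turns * max_new_tokens_per_turn))

# What is left for the system prompt, the task, and tool observations.
remaining=$((max_model_len - generated_budget))

echo "generated<=$generated_budget remaining=$remaining"
if [ "$remaining" -le 0 ]; then
    echo "max_model_len cannot hold even the generated tokens" >&2
fi
```

With these values the model can generate at most 5760 tokens per chain, which leaves 10624 tokens of headroom for everything else.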

algorithm.* — RL Algorithm

Key Description
algorithm.adv_estimator Advantage estimator: grpo, rloo, reinforce_plus_plus, remax, or gae. grpo is the most common for tool-using agents.
algorithm.kl_ctrl.kl_coef KL penalty coefficient against the reference model. Common values: 0.0 to 0.001.

actor_rollout_ref.* — Actor / Rollout / Reference

This namespace mirrors verl's grouping: the same set of model-side parameters is shared between the actor (the policy being trained), the rollout worker (which generates trajectories), and the reference model (used for KL).

Actor (training)

| Key | Description |
| --- | --- |
| actor_rollout_ref.actor.optim.lr | Learning rate. |
| actor_rollout_ref.actor.optim.lr_warmup_steps_ratio | Fraction of training spent on warmup. |
| actor_rollout_ref.actor.ppo_mini_batch_size | Mini-batch size for PPO updates within one outer step. |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu | Per-GPU micro-batch (controls memory). |
| actor_rollout_ref.actor.use_kl_loss | If True, add a KL term to the actor loss, instead of or alongside a KL penalty applied to the reward. |
| actor_rollout_ref.actor.kl_loss_coef | Coefficient for the in-loss KL term. |
| actor_rollout_ref.actor.kl_loss_type | Form of the KL term: mse, kl, etc. |
| actor_rollout_ref.actor.entropy_coeff | Entropy regularization coefficient. Small values (≤ 1e-3) help stabilize tool-calling agents. |
| actor_rollout_ref.actor.fsdp_config.param_offload | Offload params to CPU. Often needed for large models on small clusters. |
| actor_rollout_ref.actor.fsdp_config.optimizer_offload | Offload optimizer state to CPU. |
| actor_rollout_ref.actor.ulysses_sequence_parallel_size | Sequence-parallel degree (Ulysses). Useful for very long contexts. |
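
The example script on this page sets the mini-batch size to batch_size * num_chains and labels that "full on-policy". The sketch below works through that arithmetic. It assumes the mini-batch size is counted in trajectories, which is what the script's comment suggests but is not confirmed here.

```shell
batch_size=64      # data.train_batch_size (prompts per step)
num_chains=8       # agent.num_chains (trajectories per prompt)

# Trajectories collected in one outer step.
trajectories=$((batch_size * num_chains))

# Mini-batch size exactly as the example script computes it.
mini_batch_size=$((batch_size * num_chains))

# Policy updates per outer step, assuming mini-batches are counted in trajectories.
updates_per_step=$((trajectories / mini_batch_size))

echo "trajectories=$trajectories mini_batch=$mini_batch_size updates_per_step=$updates_per_step"
```

One update per outer step means every gradient step uses trajectories sampled from the current policy. Under the same assumption, halving mini_batch_size would give two updates per step, the second of which is computed on slightly stale samples.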

Model

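| Key | Description |
| --- | --- |
| actor_rollout_ref.model.path | Model path for the actor, rollout, and reference workers. Set it to the same value as agent.init_config.model_name_or_path. |
| actor_rollout_ref.model.use_remove_padding | Whether to pack sequences to remove padding. Disable if it conflicts with your masking logic. |
| actor_rollout_ref.model.enable_gradient_checkpointing | Recompute activations during the backward pass, trading extra compute for lower memory. |
| actor_rollout_ref.model.enable_activation_offload | Offload activations as well: a stronger memory-for-compute trade than gradient checkpointing alone. |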

Rollout (inference during training)

| Key | Description |
| --- | --- |
| actor_rollout_ref.rollout.name | Rollout engine, typically vllm. |
| actor_rollout_ref.rollout.gpu_memory_utilization | Fraction of GPU memory the rollout engine may use. |
| actor_rollout_ref.rollout.tensor_model_parallel_size | TP degree for the rollout engine. |
| actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu | Per-GPU batch when re-computing log-probs over rollout outputs. |

Reference model

| Key | Description |
| --- | --- |
| actor_rollout_ref.ref.fsdp_config.param_offload | Offload reference params (almost always True). |
| actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu | Per-GPU batch for reference log-probs. |
| actor_rollout_ref.ref.ulysses_sequence_parallel_size | Sequence-parallel degree on the reference. |

critic.* — Critic Model

Only relevant when algorithm.adv_estimator=gae (GRPO and similar are critic-free).

| Key | Description |
| --- | --- |
| critic.model.path | Critic model path (often the same as the actor). |
| critic.model.enable_activation_offload | Activation offload for the critic (the same memory-for-compute trade as on the actor side). |
| critic.ppo_mini_batch_size | Critic mini-batch size. |
| critic.ppo_micro_batch_size_per_gpu | Critic per-GPU micro-batch. |
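
Because the critic only matters for GAE, a launch script can add the critic.* overrides conditionally. The sketch below is one way to do that; critic_overrides_for is a hypothetical helper, and the batch sizes are copied from the example script on this page. The shipped script passes the critic keys unconditionally, so treat omitting them as an optional tidy-up rather than a requirement.

```shell
# Hypothetical helper (not part of AgentFly): print the critic overrides an estimator needs.
critic_overrides_for() {
    # $1 = adv_estimator, $2 = model path
    case "$1" in
        gae)
            # GAE trains a value model, so the critic keys matter.
            echo "critic.model.path=$2 critic.ppo_mini_batch_size=32 critic.ppo_micro_batch_size_per_gpu=2"
            ;;
        grpo|rloo|reinforce_plus_plus|remax)
            # Critic-free estimators: nothing to add.
            ;;
        *)
            echo "unknown adv_estimator: $1" >&2
            return 1
            ;;
    esac
}

adv_estimator=grpo
model=Qwen/Qwen2.5-3B-Instruct
critic_overrides=$(critic_overrides_for "$adv_estimator" "$model")

# Dry run. $critic_overrides is unquoted on purpose so it splits into separate arguments.
echo python3 -m agentfly.cli train "algorithm.adv_estimator=$adv_estimator" $critic_overrides
```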

trainer.* — Training Loop & Logging

| Key | Description |
| --- | --- |
| trainer.project_name | WandB / logger project. |
| trainer.experiment_name | WandB / logger run name. |
| trainer.logger | List of loggers, e.g. ['console','wandb']. |
| trainer.n_gpus_per_node | GPUs per node. |
| trainer.nnodes | Number of nodes. |
| trainer.total_training_steps | Total outer training steps. |
| trainer.save_freq | Checkpoint every N steps. |
| trainer.test_freq | Run validation every N steps. |
| trainer.val_before_train | Run validation once before training begins (sanity check). |
| trainer.critic_warmup | Number of critic-only warmup steps (only relevant for GAE). |
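
The topology keys combine with the rollout tensor-parallel degree. The check below is a sketch under the assumption that each rollout engine instance spans tensor_model_parallel_size GPUs, so the total GPU count should be a multiple of it. The values match the example script on this page.

```shell
n_gpus_per_node=8   # trainer.n_gpus_per_node
nnodes=1            # trainer.nnodes
tp_size=1           # actor_rollout_ref.rollout.tensor_model_parallel_size

# Total number of GPUs the run expects.
world_size=$((n_gpus_per_node * nnodes))

if [ $((world_size % tp_size)) -ne 0 ]; then
    echo "tensor_model_parallel_size=$tp_size does not divide $world_size GPUs" >&2
else
    # Assumed: one rollout engine instance per group of tp_size GPUs.
    rollout_replicas=$((world_size / tp_size))
    echo "world_size=$world_size rollout_replicas=$rollout_replicas"
fi
```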

Canonical Example

The WebShop training script wires the above together as a single agentfly.cli train invocation. This is a clean Hydra-only example (Ray cluster setup is handled separately by your launcher):

model=Qwen/Qwen2.5-3B-Instruct


system_prompt="You are an autonomous shopping agent operating in the WebShop web environment. Your goal is to purchase exactly one product that best matches the user's natural-language instruction.
You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, you can do one action by <action> action </action>. If you think you have finished the task, summarize what you have done.

## WebShop page types (you will infer from observation)

- WebShop states are webpages of four types: Search page, Results page, Item page, Item-detail page.

## Available actions (the ONLY actions you may take)

### Search action (only available on the Search page):

- search[QUERY]: Search -> Results
- QUERY should be a short, high-signal shopping query.

### Choose action (only available on non-search pages; you must choose one of the clickable buttons shown in the page):

- choose[Back to search]: * -> Search
- choose[Prev page] / choose[Next page]: Results -> Results
- choose[PRODUCT_TITLE]: Results -> Item
- choose[OPTION]: Item -> Item (e.g., size/color/pack)
- choose[Desc] / choose[Overview]: Item -> Item-detail
- choose[Previous]: Item-detail -> Item
- choose[Buy] (or choose[Buy Now] if that is the button text shown): Item -> Episode End

Remember that you must put your action inside <action> and </action> tags."

template=action-agent
lr=5e-7
max_model_len=16384
max_new_tokens_per_turn=384
val_batch_size=512
batch_size=64
num_chains=8
# full on-policy
mini_batch_size=$((batch_size * num_chains))
kl_coef=0.001
train_dataset="./data/rlhf/webshop/webshop_goals_train.json"
eval_dataset="./data/rlhf/webshop/webshop_goals_val.json"
# adv_estimator=rloo
# adv_estimator=reinforce_plus_plus
# adv_estimator=remax
adv_estimator=grpo
# adv_estimator=gae

agent_type=action
tools="[webshop_browser_action]"
reward_name="webshop_reward"

entropy_coeff=0.001
kl_loss_type=mse
max_turns=15
lr_warmup_steps_ratio=0.01
total_training_steps=200

model_base_name=$(basename "$model")
project_name="Open"
experiment_name="webshop_${model_base_name}_${adv_estimator}_test"

python -m agentfly.cli train \
    algorithm.adv_estimator=$adv_estimator \
    data.train_files=${train_dataset} \
    data.val_files=${eval_dataset} \
    data.val_batch_size=$val_batch_size \
    data.train_batch_size=$batch_size \
    agent.use_agent=True \
    agent.init_config.agent_type=$agent_type \
    "agent.init_config.system_prompt=\"${system_prompt}\"" \
    agent.init_config.max_model_len=$max_model_len \
    agent.init_config.tools=$tools \
    agent.init_config.template=$template \
    agent.init_config.model_name_or_path=$model \
    agent.init_config.reward_name=$reward_name \
    agent.run_config.generation_config.max_tokens=$max_new_tokens_per_turn \
    agent.run_config.max_turns=${max_turns} \
    agent.run_config.num_chains=$num_chains \
    actor_rollout_ref.actor.optim.lr=$lr \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.model.path=${model} \
    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=${lr_warmup_steps_ratio} \
    actor_rollout_ref.actor.ppo_mini_batch_size=$mini_batch_size \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=$kl_coef \
    actor_rollout_ref.actor.kl_loss_type=$kl_loss_type \
    actor_rollout_ref.actor.entropy_coeff=$entropy_coeff \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.model.path=$model \
    critic.ppo_mini_batch_size=32 \
    critic.ppo_micro_batch_size_per_gpu=2 \
    algorithm.kl_ctrl.kl_coef=$kl_coef \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name=$project_name \
    trainer.experiment_name=$experiment_name \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=300 \
    trainer.total_training_steps=$total_training_steps \
    trainer.val_before_train=False
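
The system_prompt override in the script is quoted twice: once for the shell and once for the override value. The sketch below uses a shortened stand-in prompt to show what each layer does; count_args is a hypothetical helper used only for the demonstration.

```shell
# A shortened, two-line stand-in for the real system prompt.
system_prompt="You are an agent.
Use <action> and </action> tags."

# Outer double quotes: the shell passes the whole override as ONE argument, newlines included.
# Escaped inner quotes: they survive into the argument, so the value arrives as a quoted string.
arg="agent.init_config.system_prompt=\"${system_prompt}\""

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

quoted=$(count_args "$arg")    # expansion kept in double quotes
unquoted=$(count_args $arg)    # expansion left bare, so the shell splits on whitespace

echo "quoted=$quoted unquoted=$unquoted"
```

Leaving the expansion unquoted would split the prompt into many arguments, and in the real WebShop prompt the * and [...] characters would also be exposed to filename expansion.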

Where to Look When You Need More

  • Full Hydra schema: verl/verl/trainer/config/ppo_trainer.yaml. Everything not listed above is in there with defaults.
  • Generated schema: verl/verl/trainer/config/_generated_ppo_trainer.yaml (the materialized config tree, useful for grepping).
  • Per-run config dump: run with agent.use_agent=True ... +hydra.job.chdir=False trainer.print_config=True (when supported) to see the exact resolved config for a run.
  • Other shipped scripts: examples/train_scripts/{webshop,alfworld,scienceworld,search,swe,vqa,...}/ show domain-specific overrides (different reward names, tools, batch sizes, sequence-parallel settings).
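
Grepping the generated schema for a leaf key is a quick way to find where it sits in the tree and what its default is. The sketch below assumes it is run from a directory containing the verl checkout at the path listed above; find_key is a hypothetical helper, not a verl or AgentFly command.

```shell
# Hypothetical helper: list the lines of a YAML config that define a given leaf key.
find_key() {
    # $1 = config file, $2 = leaf key name (e.g. kl_loss_coef)
    grep -n -- "$2:" "$1"
}

schema="verl/verl/trainer/config/_generated_ppo_trainer.yaml"
if [ -f "$schema" ]; then
    find_key "$schema" kl_loss_coef
else
    echo "schema not found at $schema; adjust the path to your checkout" >&2
fi
```

The line numbers in the output help locate the enclosing block, which in turn gives the full dotted path to use as an override.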