Training Config Reference (Hydra)

AgentFly trains via verl's PPO trainer (python3 -m agentfly.cli train ...), which is Hydra-driven. Every argument of the form namespace.path=value in your training script overrides a node in the underlying Hydra config tree.
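
As a minimal sketch of the override mechanism, the snippet below collects a few namespace.path=value pairs in a bash array and prints them. The values are illustrative, and the launch line is commented out so the snippet runs without AgentFly installed.

```shell
#!/usr/bin/env bash
# Collect Hydra overrides in an array so each one stays a single argument.
lr=5e-7
experiment_name="webshop_test"   # illustrative value

overrides=(
    "actor_rollout_ref.actor.optim.lr=${lr}"
    "trainer.experiment_name=${experiment_name}"
    "trainer.nnodes=1"
)

# Print one override per line to inspect what would be passed.
printf '%s\n' "${overrides[@]}"

# Launch (commented out so the sketch runs standalone):
# python3 -m agentfly.cli train "${overrides[@]}"
```

Keeping overrides in an array makes it easy to append namespace-specific groups conditionally before the final invocation.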

The full schema lives upstream in verl: verl/verl/trainer/config/ppo_trainer.yaml. This page lists the keys you actually need to set in practice, organized by namespace, with notes on what each one controls. For the canonical end-to-end example, see First Training.

Namespaces at a Glance

Namespace	Owns
`data.*`	Training/validation file paths and batch sizes
`agent.*`	Agent type, tools, reward, generation, rollout shape
`algorithm.*`	RL algorithm (advantage estimator, KL coefficient)
`actor_rollout_ref.*`	Actor model, rollout engine, reference model (FSDP, vLLM, sequence parallel)
`critic.*`	Critic model (only used when `adv_estimator=gae`)
`trainer.*`	Logging, checkpointing, GPU/node topology, total steps

`data.*`: Data Loading

Key	Description
`data.train_files`	Path to the training JSON file. See Data Preparation.
`data.val_files`	Path to the validation JSON file.
`data.train_batch_size`	Number of prompts per training step.
`data.val_batch_size`	Number of prompts per validation step.

`agent.*`: Agent / Rollout

These are the keys defined by AgentFly itself (the rest are inherited from verl).

Key	Description
`agent.use_agent`	`True` to enable the agent rollout path. Always `True` for AgentFly training.
`agent.init_config.agent_type`	Agent type, such as `hf`, `react`, `code`, `gui`, or `action`. Selects the agent class via `AutoAgent.from_config`.
`agent.init_config.model_name_or_path`	HF model name or local path.
`agent.init_config.template`	Chat template name (e.g. `qwen2.5`, `qwen2.5-vl`, `qwen2.5-no-system-tool`).
`agent.init_config.tools`	List of tool names available to the agent, e.g. `[calculator]` or `[code_interpreter]`.
`agent.init_config.reward_name`	Name of the reward function registered via `@reward(name=...)`.
`agent.init_config.backend`	Generation backend during training (typically `async_verl`).
`agent.init_config.tool_parser_name`	Optional tool-call parser identifier (e.g. `hermes`).
`agent.init_config.max_model_len`	Max sequence length the model engine should support.
`agent.max_turns`	Max number of generation/tool turns per chain.
`agent.num_chains`	Number of parallel chains (trajectories) per prompt. Higher values give GRPO/RLOO more samples per prompt.
`agent.train_on_last_turn`	If `True`, compute the loss only on the last assistant turn. Useful for stability on long rollouts.
`agent.generation_config.max_tokens`	Max tokens generated per turn.
`agent.run_config.context_config.resource_backend`	Where containers are allocated: `local` or a cluster backend.
`agent.run_config.max_concurrent_chains`	Cap on chains running concurrently (memory protection). `null` for no extra cap.

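Some scripts use agent.run_config.max_turns and agent.run_config.num_chains. These are equivalent to agent.max_turns / agent.num_chains and reach the same Hydra node. The canonical example below uses these run_config forms and also sets the per-turn token limit as agent.run_config.generation_config.max_tokens.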

`algorithm.*`: RL Algorithm

Key	Description
`algorithm.adv_estimator`	Advantage estimator: `grpo`, `rloo`, `reinforce_plus_plus`, `remax`, or `gae`. `grpo` is the most common for tool-using agents.
`algorithm.kl_ctrl.kl_coef`	KL penalty coefficient against the reference model. Common values: `0.0` to `0.001`.

`actor_rollout_ref.*`: Actor / Rollout / Reference

This namespace mirrors verl's grouping: the same set of model-side parameters is shared between the actor (the policy being trained), the rollout worker (which generates trajectories), and the reference model (used for KL).

Actor (training)

Key	Description
`actor_rollout_ref.actor.optim.lr`	Learning rate.
`actor_rollout_ref.actor.optim.lr_warmup_steps_ratio`	Fraction of total training steps used for learning-rate warmup.
`actor_rollout_ref.actor.ppo_mini_batch_size`	Mini-batch size for PPO updates within one outer step.
`actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`	Per-GPU micro-batch size. Lower it to reduce peak memory.
`actor_rollout_ref.actor.use_kl_loss`	If `True`, add a KL term to the actor loss instead of (or alongside) KL-based reward shaping.
`actor_rollout_ref.actor.kl_loss_coef`	Coefficient for the in-loss KL term.
`actor_rollout_ref.actor.kl_loss_type`	Form of the KL term, e.g. `mse` or `kl`.
`actor_rollout_ref.actor.entropy_coeff`	Entropy regularization coefficient. Small values (≤ 1e-3) help stabilize tool-calling agents.
`actor_rollout_ref.actor.fsdp_config.param_offload`	Offload parameters to CPU. Often necessary to fit large models on small clusters.
`actor_rollout_ref.actor.fsdp_config.optimizer_offload`	Offload optimizer state to CPU.
`actor_rollout_ref.actor.ulysses_sequence_parallel_size`	Sequence-parallel degree (Ulysses). Useful for very long contexts.
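
The batch-size keys interact, and a little shell arithmetic makes the relationship concrete. The sketch below assumes that ppo_mini_batch_size is counted in trajectories, which is what the canonical example's "full on-policy" setting of mini_batch_size=$((batch_size * num_chains)) suggests. The value 256 is hypothetical and chosen only to show a case with more than one update per step.

```shell
#!/usr/bin/env bash
batch_size=64        # data.train_batch_size (prompts per step)
num_chains=8         # agent.num_chains (trajectories per prompt)
mini_batch_size=256  # hypothetical actor_rollout_ref.actor.ppo_mini_batch_size

# Each prompt spawns num_chains trajectories.
trajectories=$((batch_size * num_chains))

# Assumption: one PPO update per mini-batch of trajectories.
updates_per_step=$((trajectories / mini_batch_size))

echo "trajectories per step: ${trajectories}"
echo "PPO updates per step:  ${updates_per_step}"
```

Under the same assumption, setting mini_batch_size equal to batch_size * num_chains gives a single update per step, so every gradient step uses freshly sampled data.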

Model

Key	Description
`actor_rollout_ref.model.path`	Model path. Should match `agent.init_config.model_name_or_path`.
`actor_rollout_ref.model.use_remove_padding`	Whether to pack sequences to remove padding. Disable if it conflicts with your masking logic.
`actor_rollout_ref.model.enable_gradient_checkpointing`	Recompute activations during the backward pass, trading extra compute for lower memory.
`actor_rollout_ref.model.enable_activation_offload`	Offload activations as well, a more aggressive memory-for-compute trade than gradient checkpointing.

Rollout (inference during training)

Key	Description
`actor_rollout_ref.rollout.name`	Rollout engine, typically `vllm`.
`actor_rollout_ref.rollout.gpu_memory_utilization`	Fraction of GPU memory the rollout engine may use.
`actor_rollout_ref.rollout.tensor_model_parallel_size`	TP degree for the rollout engine.
`actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu`	Per-GPU batch when re-computing log-probs over rollout outputs.
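
A quick pre-flight check can catch topology mismatches before launch. The sketch below assumes, as is typical for tensor-parallel inference, that the total GPU count (trainer.n_gpus_per_node × trainer.nnodes) should be divisible by tensor_model_parallel_size. Both that constraint and the replica arithmetic are assumptions rather than something this page states, and the TP value is hypothetical.

```shell
#!/usr/bin/env bash
n_gpus_per_node=8   # trainer.n_gpus_per_node
nnodes=1            # trainer.nnodes
tp_size=2           # hypothetical rollout.tensor_model_parallel_size

total_gpus=$((n_gpus_per_node * nnodes))

# Assumption: each rollout replica spans tp_size GPUs, so the GPU
# count must divide evenly.
if (( total_gpus % tp_size != 0 )); then
    echo "error: ${total_gpus} GPUs not divisible by TP size ${tp_size}" >&2
    status=1
else
    rollout_replicas=$((total_gpus / tp_size))
    echo "${rollout_replicas} rollout replicas of ${tp_size} GPU(s) each"
    status=0
fi
```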

Reference model

Key	Description
`actor_rollout_ref.ref.fsdp_config.param_offload`	Offload reference params (almost always `True`).
`actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu`	Per-GPU batch for reference log-probs.
`actor_rollout_ref.ref.ulysses_sequence_parallel_size`	Sequence-parallel degree on the reference.

`critic.*`: Critic Model

Only relevant when algorithm.adv_estimator=gae (GRPO and similar are critic-free).

Key	Description
`critic.model.path`	Critic model path (often the same as the actor).
`critic.model.enable_activation_offload`	Activation offloading for the critic (saves memory at some compute cost).
`critic.ppo_mini_batch_size`	Critic mini-batch size.
`critic.ppo_micro_batch_size_per_gpu`	Critic per-GPU micro-batch.
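
Since the critic only matters for GAE, a training script can add the critic overrides conditionally. The following is a sketch of that pattern, not how the shipped scripts are written. The array name is hypothetical and the values are copied from the canonical example.

```shell
#!/usr/bin/env bash
adv_estimator=grpo
model=Qwen/Qwen2.5-3B-Instruct

# Only build critic overrides when the estimator actually uses a critic.
critic_overrides=()
if [ "$adv_estimator" = "gae" ]; then
    critic_overrides+=(
        "critic.model.path=${model}"
        "critic.ppo_mini_batch_size=32"
        "critic.ppo_micro_batch_size_per_gpu=2"
    )
fi

echo "critic overrides: ${#critic_overrides[@]}"
```

With adv_estimator=grpo the array stays empty, and switching to gae adds the three critic keys.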

`trainer.*`: Training Loop & Logging

Key	Description
`trainer.project_name`	WandB / logger project.
`trainer.experiment_name`	WandB / logger run name.
`trainer.logger`	List of loggers, e.g. `['console','wandb']`.
`trainer.n_gpus_per_node`	GPUs per node.
`trainer.nnodes`	Number of nodes.
`trainer.total_training_steps`	Total outer training steps.
`trainer.save_freq`	Checkpoint every N steps.
`trainer.test_freq`	Run validation every N steps.
`trainer.val_before_train`	Run validation once before training begins (sanity check).
`trainer.critic_warmup`	Number of critic-only warmup steps (only relevant for GAE).
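
To see how the frequency keys interact with the step budget, the sketch below lists which steps would trigger a checkpoint or a validation pass. It assumes a simple "every N steps" rule (step is a multiple of N). The trainer may apply extra triggers, such as one on the final step, that this sketch does not model. The helper name steps_hitting is hypothetical.

```shell
#!/usr/bin/env bash
total_training_steps=200   # trainer.total_training_steps
save_freq=100              # trainer.save_freq
test_freq=300              # trainer.test_freq

# Print the steps in 1..total that are multiples of freq.
steps_hitting() {
    local freq=$1 total=$2 step out=""
    (( freq > 0 )) || return 0
    for ((step = freq; step <= total; step += freq)); do
        out+="${step} "
    done
    echo "${out% }"
}

checkpoint_steps=$(steps_hitting "$save_freq" "$total_training_steps")
validation_steps=$(steps_hitting "$test_freq" "$total_training_steps")

echo "checkpoints at: ${checkpoint_steps:-none}"
echo "validation at:  ${validation_steps:-none}"
```

With these values, which match the canonical example, the periodic rule alone schedules no validation because test_freq exceeds total_training_steps.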

Canonical Example

The WebShop training script wires the above together as a single agentfly.cli train invocation. This is a clean Hydra-only example (Ray cluster setup is handled separately by your launcher). It uses adv_estimator=grpo, so the critic.* overrides it passes only matter if you switch to gae:

model=Qwen/Qwen2.5-3B-Instruct


system_prompt="You are an autonomous shopping agent operating in the WebShop web environment. Your goal is to purchase exactly one product that best matches the user's natural-language instruction.
You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, you can do one action by <action> action </action>. If you think you have finished the task, summarize what you have done.

## WebShop page types (you will infer from observation)

- WebShop states are webpages of four types: Search page, Results page, Item page, Item-detail page.

## Available actions (the ONLY actions you may take)

### Search action (only available on the Search page):

- search[QUERY]: Search -> Results
- QUERY should be a short, high-signal shopping query.

### Choose action (only available on non-search pages; you must choose one of the clickable buttons shown in the page):

- choose[Back to search]: * -> Search
- choose[Prev page] / choose[Next page]: Results -> Results
- choose[PRODUCT_TITLE]: Results -> Item
- choose[OPTION]: Item -> Item (e.g., size/color/pack)
- choose[Desc] / choose[Overview]: Item -> Item-detail
- choose[Previous]: Item-detail -> Item
- choose[Buy] (or choose[Buy Now] if that is the button text shown): Item -> Episode End

Remember that you must put your action inside <action> and </action> tags."

template=action-agent
lr=5e-7
max_model_len=16384
max_new_tokens_per_turn=384
val_batch_size=512
batch_size=64
num_chains=8
# full on-policy
mini_batch_size=$((batch_size * num_chains))
kl_coef=0.001
train_dataset="./data/rlhf/webshop/webshop_goals_train.json"
eval_dataset="./data/rlhf/webshop/webshop_goals_val.json"
# adv_estimator=rloo
# adv_estimator=reinforce_plus_plus
# adv_estimator=remax
adv_estimator=grpo
# adv_estimator=gae

agent_type=action
tools="[webshop_browser_action]"
reward_name="webshop_reward"

entropy_coeff=0.001
kl_loss_type=mse
max_turns=15
lr_warmup_steps_ratio=0.01
total_training_steps=200

model_base_name=$(basename "$model")
project_name="Open"
experiment_name="webshop_${model_base_name}_${adv_estimator}_test"

python -m agentfly.cli train \
    algorithm.adv_estimator=$adv_estimator \
    data.train_files=${train_dataset} \
    data.val_files=${eval_dataset} \
    data.val_batch_size=$val_batch_size \
    data.train_batch_size=$batch_size \
    agent.use_agent=True \
    agent.init_config.agent_type=$agent_type \
    "agent.init_config.system_prompt=\"${system_prompt}\"" \
    agent.init_config.max_model_len=$max_model_len \
    "agent.init_config.tools=${tools}" \
    agent.init_config.template=$template \
    agent.init_config.model_name_or_path=$model \
    agent.init_config.reward_name=$reward_name \
    agent.run_config.generation_config.max_tokens=$max_new_tokens_per_turn \
    agent.run_config.max_turns=${max_turns} \
    agent.run_config.num_chains=$num_chains \
    actor_rollout_ref.actor.optim.lr=$lr \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.model.path=${model} \
    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=${lr_warmup_steps_ratio} \
    actor_rollout_ref.actor.ppo_mini_batch_size=$mini_batch_size \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=$kl_coef \
    actor_rollout_ref.actor.kl_loss_type=$kl_loss_type \
    actor_rollout_ref.actor.entropy_coeff=$entropy_coeff \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.model.path=$model \
    critic.ppo_mini_batch_size=32 \
    critic.ppo_micro_batch_size_per_gpu=2 \
    algorithm.kl_ctrl.kl_coef=$kl_coef \
    trainer.critic_warmup=0 \
    "trainer.logger=['console','wandb']" \
    trainer.project_name=$project_name \
    trainer.experiment_name=$experiment_name \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=300 \
    trainer.total_training_steps=$total_training_steps \
    trainer.val_before_train=False
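
The system-prompt line in the script above uses two layers of quoting. The sketch below, with a shortened hypothetical prompt, prints the single argument the trainer would receive. The outer shell quotes keep the whole override as one word, and the escaped inner quotes reach Hydra, whose override grammar treats a quoted value as a plain string, so brackets and commas in the prompt are not parsed as list syntax.

```shell
#!/usr/bin/env bash
# Shortened, hypothetical prompt containing brackets and a comma.
system_prompt="Search with search[QUERY], then finish with choose[Buy]."

# Outer quotes: one shell word. Inner escaped quotes: passed through to Hydra.
prompt_override="agent.init_config.system_prompt=\"${system_prompt}\""

printf '%s\n' "$prompt_override"
```

A prompt that itself contains double quotes would need additional escaping before being embedded this way.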

Where to Look When You Need More

Full Hydra schema — verl/verl/trainer/config/ppo_trainer.yaml. Everything not listed above is in there with defaults.
Generated schemas — verl/verl/trainer/config/_generated_ppo_trainer.yaml (the materialized config tree, useful for grepping).
Per-run config dump — Run with agent.use_agent=True ... +hydra.job.chdir=False trainer.print_config=True (when supported) to see the exact resolved config for a run.
Other shipped scripts — examples/train_scripts/{webshop,alfworld,scienceworld,search,swe,vqa,...}/ show domain-specific overrides (different reward names, tools, batch sizes, sequence-parallel settings).
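
Grepping the generated schema is often the fastest way to find a key's default. The helper below is a small sketch. The function name find_key is hypothetical, and the schema path is the one listed above, taken relative to wherever verl is checked out.

```shell
#!/usr/bin/env bash
# Print matching lines (with line numbers) for a key in a schema file.
find_key() {
    local key=$1 schema=$2
    if [ -f "$schema" ]; then
        grep -n -- "$key" "$schema"
    else
        echo "schema not found: $schema" >&2
        return 1
    fi
}

# Example lookup; '|| true' keeps the script going if the file is absent
# or the key has no match.
find_key "kl_loss" "verl/verl/trainer/config/_generated_ppo_trainer.yaml" || true
```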