Training Guide¶
Overview¶
AgentFly uses reinforcement learning (RL) to train agents, specifically leveraging Proximal Policy Optimization (PPO) with various advantage estimation methods. The training system is built on top of VERL (Volcano Engine Reinforcement Learning) and supports distributed training with Ray.
Training Command¶
The main training entry point is python -m agentfly.cli train, the command used in every example in this guide.
All configuration is done through Hydra command-line overrides, allowing flexible configuration without modifying code.
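As a rough illustration of what a dotted override means, the sketch below folds key=value strings into a nested dictionary. It is not Hydra's parser (Hydra has its own override grammar and type conversion); the apply_overrides helper is hypothetical and keeps every value as a string.

```python
def apply_overrides(overrides):
    """Fold dotted key=value strings into a nested dict (values stay strings)."""
    config = {}
    for item in overrides:
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = item.partition("=")
        parts = key.split(".")
        node = config
        for part in parts[:-1]:
            # Walk down the tree, creating intermediate sections as needed.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config


cfg = apply_overrides([
    "algorithm.adv_estimator=grpo",
    "agent.run_config.max_turns=3",
    "agent.run_config.num_chains=8",
])
print(cfg["agent"]["run_config"])
```

Each dot in a key descends one level, so agent.run_config.max_turns=3 sets max_turns inside run_config inside agent.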
Training Options¶
Agent Configuration¶
Configure the agent behavior and capabilities:
- agent.use_agent: Enable agent mode (default: True)
- agent.init_config.model_name_or_path: HuggingFace model identifier or path (e.g., Qwen/Qwen2.5-3B-Instruct)
- agent.init_config.template: Chat template name (e.g., qwen2.5, qwen2.5-vl)
- agent.init_config.agent_type: Agent type - hf, react, code, gui, etc.
- agent.init_config.tools: List of tools available to the agent (e.g., [calculator], [google_search,answer])
- agent.init_config.reward_name: Name of the reward function to use (e.g., math_equal_reward_tool, qa_f1_reward)
- agent.init_config.backend_config.backend: Backend for agent execution - async_verl (recommended) or others
- agent.run_config.max_turns: Maximum number of interaction turns per episode
- agent.run_config.num_chains: Number of parallel interaction chains per sample
- agent.run_config.max_concurrent_chains: Maximum concurrent running chains (null means no extra cap)
- agent.run_config.generation_config.max_tokens: Max tokens to generate per turn
- agent.run_config.generation_config.temperature: Sampling temperature used during rollout generation
- agent.run_config.context_config.resource_backend: Resource backend for tool/reward resources (e.g., local, ray)
Algorithm Configuration¶
Configure the RL algorithm:
- algorithm.adv_estimator: Advantage estimation method:
  - gae: Generalized Advantage Estimation (default)
  - grpo: Group Relative Policy Optimization
  - reinforce_plus_plus: REINFORCE++ estimator
  - rloo: REINFORCE Leave-One-Out
  - remax: REINFORCE with Max
- algorithm.gamma: Discount factor for future rewards (default: 1.0)
- algorithm.lam: GAE lambda parameter (default: 1.0)
- algorithm.use_kl_in_reward: Whether to include KL penalty in reward (default: False)
- algorithm.kl_penalty: KL estimation method - kl, abs, mse, low_var_kl, or full
- algorithm.kl_ctrl.type: KL control type - fixed or adaptive
- algorithm.kl_ctrl.kl_coef: KL penalty coefficient (default: 0.001)
- algorithm.kl_ctrl.target_kl: Target KL divergence for adaptive control (default: 0.1)
- algorithm.kl_ctrl.horizon: Horizon for adaptive controller (default: 10000)
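To show how kl_coef, target_kl, and horizon could interact under adaptive control, here is a minimal sketch of a proportional KL controller in the style commonly used in RLHF code. The update rule and the clipping range of ±0.2 are assumptions, not taken from the AgentFly or VERL source; a fixed controller would simply leave kl_coef unchanged.

```python
class AdaptiveKLController:
    """Proportional controller: nudges kl_coef toward keeping KL near target_kl."""

    def __init__(self, kl_coef=0.001, target_kl=0.1, horizon=10000):
        self.kl_coef = kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error between the observed KL and the target.
        error = current_kl / self.target_kl - 1.0
        # Clip so a single noisy measurement cannot swing the coefficient (assumed range).
        error = max(-0.2, min(0.2, error))
        # A larger horizon means slower adaptation per step.
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef


ctrl = AdaptiveKLController()
raised = ctrl.update(current_kl=0.2, n_steps=100)    # KL above target: coefficient grows
lowered = ctrl.update(current_kl=0.05, n_steps=100)  # KL below target: coefficient shrinks
print(raised, lowered)
```

With the defaults listed above, 100 steps at twice the target KL raise the coefficient by only 0.2%, which is why a short horizon is needed if faster correction is wanted.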
Trainer Configuration¶
Configure training execution:
- trainer.project_name: Project name for experiment tracking (e.g., AgentRL)
- trainer.experiment_name: Experiment name for run identification
- trainer.logger: Logging backends - ['console'], ['wandb'], or ['console','wandb']
- trainer.total_training_steps: Total number of training steps
- trainer.total_epochs: Total number of training epochs (alternative to steps)
- trainer.nnodes: Number of nodes for distributed training (default: 1)
- trainer.n_gpus_per_node: Number of GPUs per node (default: 8)
- trainer.save_freq: Frequency (in iterations) to save checkpoints (default: -1 for no saving)
- trainer.test_freq: Frequency (in iterations) to run validation (default: -1 for no validation)
- trainer.val_before_train: Whether to run validation before training starts (default: True)
- trainer.critic_warmup: Number of iterations to warm up critic before policy updates (default: 0)
- trainer.resume_mode: Resume mode - auto, disable, or resume_path
- trainer.resume_from_path: Path to resume from (when resume_mode=resume_path)
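The frequency options can be read as a simple schedule. The sketch below assumes an event fires whenever the 1-based step count is a multiple of its frequency and that a non-positive value disables it; the actual trainer may add extra triggers (for example on the final step), and scheduled_steps is a hypothetical helper.

```python
def scheduled_steps(total_steps, freq):
    """Return the 1-based steps at which an event fires; freq <= 0 disables it."""
    if freq <= 0:
        return []
    return [step for step in range(1, total_steps + 1) if step % freq == 0]


# Values taken from the basic math example: 100 steps, save_freq=50, test_freq=10.
save_steps = scheduled_steps(100, 50)
test_steps = scheduled_steps(100, 10)
print(save_steps)       # [50, 100]
print(len(test_steps))  # 10
```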
Data Configuration¶
Configure training and validation data:
- data.train_files: Path to training dataset file (JSON format)
- data.val_files: Path to validation dataset file (JSON format)
- data.train_batch_size: Training batch size
- data.val_batch_size: Validation batch size
Actor/Rollout/Reference Configuration¶
Configure the actor model, rollout engine, and reference model:
- actor_rollout_ref.actor.optim.lr: Learning rate for actor (e.g., 5e-7)
- actor_rollout_ref.actor.ppo_mini_batch_size: PPO mini-batch size
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: Micro-batch size per GPU
- actor_rollout_ref.actor.use_kl_loss: Whether to use KL loss in actor training (default: True)
- actor_rollout_ref.actor.kl_loss_coef: KL loss coefficient
- actor_rollout_ref.actor.kl_loss_type: KL loss type - mse, kl, etc.
- actor_rollout_ref.actor.entropy_coeff: Entropy coefficient for exploration (default: 0.001)
- actor_rollout_ref.actor.fsdp_config.param_offload: Enable parameter offloading for FSDP
- actor_rollout_ref.actor.fsdp_config.optimizer_offload: Enable optimizer offloading for FSDP
- actor_rollout_ref.model.path: Model path for actor
- actor_rollout_ref.model.enable_gradient_checkpointing: Enable gradient checkpointing (default: False)
- actor_rollout_ref.rollout.name: Rollout engine name - vllm (recommended)
- actor_rollout_ref.rollout.response_length: Maximum response length for rollouts
- actor_rollout_ref.rollout.tensor_model_parallel_size: Tensor parallelism size for rollout
- actor_rollout_ref.rollout.gpu_memory_utilization: GPU memory utilization for vLLM (default: 0.5)
- actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu: Micro-batch size for log probability computation
- actor_rollout_ref.ref.fsdp_config.param_offload: Parameter offloading for reference model
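The three batch-size settings nest inside one another. The sketch below assumes one common convention: ppo_mini_batch_size is a global size split evenly across GPUs, and each GPU processes its share in micro-batches with gradient accumulation. This is an assumption rather than confirmed AgentFly behavior, it ignores the multiplier from agent.run_config.num_chains, and batch_plan is a hypothetical helper.

```python
def batch_plan(train_batch_size, ppo_mini_batch_size, micro_batch_size_per_gpu, n_gpus):
    """Derive update counts from the batch settings, rejecting sizes that do not divide."""
    if train_batch_size % ppo_mini_batch_size:
        raise ValueError("train_batch_size must be a multiple of ppo_mini_batch_size")
    if ppo_mini_batch_size % n_gpus:
        raise ValueError("ppo_mini_batch_size must be divisible by the number of GPUs")
    per_gpu = ppo_mini_batch_size // n_gpus
    if per_gpu % micro_batch_size_per_gpu:
        raise ValueError("per-GPU mini-batch must be a multiple of the micro-batch size")
    return {
        # Optimizer steps taken for each rollout batch.
        "updates_per_batch": train_batch_size // ppo_mini_batch_size,
        # Samples each GPU handles per optimizer step.
        "mini_batch_per_gpu": per_gpu,
        # Forward/backward passes accumulated before each optimizer step.
        "grad_accum_steps": per_gpu // micro_batch_size_per_gpu,
    }


# Values from the ReAct example: batch 128, mini-batch 128, micro-batch 2, 8 GPUs.
plan = batch_plan(128, 128, 2, 8)
print(plan)
```

Under these assumptions, lowering ppo_micro_batch_size_per_gpu reduces peak memory at the cost of more accumulation steps, without changing the effective update size.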
Critic Configuration¶
Configure the value function (critic):
- critic.model.path: Model path for critic
- critic.ppo_mini_batch_size: PPO mini-batch size for critic
- critic.ppo_micro_batch_size_per_gpu: Micro-batch size per GPU for critic
Example Training Commands¶
Basic Math Agent Training¶
Train a simple agent with a calculator tool on GSM8K:
# Setup Ray cluster
ray start --head --node-ip-address="$(hostname --ip-address)" --port=6379 --num-cpus=192 --num-gpus=1
# Training command
python -m agentfly.cli train \
algorithm.adv_estimator=grpo \
data.train_files="./data/rlhf/math/gsm8k_train.json" \
data.val_files="./data/rlhf/math/gsm8k_test.json" \
data.train_batch_size=64 \
agent.use_agent=True \
agent.init_config.agent_type=hf \
agent.init_config.tools="[calculator]" \
agent.init_config.template=qwen2.5 \
agent.init_config.model_name_or_path=Qwen/Qwen2.5-3B-Instruct \
agent.init_config.reward_name="math_equal_reward_tool" \
agent.init_config.backend_config.backend=async_verl \
agent.run_config.max_turns=3 \
agent.run_config.num_chains=8 \
agent.run_config.generation_config.max_tokens=256 \
agent.run_config.context_config.resource_backend=local \
actor_rollout_ref.actor.optim.lr=5e-7 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=mse \
actor_rollout_ref.actor.entropy_coeff=0.001 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.response_length=256 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.model.path=Qwen/Qwen2.5-3B-Instruct \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['console','wandb'] \
trainer.project_name=AgentRL \
trainer.experiment_name=test_gsm8k \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=50 \
trainer.test_freq=10 \
trainer.total_training_steps=100
ReAct Agent with Search Tools¶
Train a ReAct agent with search capabilities:
python -m agentfly.cli train \
algorithm.adv_estimator=reinforce_plus_plus \
data.train_files="./data/rlhf/qa/train_random_8000.json" \
data.val_files="./data/rlhf/qa/dev_random_500.json" \
data.train_batch_size=128 \
data.val_batch_size=512 \
agent.use_agent=True \
agent.init_config.agent_type=react \
agent.init_config.tools="[google_search,answer]" \
agent.init_config.model_name_or_path=Qwen/Qwen2.5-3B-Instruct \
agent.init_config.reward_name="qa_f1_reward" \
agent.init_config.backend_config.backend=async_verl \
agent.run_config.max_turns=4 \
agent.run_config.num_chains=1 \
agent.run_config.generation_config.max_tokens=512 \
agent.run_config.context_config.resource_backend=local \
actor_rollout_ref.actor.optim.lr=5e-7 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=mse \
actor_rollout_ref.actor.entropy_coeff=0.001 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.response_length=512 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.model.path=Qwen/Qwen2.5-3B-Instruct \
critic.ppo_mini_batch_size=32 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['console','wandb'] \
trainer.project_name=AgentRL \
trainer.experiment_name=react_search_agent \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=50 \
trainer.test_freq=10 \
trainer.total_training_steps=200 \
trainer.val_before_train=True
Code Agent Training¶
Train an agent for code generation tasks:
python -m agentfly.cli train \
algorithm.adv_estimator=grpo \
data.train_files="./data/rlhf/code/train.json" \
data.val_files="./data/rlhf/code/val.json" \
data.train_batch_size=64 \
agent.use_agent=True \
agent.init_config.agent_type=code \
agent.init_config.tools="[python_executor]" \
agent.init_config.model_name_or_path=Qwen/Qwen2.5-3B-Instruct \
agent.init_config.reward_name="code_reward" \
agent.init_config.backend_config.backend=async_verl \
agent.run_config.max_turns=5 \
agent.run_config.num_chains=4 \
agent.run_config.generation_config.max_tokens=512 \
agent.run_config.context_config.resource_backend=local \
actor_rollout_ref.actor.optim.lr=4e-7 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=mse \
actor_rollout_ref.actor.entropy_coeff=0.001 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.response_length=512 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.model.path=Qwen/Qwen2.5-3B-Instruct \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['console','wandb'] \
trainer.project_name=AgentRL \
trainer.experiment_name=code_agent \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=50 \
trainer.test_freq=10 \
trainer.total_training_steps=300
Training Data Format¶
Training data should be in JSON format.
In each record, the question field is used to form input messages to the agent, while other fields (like answer) are passed to the reward function for evaluation.
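A minimal sketch of such a file, assuming it is a JSON array of objects that each carry a question and an answer field. The two records are toy data invented for illustration, and the exact schema AgentFly expects may include more fields.

```python
import json
import os
import tempfile

# Toy records (illustrative only): "question" feeds the agent, the rest feeds the reward.
records = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "A train covers 60 km in 1.5 hours. What is its average speed in km/h?",
     "answer": "40"},
]

path = os.path.join(tempfile.mkdtemp(), "toy_train.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

for record in loaded:
    prompt = record["question"]
    # Everything except the question would be handed to the reward function.
    reward_inputs = {k: v for k, v in record.items() if k != "question"}
    print(prompt, "->", reward_inputs)
```

A file written this way could then be passed through data.train_files or data.val_files.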
Distributed Training¶
For multi-node training, set up a Ray cluster across the nodes:
# On head node
ray start --head --node-ip-address=<head_ip> --port=6379
# On worker nodes
ray start --address=<head_ip>:6379
Then set trainer.nnodes and trainer.n_gpus_per_node accordingly.
Checkpointing and Resuming¶
- Auto-resume: Set trainer.resume_mode=auto to automatically resume from the latest checkpoint
- Manual resume: Set trainer.resume_mode=resume_path and trainer.resume_from_path=<checkpoint_path>
- Checkpoint frequency: Control with trainer.save_freq (saves every N iterations)
Monitoring Training¶
Training metrics are logged to:
- Console: Always enabled
- Weights & Biases: Enable with trainer.logger=['console','wandb'] and set the WANDB_API_KEY environment variable
Key metrics tracked:
- Reward statistics (mean, std, min, max)
- Policy loss
- Value loss
- KL divergence
- Entropy
- Learning rate
Tips and Best Practices¶
- Start Small: Begin with small models and batch sizes to verify setup
- Monitor KL Divergence: Keep KL divergence in check; adjust kl_coef if it grows too large
- Tune Learning Rate: Start with 5e-7 to 1e-6 and adjust based on training stability
- Batch Size: Balance between train_batch_size, ppo_mini_batch_size, and ppo_micro_batch_size_per_gpu
- Memory Management: Use gpu_memory_utilization and FSDP offloading for large models
- Advantage Estimator: grpo works well for most cases; gae for on-policy scenarios
- Validation: Set trainer.val_before_train=True to establish baseline before training
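To make the estimator choice more concrete, the sketch below contrasts the two textbook formulas. GRPO normalizes each reward against the other chains sampled for the same prompt, so it needs no critic; GAE combines per-step rewards with critic value estimates using gamma and lam. These are simplified versions written for illustration (population standard deviation, no masking or batching), not the AgentFly or VERL implementations.

```python
from statistics import mean, pstdev


def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one prompt's group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one trajectory (value after the end is 0)."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Four chains for one prompt, two of which earned the reward.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# A three-step trajectory with a terminal reward and a constant critic estimate.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5]))  # [0.5, 0.5, 0.5]
```

With gamma=1.0 and lam=1.0 (the defaults listed earlier), the GAE advantage at each step reduces to the remaining return minus the critic's estimate at that step.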