LLM Backends¶
Overview¶
AgentFly supports multiple LLM backends for text generation, each with their own configuration options. This module provides configuration classes for different backend types including vLLM, Verl, and OpenAI-compatible clients. Among them, Verl backend is designed for internal training usage. The Verl backend is the core design that decouples agent system and rl training.
Configuration Classes¶
Async VLLM Backend¶
Configuration for asynchronous vLLM backend with engine arguments:
AsyncVLLMConfig(engine_args: Optional[AsyncEngineArgs] = None, **kwargs)
dataclass
¶
Configuration for Async VLLM backend with engine arguments. Arguments are the same as vLLM's arguments, which can be found at https://docs.vllm.ai/en/latest/configuration/engine_args.html. Here listed some important arguments:
Attributes:
-
gpu_memory_utilization(float) –The fraction of GPU memory to be used for the model executor, which can range from 0 to 1.
-
max_model_len(int) –Model context length (prompt and output). If unspecified, will be automatically derived from the model config.
-
rope_scaling(dict) –Rope scaling. For example, {"rope_type":"dynamic","factor":2.0}.
-
trust_remote_code(bool) –Whether to trust remote code when loading models.
-
pipeline_parallel_size(int) –Pipeline parallel size.
-
data_parallel_size(int) –Data parallel size.
-
tensor_parallel_size(int) –Tensor parallel size.
Parameters:
-
engine_args(Optional[AsyncEngineArgs], default:None) –Optional AsyncEngineArgs instance. If provided, kwargs are ignored.
-
**kwargs–Arguments to pass to AsyncEngineArgs if engine_args is not provided.
Functions¶
Async Verl Backend¶
Configuration for asynchronous Verl backend:
AsyncVerlConfig()
dataclass
¶
Configuration for Async Verl backend.
Attributes:
Client Backend¶
Configuration for OpenAI-compatible client backends:
ClientConfig(base_url: str = 'http://localhost:8000/v1', max_requests_per_minute: int = 100, timeout: int = 3600, api_key: str = 'EMPTY')
dataclass
¶
Configuration for Client backend (OpenAI-compatible)
This configuration class provides settings for connecting to OpenAI-compatible API endpoints, such as local models served via vLLM, Ollama, or other compatible servers.
Attributes:
-
base_url(str) –The base URL for the API endpoint. Defaults to localhost:8000.
-
max_requests_per_minute(int) –Rate limiting for API requests. Defaults to 100.
-
timeout(int) –Request timeout in seconds. Defaults to 600 (10 minutes).
-
api_key(str) –API key for authentication. Defaults to "EMPTY" for local servers.
-
max_new_tokens(str) –Maximum number of tokens to generate. Defaults to 1024.
-
temperature(str) –Sampling temperature for text generation. Defaults to 1.0.
Backend Implementations¶
Base Backend¶
Abstract base class for all LLM backends:
LLMBackend
¶
Base class for LLM backends.
This abstract base class provides a unified interface for different LLM implementations. All backend implementations must inherit from this class and implement the required methods.
Attributes:
-
config–Configuration dictionary containing backend-specific parameters.
Functions¶
apply_chat_template(messages_list: List[List[Dict]], template: str, add_generation_prompt: bool = True, tools: List[Dict] = None) -> List[str]
¶
Apply chat template to messages list
prepare()
¶
Prepare the backend
generate(messages_list: str, **kwargs) -> str
¶
Generate text from prompt
generate_streaming(messages_list: List[List[Dict]], streaming_callback: Optional[Callable] = None, **kwargs) -> AsyncGenerator[str, None]
async
¶
Generate text with streaming support
preprocess()
¶
Preprocess the backend
postprocess()
¶
Postprocess the backend
Async VLLM Backend¶
Asynchronous vLLM implementation for high-performance model inference:
AsyncVLLMBackend(model_name_or_path: str, template: str, **kwargs)
¶
Bases: LLMBackend
Asynchronous vLLM implementation for high-performance model inference.
This backend uses the vLLM AsyncLLMEngine for asynchronous inference, providing better resource utilization and scalability for concurrent requests.
Parameters:
-
model_name_or_path(str) –Name or path of the pre-trained model to load.
-
template(str) –Chat template to use for formatting messages.
-
temperature(float) –Sampling temperature for text generation. Defaults to 1.0.
-
max_new_tokens(int) –Maximum number of new tokens to generate. Defaults to 1024.
-
**kwargs–Additional configuration parameters that will be passed to AsyncEngineArgs.
Async Verl Backend¶
Asynchronous Verl implementation for distributed model inference:
AsyncVerlBackend(llm_engine, model_name_or_path: str, template: str)
¶
Bases: LLMBackend
Asynchronous Verl implementation for distributed model inference.
This backend uses the Verl framework for distributed and asynchronous model inference. Verl provides capabilities for running models across multiple workers and handling complex inference pipelines.
Parameters:
-
llm_engine–Verl engine instance for distributed inference.
-
model_name_or_path(str) –Name or path of the pre-trained model to load.
-
template(str) –Chat template to use for formatting messages.
-
**kwargs–Additional configuration parameters.
Client Backend¶
OpenAI-compatible client backend for remote API inference:
ClientBackend(model_name_or_path: str, base_url: str = 'http://localhost:8000/v1', max_requests_per_minute: int = 100, timeout: int = 3600, api_key: str = 'EMPTY')
¶
Bases: LLMBackend
OpenAI-compatible and Google Gemini client backend for remote API inference.
This backend provides a thin wrapper around OpenAI-compatible chat APIs and Google Gemini API, supporting both synchronous and asynchronous operations. It includes built-in rate limiting and retry mechanisms for reliable API communication.
Parameters:
-
model_name_or_path(str) –Name of the model to use for inference.
-
template(str) –Chat template to use for formatting messages.
-
base_url(str, default:'http://localhost:8000/v1') –Base URL for the API endpoint. Defaults to localhost:8000.
-
max_requests_per_minute(int, default:100) –Rate limiting for API requests. Defaults to 100.
-
timeout(int, default:3600) –Request timeout in seconds. Defaults to 600.
-
api_key(str, default:'EMPTY') –API key for authentication. Defaults to "EMPTY" for local servers.
-
max_new_tokens(int) –Maximum number of new tokens to generate. Defaults to 1024.
-
**kwargs–Additional configuration parameters.
Functions¶
generate_streaming(messages: List[List[Dict]], **kwargs) -> AsyncGenerator[str, None]
async
¶
This is actually the not streaming. We simply return the generated text.
generate(messages: List[List[Dict]] | List[Dict], return_dict: bool = False, **kwargs) -> List[str] | asyncio.Task
¶
• Pass a list of messages → single completion. • Pass a list of list of messages → batch completions (max parallelism).
Returns:
-
List[str] | Task–• In an async context → awaitable Task (so caller writes
await backend.generate(...)). -
List[str] | Task–• In a sync context → real list of strings (blocks until done).
Usage Examples¶
Backends are designed to work together with agents. Here are examples showing how to configure different backends when creating agents:
Async VLLM Backend¶
from agentfly.agents import HFAgent
from agentfly.tools import calculator
from agentfly.rewards import math_equal_reward_tool
from agentfly.utils.llm_backends import AsyncVLLMConfig
agent = HFAgent(
model_name_or_path="Qwen/Qwen2.5-3B-Instruct",
tools=[calculator],
reward_fn=math_equal_reward_tool,
template="qwen2.5",
backend_config=AsyncVLLMConfig(
pipeline_parallel_size=2,
data_parallel_size=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.8
)
)
Client Backend (OpenAI-compatible)¶
from agentfly.agents import HFAgent
from agentfly.tools import calculator
from agentfly.rewards import math_equal_reward_tool
from agentfly.utils.llm_backends import ClientConfig
agent = HFAgent(
model_name_or_path="Qwen/Qwen2.5-3B-Instruct",
tools=[calculator],
reward_fn=math_equal_reward_tool,
template="qwen2.5",
backend_config=ClientConfig(
base_url="http://localhost:8000/v1",
api_key="your-api-key",
max_requests_per_minute=200,
timeout=300,
temperature=0.7,
max_new_tokens=1024
)
)