
TensorRT-LLM in Practice: A Field Guide to NVIDIA-Optimized LLM Serving

Ship faster LLM apps on NVIDIA: Step-by-step TensorRT-LLM guide with real code, quantization tips & vLLM/TGI comparisons for AI builders.

Tega Adeyemi

From HF checkpoints to blazing-fast GPU inference with trtllm-build, Triton, and OpenAI-compatible servers—plus real-world tips, pitfalls, and comparisons with vLLM and TGI.

1. Why TensorRT-LLM Is Worth Your Attention

If you’re trying to ship LLMs in production on NVIDIA GPUs, you probably live with at least one of these realities:

TensorRT-LLM is NVIDIA’s answer to “How do we squeeze every last useful token out of these GPUs without building our own inference engine?” It gives you:

In this guide, we won’t just repeat the README. We’ll walk through:

We’ll speak as a team (“we”) because this is exactly the sort of thing we’d hash out on a shared whiteboard.

2. The Mental Model: Build vs Serve

TensorRT-LLM has two main phases:

  1. Build phase (offline)
    • Take a model checkpoint (Hugging Face, Megatron, etc.)
    • Choose precision (FP16, FP8, INT4 AWQ, …) and parallelism
    • Compile to one or more TensorRT engines (.engine files) with trtllm-build
  2. Serve phase (online)
    • Load those engines into GPU memory
    • Accept requests via:
      • Python LLM API
      • trtllm-serve (OpenAI-compatible HTTP server)
      • Triton Inference Server backend

If you remember nothing else, remember this:

You pay the price once at build time so inference can be fast & predictable forever after.

3. When to Use TensorRT-LLM vs vLLM vs TGI

Let’s be honest: you probably already have vLLM or TGI somewhere. So where does TensorRT-LLM actually shine?

| Use case | TensorRT-LLM | vLLM | TGI |
| --- | --- | --- | --- |
| Max throughput / latency on NVIDIA GPUs | Excellent (TensorRT + kernel-level optimizations) | Very good (PagedAttention) | Good |
| Multi-GPU tensor parallelism | First-class | Supported | Supported (sharded serving) |
| Quantization (FP8 / INT8 / INT4 AWQ) | Built-in & hardware-aware | Partial (typically via external tooling) | Limited / evolving |
| Setup complexity | Higher (build step, drivers, CUDA) | Low | Medium |
| Hardware portability | NVIDIA-only | NVIDIA plus other backends (e.g., AMD ROCm) | NVIDIA plus other backends (e.g., AMD ROCm) |
| Integration with Triton & NVIDIA stack | Native | Via custom wrapper | Custom |

We reach for TensorRT-LLM when:

If we need something super-portable or experimental, vLLM usually wins on ergonomics.

4. End-to-End: From HF Checkpoint to Live Server

Let’s walk through a concrete workflow you can actually drop into a project.

⚠️ Assumption: you have a recent NVIDIA GPU, driver, CUDA, and are inside a compatible container or env. Always check the compatibility matrix in the official docs for exact CUDA / driver combinations.

4.1 Install TensorRT-LLM

We won’t hard-code a wheel version here because it’s tightly coupled to CUDA, TensorRT, and driver version. Instead, follow the install instructions for your platform:

In many NVIDIA containers, TensorRT-LLM is pre-installed. If not, use the docs to pick the right wheel or container.

4.2 Convert / Prepare Your Model

Let’s say we want meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face.

You typically:

  1. Download the checkpoint (e.g., via huggingface-cli or directly in a build container).
  2. Make sure the directory layout matches what trtllm-build expects (see “Model Preparation” in the docs).

We’ll assume:

/export/models/llama3-8b-hf/  # HF-style checkpoint
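
Before kicking off a long build, it can help to sanity-check that the download actually finished. The sketch below is a minimal pre-flight check, not something TensorRT-LLM ships: the function name and the list of expected files are our own assumptions about what a Hugging Face-style Llama checkpoint directory usually contains.

```python
from pathlib import Path

# Assumed layout of a Hugging Face-style checkpoint directory.
# This list is illustrative; it is not trtllm-build's own validation.
REQUIRED_FILES = ("config.json",)
TOKENIZER_FILES = ("tokenizer.json", "tokenizer.model")
WEIGHT_PATTERNS = ("*.safetensors", "*.bin")


def check_hf_checkpoint(path):
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(path)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    # At least one tokenizer file should be present.
    if not any((root / name).is_file() for name in TOKENIZER_FILES):
        problems.append("missing tokenizer file (tokenizer.json or tokenizer.model)")
    # At least one weight shard should be present.
    if not any(any(root.glob(pattern)) for pattern in WEIGHT_PATTERNS):
        problems.append("no weight files (*.safetensors or *.bin)")
    return problems


if __name__ == "__main__":
    for problem in check_hf_checkpoint("/export/models/llama3-8b-hf"):
        print("checkpoint problem:", problem)
```

Running this in CI before the build step turns a half-finished download into a one-line error instead of a failed build many minutes later.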

4.3 Build a TensorRT Engine with trtllm-build

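trtllm-build is the official CLI for turning a checkpoint into TensorRT engines. Note that --checkpoint_dir expects a checkpoint in TensorRT-LLM’s own format, not raw Hugging Face weights; you typically produce it from the HF directory with the model-specific convert_checkpoint.py script in the TensorRT-LLM examples, which is also where tensor parallelism is usually chosen. A minimal example:

trtllm-build \
  --checkpoint_dir /export/models/llama3-8b-ckpt \
  --output_dir /export/models/llama3-8b-trt
  # /export/models/llama3-8b-ckpt is a hypothetical path holding the
  # converted checkpoint. Add extra flags or a config file as described
  # in the official TensorRT-LLM docs for your version.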
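
If you drive builds from CI rather than by hand, a thin Python wrapper keeps the command reproducible. This is a sketch under our own assumptions: the helper name is hypothetical, the checkpoint path is a placeholder, and it only uses the two flags shown above, passing anything else through untouched.

```python
import shlex


def trtllm_build_argv(checkpoint_dir, output_dir, extra_flags=()):
    """Assemble the trtllm-build command line as an argv list."""
    argv = [
        "trtllm-build",
        "--checkpoint_dir", str(checkpoint_dir),
        "--output_dir", str(output_dir),
    ]
    # Version-specific flags are passed through verbatim by the caller.
    argv.extend(extra_flags)
    return argv


if __name__ == "__main__":
    argv = trtllm_build_argv("/path/to/checkpoint", "/export/models/llama3-8b-trt")
    # Log the exact command so a build can be reproduced later.
    print(shlex.join(argv))
    # To actually run it:
    #   import subprocess
    #   subprocess.run(argv, check=True)
```

Building the command as a list (instead of a shell string) avoids quoting problems when paths contain spaces.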

What this does:

Practical tips:

4.4 Local Python Inference with the LLM API

Once you have engines, the simplest way to test them is via the Python LLM class provided by TensorRT-LLM.

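from tensorrt_llm import BuildConfig, SamplingParams
# TensorRT-backend LLM class. The import path is version-dependent; older
# releases expose it as `from tensorrt_llm import LLM`.
from tensorrt_llm._tensorrt_engine import LLM

# A Hugging Face model ID, a local checkpoint dir, or a prebuilt engine dir.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

MAX_INPUT_LEN = 4096
MAX_OUTPUT_LEN = 256

# BuildConfig only carries engine limits; the model is passed to LLM itself.
build_config = BuildConfig(
    max_input_len=MAX_INPUT_LEN,
    max_seq_len=MAX_INPUT_LEN + MAX_OUTPUT_LEN,  # prompt + generated tokens
)

llm = LLM(model=MODEL_ID, build_config=build_config)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=MAX_OUTPUT_LEN,  # cap on generated tokens per request
)

prompt = "Explain TensorRT-LLM to a busy VP of Engineering in 3 bullet points."

# For a list of prompts, llm.generate blocks and returns one result per prompt.
outputs = llm.generate(
    [prompt],
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)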

A few notes:

4.5 OpenAI-Compatible HTTP Serving with trtllm-serve

For most teams, you don’t actually want to embed the LLM class into every service. You want a clean HTTP boundary.

TensorRT-LLM ships an OpenAI-compatible server, typically invoked as:

export MODEL_HANDLE="meta-llama/Meta-Llama-3-8B-Instruct"

trtllm-serve "$MODEL_HANDLE" \
  --max_batch_size 64 \
  --port 8000
  # Optional (version-dependent):
  # --trust_remote_code
  # --extra_llm_api_options /path/to/extra-llm-api-config.yml

This starts a server that accepts OpenAI-style /v1/chat/completions requests. A minimal client using the openai Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key",  # server may not enforce this in dev
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # should match the model the server was started with
    messages=[
        {"role": "user", "content": "Give me three use cases for TensorRT-LLM."}
    ],
    temperature=0.7,
)

print(resp.choices[0].message.content)
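
If you would rather not pull in an SDK (for example in a health check or a smoke test), the same call is just a JSON POST. The sketch below uses only the standard library; the helper names are ours, and it assumes the server from the previous step is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, temperature=0.7):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy-key",  # placeholder, as in the SDK example
        },
        method="POST",
    )


def extract_text(response_body):
    """Pull the assistant message out of a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    request = build_chat_request(
        "http://localhost:8000/v1",
        "meta-llama/Meta-Llama-3-8B-Instruct",
        "Give me three use cases for TensorRT-LLM.",
    )
    print(request.get_method(), request.full_url)
    # To actually send it (requires a running server):
    #   with urllib.request.urlopen(request, timeout=60) as resp:
    #       print(extract_text(resp.read()))
```

Seeing the raw payload also makes it easier to debug mismatches between what your client sends and what the server expects.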

Why this is nice:

Always double-check the exact trtllm-serve options in the docs; new flags are added as the project evolves.

4.6 Scaling Out with Triton (High-Level View)

For “we have many GPUs and SLAs” scenarios, Triton Inference Server becomes interesting:

A typical Triton setup looks like:

models/
  llama3-8b-trt/
    1/
      model.plan          # or engine files
    config.pbtxt          # Triton model config
  ...

The exact config.pbtxt fields change over releases (and differ for the TensorRT-LLM backend vs pure TensorRT), so rather than paste a potentially stale snippet, we strongly recommend copying from the official Triton + TensorRT-LLM examples in the docs/GitHub repo and adjusting:

Key parameters you’ll be setting:

If you’re not comfortable maintaining Triton configs, it might be better to start with trtllm-serve and revisit Triton once you hit scaling limits.

5. Building a Production-ish Service Around TensorRT-LLM

Let’s make this more concrete: imagine you want a simple internal microservice for your org.

We’ll use:

5.1 FastAPI Wrapper Around LLM

# app.py
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

MAX_INPUT_LEN = 4096
MAX_OUTPUT_LEN = 256

# BuildConfig only carries engine limits; the model is passed to LLM itself.
build_config = BuildConfig(
    max_input_len=MAX_INPUT_LEN,
    max_seq_len=MAX_INPUT_LEN + MAX_OUTPUT_LEN,  # prompt + generated tokens
)

# Build / load TensorRT-LLM engine(s) at process startup
llm = LLM(model=MODEL_ID, build_config=build_config)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    output: str

def _blocking_generate(req: GenerateRequest) -> str:
    sampling = SamplingParams(
        temperature=req.temperature,
        top_p=req.top_p,
        max_tokens=MAX_OUTPUT_LEN,
    )
    # For a list of prompts, llm.generate returns one result per prompt
    outputs = llm.generate(
        [req.prompt],
        sampling_params=sampling,
    )
    return outputs[0].outputs[0].text

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    # Offload blocking GPU work to a thread so the event loop stays responsive
    text = await asyncio.to_thread(_blocking_generate, req)
    return GenerateResponse(output=text)
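
One thing the wrapper above does not do is limit how many requests are in flight at once. A common way to add backpressure is an asyncio.Semaphore around the to_thread call. The sketch below shows that pattern in isolation: fake_blocking_generate stands in for the real _blocking_generate, and MAX_IN_FLIGHT is a hypothetical limit you would tune for your engine.

```python
import asyncio
import time

MAX_IN_FLIGHT = 4  # hypothetical cap on concurrent generate calls


def fake_blocking_generate(prompt):
    """Stand-in for the real blocking GPU call."""
    time.sleep(0.01)
    return prompt.upper()


async def bounded_generate(semaphore, prompt):
    # Wait for a free slot, then run the blocking call in a worker thread.
    async with semaphore:
        return await asyncio.to_thread(fake_blocking_generate, prompt)


async def main():
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    prompts = [f"prompt {i}" for i in range(10)]
    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(bounded_generate(semaphore, p) for p in prompts))


results = asyncio.run(main())
print(results[:2])
```

In the FastAPI app, the semaphore would live at module level and the endpoint would await bounded_generate instead of calling asyncio.to_thread directly.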

Implementation notes:

5.2 A Simple “Multi-Worker” Pattern with Ray (Optional)

If you want to run one engine replica per GPU (on a single node or across nodes), Ray can be a simple orchestrator.

This is an illustrative pattern, not a full recipe:

# ray_trtllm_workers.py
import ray

from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

@ray.remote(num_gpus=1)
class TRTLLMWorker:
    def __init__(self):
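        # BuildConfig only carries engine limits; the model goes to LLM itself.
        build_config = BuildConfig(
            max_input_len=4096,
            max_seq_len=4096 + 256,  # prompt + generated tokens
        )
        self._llm = LLM(model=MODEL_ID, build_config=build_config)

    def generate(self, prompt: str) -> str:
        params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
        # For a list of prompts, generate returns one result per prompt.
        outputs = self._llm.generate([prompt], sampling_params=params)
        return outputs[0].outputs[0].text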

if __name__ == "__main__":
    ray.init()

    workers = [TRTLLMWorker.remote() for _ in range(2)]

    futures = [
        w.generate.remote(f"Hello from worker {i}")
        for i, w in enumerate(workers)
    ]
    print(ray.get(futures))

Caveats:

6. Practical Implementation Tips (A VP-Safe Summary)

Here’s where we try to save both your developers’ time and your VP’s budget.

6.1 Quantization: How Aggressive Should You Be?

TensorRT-LLM supports different quantization schemes like FP8 and INT4 (e.g., AWQ, GPTQ variants).

Reasonable starting strategy:

  1. Start with FP16.
    • Minimal accuracy risk
    • Already gives a sizable speed / memory win over pure FP32 PyTorch.
  2. Move to FP8 if:
    • You’re on recent GPUs (e.g., Hopper) with FP8 Tensor Cores
    • Latency and throughput matter more than small quality-of-results (QoR) regressions.
  3. Experiment with INT4 AWQ if:
    • You have a large model that doesn’t fit in GPU memory otherwise
    • You have internal evals / golden tests to confirm quality stays acceptable.
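
To see why precision matters for "does it fit," a back-of-the-envelope estimate of weight memory is enough: parameters times bits per weight, divided by eight. The sketch below assumes a round 8 billion parameters and counts weights only; KV cache, activations, and quantization metadata (scales, zero points) come on top.

```python
# Bits stored per weight for each precision.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "FP8": 8, "INT4": 4}


def weight_gb(num_params, precision):
    """Approximate weight footprint in decimal gigabytes (weights only)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for precision in BITS_PER_WEIGHT:
    print(f"{precision}: {weight_gb(8e9, precision):.1f} GB")
# FP32: 32.0 GB
# FP16: 16.0 GB
# FP8: 8.0 GB
# INT4: 4.0 GB
```

Each step down halves the weight footprint, which is often the difference between needing two GPUs and needing one.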

Avoid turning on “whatever quantization is newest” in prod without strong evals and business-level metrics.
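
A "strong eval" does not have to be elaborate to be useful. The sketch below is one simple way to gate a quantized engine against an FP16 baseline: score both on a small golden set and fail if accuracy drops by more than a threshold. The function names, the substring-match scoring, and the 2-point threshold are all assumptions for illustration; your real evals will be task-specific.

```python
def accuracy(outputs, expected_answers):
    """Fraction of outputs that contain the expected answer (case-insensitive)."""
    if len(outputs) != len(expected_answers) or not expected_answers:
        raise ValueError("outputs and expected_answers must be non-empty and equal length")
    hits = sum(
        expected.lower() in output.lower()
        for output, expected in zip(outputs, expected_answers)
    )
    return hits / len(expected_answers)


def quantization_gate(baseline_outputs, candidate_outputs, expected_answers, max_drop=0.02):
    """Compare a quantized candidate against the baseline on a golden set."""
    baseline = accuracy(baseline_outputs, expected_answers)
    candidate = accuracy(candidate_outputs, expected_answers)
    return {
        "baseline": baseline,
        "candidate": candidate,
        "passed": baseline - candidate <= max_drop,
    }


expected = ["Paris", "4", "blue"]
fp16_outputs = ["The capital is Paris.", "2 + 2 = 4", "The sky is blue."]
int4_outputs = ["Paris", "2 + 2 = 5", "Blue."]

report = quantization_gate(fp16_outputs, int4_outputs, expected)
print(report["passed"])  # False: the candidate got one of three answers wrong
```

Wiring a check like this into the build pipeline means a quantized engine never reaches production without clearing the same bar as the baseline.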

6.2 Batching and In-Flight Batching

TensorRT-LLM supports dynamic batching and in-flight batching to pack multiple prompts efficiently.
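
To build intuition for why in-flight batching helps, here is a toy simulation, not TensorRT-LLM’s scheduler. It assumes every decode step costs the same regardless of how full the batch is, ignores prefill, and treats each request as just a positive number of output tokens. Static batching waits for the longest request in each batch; in-flight batching refills a slot as soon as a request finishes.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(output_lens), batch_size):
        total += max(output_lens[start:start + batch_size])
    return total


def inflight_batching_steps(output_lens, batch_size):
    """Finished requests free their slot immediately for the next one in line."""
    queue = deque(output_lens)
    active = []  # remaining tokens for each request currently in the batch
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request decodes one token; drop those that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lens = [10, 2, 3, 2, 10, 2]
print(static_batching_steps(output_lens, batch_size=2))    # 23
print(inflight_batching_steps(output_lens, batch_size=2))  # 17
```

The gap grows with the variance in output lengths, which is exactly the situation in chat-style traffic where a few long answers share the GPU with many short ones.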

Actionable advice:

6.3 GPU Topology & Parallelism

Some quick, battle-tested heuristics:

These decisions affect your trtllm-build config directly.
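
A quick way to reason about the trade-off is to ask how many independent replicas a given tensor-parallel size leaves you. The helper below is a hypothetical planning aid: it assumes the common constraint that the attention head count must divide evenly across tensor-parallel ranks, and uses 32 heads for an 8B Llama-style model.

```python
def plan_parallelism(num_gpus, tp_size, num_attention_heads):
    """Return how many engine replicas fit for a given tensor-parallel size."""
    if tp_size < 1 or num_gpus < tp_size:
        raise ValueError("tp_size must be between 1 and num_gpus")
    if num_attention_heads % tp_size != 0:
        raise ValueError("attention heads must divide evenly across TP ranks")
    replicas = num_gpus // tp_size
    return {
        "tp_size": tp_size,
        "replicas": replicas,
        "idle_gpus": num_gpus - replicas * tp_size,
    }


print(plan_parallelism(num_gpus=8, tp_size=2, num_attention_heads=32))
# {'tp_size': 2, 'replicas': 4, 'idle_gpus': 0}
```

Smaller TP sizes give you more replicas (and usually better aggregate throughput), while larger TP sizes are for models that do not fit, or do not meet latency targets, on fewer GPUs.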

6.4 Safety & Security Considerations

Nothing here is exotic, but a few easy wins:

7. How TensorRT-LLM Actually Compares in the Wild

Let’s put all of this into a more qualitative comparison based on what we’ve seen and what NVIDIA documents.

7.1 vLLM vs TensorRT-LLM

vLLM strengths:

TensorRT-LLM strengths:

We’d summarize it as:

vLLM is ergonomics-first; TensorRT-LLM is performance-first (on NVIDIA hardware).

A lot of orgs end up with both: vLLM for R&D / feature prototyping, TensorRT-LLM for “this powers a revenue-critical product” workloads.

7.2 Hugging Face TGI vs TensorRT-LLM

TGI (Text Generation Inference):

TensorRT-LLM:

If your org is deep in Hugging Face land already and just wants something that “mostly works,” TGI can be a very pragmatic choice. If you’re squeezing infra at scale, TensorRT-LLM looks more attractive.

8. Implementation Checklist (Copy/Paste for Your Next Tech Spec)

To make this immediately usable, here’s the rough sequence we’d recommend for a new deployment:

  1. Pick a stable model
    • E.g., meta-llama/Meta-Llama-3-8B-Instruct
    • Freeze the checkpoint version for at least one deployment cycle.
  2. Build baseline engines
    • trtllm-build with FP16, tp_size matching your GPUs
    • Conservative max_input_len / max_output_len
  3. Smoke test with Python LLM API
    • Validate outputs vs HF reference on a small eval set
    • Check latency / throughput for a few batch sizes
  4. Decide serving path
    • For a single app / team → FastAPI + LLM API or trtllm-serve
    • For org-wide serving → Triton backend with autoscaling
  5. Add observability
    • Token throughput, latency percentiles, GPU utilization, OOM events
    • Log prompts & responses in a privacy-aware way for regressions
  6. Iterate on perf
    • Try FP8 / INT4 AWQ with automatic regression checks
    • Tune batch sizes and scheduler configs
    • Adjust TP / replica counts based on real traffic
  7. Compare vs a baseline (vLLM / TGI)
    • Run the same eval suite
    • Decide if the TensorRT-LLM complexity is justified for your use case.
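
For the observability and comparison steps, you need the same few numbers from every candidate stack. The sketch below is one minimal way to compute them from a load test, using a nearest-rank percentile and hypothetical sample data; in production you would normally get these from your metrics system rather than hand-rolled code.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, total_tokens, wall_clock_s):
    """Latency percentiles plus aggregate token throughput for one run."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "tokens_per_s": total_tokens / wall_clock_s,
    }


# Hypothetical per-request latencies (seconds) from a small load test.
latencies = [0.8, 1.0, 1.2, 0.9, 3.0, 1.1, 1.0, 0.95, 1.05, 1.3]
print(summarize(latencies, total_tokens=2560, wall_clock_s=4.0))
# {'p50_s': 1.0, 'p95_s': 3.0, 'tokens_per_s': 640.0}
```

Note how a single slow request dominates p95 while leaving p50 untouched, which is why the checklist asks for percentiles rather than averages.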

9. Closing Thoughts

TensorRT-LLM is not the “hello world” of LLM serving. It asks more of your infra team up front:

But the payoff—when your workloads and hardware justify it—is very real: tighter latency SLOs, better GPU utilization, cleaner integration into the NVIDIA stack, and a solid path to long-term cost control.

If we were sitting in a room with your team, our short version would be:

Start simple, measure honestly, and only turn on the “crazy GPU wizardry” once the basics are solid.

And if your engineers come back saying “we hit the limits of vLLM,” this guide should give them a head-start on what comes next with TensorRT-LLM.

Tega Adeyemi, December 1, 2025