LLM Inference Principles

What is LLM Inference

LLM (Large Language Model) inference is the process by which, given an input text (prompt), the model uses self-attention to compute a probability distribution over its vocabulary, predicts the next token from it, and repeats this step to generate the output text.

Inference Flow

┌──────────────────────────────────────────────────────────────┐
│                      LLM Inference Flow                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Input Text ──→ Tokenize ──→ Embed + Pos Enc ──→ Transformer │
│                                                      │       │
│                                                      ▼       │
│  Output Text ◀── Detokenize ◀── Sampling ◀── Vocabulary Proj │
│                                                              │
│  (sampled token is appended to the input; loop repeats)      │
└──────────────────────────────────────────────────────────────┘

Stage Descriptions

| Stage | Input | Output | Description |
|---|---|---|---|
| Tokenize | Text | Token ID sequence | Split text into model-processable units |
| Embed | Token ID | Vector | Map each token to a high-dimensional vector |
| Positional Encoding | Vector | Vector with position info | Let the model perceive token position relationships |
| Transformer | Vector sequence | Hidden states | Core computation: multi-layer self-attention |
| Vocabulary Projection | Hidden states | Vocabulary probabilities | Predict the next token |
| Sampling | Probability distribution | Token ID | Select the output token according to a strategy |
| Detokenize | Token IDs | Text | Convert the token sequence back to text |
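The stages above can be sketched as a minimal autoregressive loop. Everything here is illustrative: `toy_model` stands in for a real Transformer, returning a hard-coded distribution over a 4-token vocabulary.

```python
# Minimal autoregressive decoding loop (illustrative only).
# A real model would run embeddings + Transformer layers + vocabulary
# projection; toy_model fakes that with a fixed transition table.

VOCAB = ["<eos>", "hello", "world", "!"]

def toy_model(token_ids):
    """Return a probability distribution over the next token."""
    last = token_ids[-1]
    table = {1: 2, 2: 3, 3: 0, 0: 0}  # hello -> world -> ! -> <eos>
    probs = [0.0] * len(VOCAB)
    probs[table[last]] = 1.0
    return probs

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = toy_model(ids)                                   # vocabulary projection
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy sampling
        if next_id == 0:                                         # <eos> ends generation
            break
        ids.append(next_id)
    return ids

print([VOCAB[i] for i in generate([1])])  # ['hello', 'world', '!']
```

The loop makes the feedback edge in the diagram concrete: each sampled token is appended to the input before the next forward pass.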

Core Technical Components

1. Tokenizer

The tokenizer converts text into a sequence of token IDs that the model can process.

Chinese Tokenization Example:

Input: "今天天气很好"  ("The weather is nice today")
Output: [192, 3847, 3847, 2093, 452, 2398]  # 6 tokens (illustrative IDs; the repeated 天 maps to the same ID)

English Tokenization Example:

Input: "hello world"
Output: [15339, 1917]  # 2 tokens
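As a rough sketch of what a tokenizer does, the toy below replaces real subword algorithms (such as BPE) with a greedy longest-prefix lookup over a two-entry vocabulary; the vocabulary and its IDs are assumptions for illustration.

```python
# Toy tokenizer: greedy longest-prefix match against a tiny vocabulary.
# Real LLM tokenizers learn a subword vocabulary (e.g. BPE) with tens
# of thousands of entries; this dictionary is made up for the example.

VOCAB = {"hello": 15339, " world": 1917}
INV = {v: k for k, v in VOCAB.items()}

def tokenize(text):
    ids, rest = [], text
    while rest:
        # Try the longest vocabulary entry first.
        for tok in sorted(VOCAB, key=len, reverse=True):
            if rest.startswith(tok):
                ids.append(VOCAB[tok])
                rest = rest[len(tok):]
                break
        else:
            raise ValueError(f"no token matches: {rest!r}")
    return ids

def detokenize(ids):
    return "".join(INV[i] for i in ids)

print(tokenize("hello world"))    # [15339, 1917]
print(detokenize([15339, 1917]))  # 'hello world'
```

Note that the space is carried inside the token " world", which is how many real tokenizers handle whitespace.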

2. Self-Attention

Self-attention is the core of the Transformer. Each position in the sequence produces a Query, a Key, and a Value vector; every Query is compared against all Keys, and the resulting weights decide how much of each Value flows into that position's output. This lets the model directly relate any two positions in the text:

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

where d_k is the Key dimension; dividing by √d_k keeps the dot products from growing too large before the softmax.
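The formula can be computed directly for small matrices. The sketch below is a single-head, no-batch version in pure Python; the example vectors are made up so the Query visibly attends to the matching Key.

```python
import math

# Scaled dot-product attention (single head, no batching).
# Q, K, V are lists of equal-length vectors.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # scores[i] = q · K[i] / sqrt(d)
        scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the Value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query, aligned with the first key
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))             # most weight goes to the first value
```

Because the query matches the first key, the first value dominates the weighted sum.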

3. Sampling Strategies

| Strategy | Description | Use Case |
|---|---|---|
| Greedy | Select the token with the highest probability | Deterministic output |
| Temperature | Control randomness by rescaling logits | Creative writing |
| Top-p | Sample from the smallest set with cumulative probability ≥ p | Balance quality and diversity |
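The three strategies can be sketched in a few lines each. This is a minimal pure-Python version (no batching, no tie-breaking subtleties), with made-up logits for the demo.

```python
import math, random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Always the highest-scoring token.
    return max(range(len(logits)), key=logits.__getitem__)

def top_p(logits, p=0.9, temperature=1.0, rng=random):
    probs = softmax(logits, temperature)
    # Keep the smallest set of tokens whose cumulative probability >= p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize over the kept set and sample from it.
    total = sum(probs[i] for i in kept)
    r, acc = rng.random() * total, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1]
print(greedy(logits))        # 0
print(top_p(logits, p=0.5))  # nucleus is just token 0 here, so also 0
```

With these logits the top token already carries about 66% of the mass, so a nucleus of p=0.5 contains only that token; larger p admits more candidates and more diversity.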

Temperature Effects:

| Temperature | Effect |
|---|---|
| T < 1.0 | More deterministic, conservative |
| T = 1.0 | Balanced (logits unscaled) |
| T > 1.0 | More random, creative |
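The effect is easy to see numerically: temperature divides the logits before the softmax, so T < 1 sharpens the distribution and T > 1 flattens it. The logits below are arbitrary example values.

```python
import math

# Temperature scaling: divide logits by T before the softmax.

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
# T=0.5 concentrates mass on the top token; T=2.0 spreads it out
```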

Inference Efficiency Optimization

KV Cache

Cache computed Key and Value to avoid redundant calculations:

Without KV Cache: recompute Key/Value for all historical tokens at every generation step
With KV Cache: compute Key/Value only for the new token and reuse the cached history
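A toy counter makes the saving concrete. The "model" here is fake (the K/V projection is a stand-in, and the next token is a dummy value); what matters is how many per-token K/V computations each strategy performs.

```python
# Count per-token K/V computations with and without a cache.

calls = {"n": 0}

def compute_kv(token):
    calls["n"] += 1
    return (token * 2, token * 3)  # stand-in for the K and V projections

def generate_no_cache(prompt, steps):
    seq = list(prompt)
    for _ in range(steps):
        kv = [compute_kv(t) for t in seq]  # recompute the whole history
        seq.append(len(kv))                # dummy "next token"
    return seq

def generate_with_cache(prompt, steps):
    seq, cache = list(prompt), []
    for _ in range(steps):
        while len(cache) < len(seq):
            cache.append(compute_kv(seq[len(cache)]))  # only new tokens
        seq.append(len(cache))             # dummy "next token"
    return seq

calls["n"] = 0
generate_no_cache([1, 2], steps=3)
no_cache_calls = calls["n"]

calls["n"] = 0
generate_with_cache([1, 2], steps=3)
cache_calls = calls["n"]

print(no_cache_calls, cache_calls)  # 9 4
```

Without the cache the work grows quadratically with sequence length (2 + 3 + 4 = 9 calls for three steps); with it, each token's K/V is computed exactly once.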

Batching

Process multiple requests simultaneously to improve GPU utilization:

Single request: [User 1's input]
Batching: [User 1, User 2, User 3, ...] inputs
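Since requests arrive with different lengths, batching typically pads them to a common length and tracks the real tokens with a mask. A minimal sketch, with an assumed padding ID of 0:

```python
# Pad variable-length requests to one rectangular batch.
# PAD_ID = 0 is an assumption; real tokenizers define their own pad token.

PAD_ID = 0

def make_batch(requests):
    max_len = max(len(r) for r in requests)
    batch = [r + [PAD_ID] * (max_len - len(r)) for r in requests]
    # attention mask: 1 for real tokens, 0 for padding
    mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in requests]
    return batch, mask

batch, mask = make_batch([[5, 6, 7], [8], [9, 10]])
print(batch)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)   # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The mask lets the model ignore padding positions, so one GPU pass serves all three users.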

Quantization

Use lower precision (e.g., INT8) to reduce memory usage:

| Precision | Memory | Quality Loss |
|---|---|---|
| FP32 | 100% | None (baseline) |
| FP16 | 50% | Minimal |
| INT8 | 25% | Acceptable |
| INT4 | 12.5% | Task-dependent |
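The idea behind INT8 quantization can be shown with a symmetric scheme: map floats into [-127, 127] with a single scale factor, then multiply back to recover approximations. Real deployments add per-channel scales, zero points, and calibration; this is only the core arithmetic, on made-up weights.

```python
# Symmetric INT8 quantization sketch (single scale, no zero point).

def quantize_int8(xs):
    scale = max(abs(x) for x in xs) / 127 or 1.0  # guard against all-zero input
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # [50, -127, 0, 100]
print(restored)  # close to the original weights
```

Each INT8 value needs 1 byte versus 4 for FP32, which is where the 25% memory figure in the table comes from; the rounding error is bounded by half the scale, which is the "acceptable" quality loss.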

Platform Optimization Strategies

ai.TokenHub provides efficient inference through:

  • Smart Routing: Automatically select optimal model and node
  • Distributed Inference: Multi-node collaboration for large requests
  • Cache Reuse: Return cached results for identical requests
  • Traffic Scheduling: Balance load through peak-shaving