Hands-on Guide to Creating Your Own Multimodal LLM

A Multimodal LLM is an AI model capable of processing and integrating information from different "modalities," most commonly text and images. While text-only LLMs like GPT-3 operate solely on text tokens, multimodal models such as Claude or GPT-4o also accept images. Open designs typically do this with a Vision Encoder that turns images into "visual tokens" the language model can read alongside text.
Why build one?
Contextual Awareness: Models can answer questions about photos, charts, or UI screenshots.
Versatility: One model handles multiple input types, reducing the need for separate specialized pipelines.
Control: Building from scratch gives you full transparency over the weights and training data.
Building a Large Language Model (LLM) from scratch is no longer just for big tech labs. With the right tools and a deep understanding of modern architecture, you can build, train, and align a "Mini Claude AI" that understands more than just text. This guide walks you through the end-to-end journey—from the foundational transformer blocks to advanced alignment techniques like RLHF with GRPO.
Understanding the Architecture: The Modern Transformer
Traditional transformers (like the original GPT) have evolved. To create a model that is stable, fast, and capable of long-context reasoning, we incorporate three key modern components:
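RMSNorm (Root Mean Square Layer Normalization): Unlike standard LayerNorm, RMSNorm does not subtract the mean from its inputs; it only rescales them by their root mean square. Skipping the mean computation makes the layer cheaper. The original RMSNorm paper reports running-time reductions of roughly 7% to 64% depending on the model, so the saving is workload-dependent. In practice it trains about as stably as LayerNorm, which is why many recent LLMs use it.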
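RoPE (Rotary Positional Embeddings): Instead of adding absolute position vectors (0, 1, 2...) to the token embeddings, RoPE rotates pairs of dimensions in each query and key vector by an angle proportional to the token's position. The attention score between two tokens then depends only on their relative distance. Extrapolating to sequences longer than those seen in training is possible, but it usually needs extra help such as position interpolation or frequency scaling.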
Mixture-of-Experts (MoE): MoE replaces each dense Feed-Forward layer with several "experts." A router (gating network) sends each token to only 1 or 2 of them. A model can therefore hold, say, 10B parameters while using only about 1B per token, which cuts the compute per token. All experts must still be kept in memory, so the saving is in FLOPs, not in VRAM.
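To make the total-versus-active distinction concrete, here is a rough back-of-the-envelope count. The layer sizes are illustrative assumptions, not figures from this guide, and the function name moe_ffn_params is hypothetical. Only the feed-forward experts and the router are counted; attention and embeddings are ignored.

```python
def moe_ffn_params(hidden_dim, num_experts, top_k, num_layers, expansion=4):
    """Count feed-forward parameters in a sparse MoE stack (biases ignored)."""
    # One expert = up-projection plus down-projection
    per_expert = 2 * hidden_dim * (hidden_dim * expansion)
    router = hidden_dim * num_experts
    # Every expert has to be stored...
    total = num_layers * (num_experts * per_expert + router)
    # ...but only top_k experts run for any given token
    active = num_layers * (top_k * per_expert + router)
    return total, active


total, active = moe_ffn_params(hidden_dim=1024, num_experts=8, top_k=1, num_layers=12)
print(f"total: {total / 1e6:.1f}M, active per token: {active / 1e6:.1f}M")
```

With these assumed sizes the stack stores about eight times more feed-forward parameters than it uses per token, which is the trade-off described above: more memory in exchange for less compute per token.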
Theoretical Foundation: How it Works
The architecture of a Mini Claude consists of three main pillars:
The Vision Encoder: Usually a Vision Transformer (ViT). It breaks an image into patches, flattens them, and converts them into embeddings (vectors).
The Projection Layer: A linear bridge that maps the vision embeddings into the same dimensional space as the text embeddings.
The LLM (Language Model): A transformer-based decoder that receives the concatenated sequence of visual tokens and text tokens to generate a response.
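The three pillars can be sketched as a shape walkthrough. This is a minimal illustration under stated assumptions, not a full ViT: TinyVisionEncoder is a hypothetical name, the dimensions are arbitrary, and the encoder is reduced to a patch embedding with no position embeddings or transformer blocks.

```python
import torch
import torch.nn as nn


class TinyVisionEncoder(nn.Module):
    """Patch embedding only; a real ViT adds position embeddings and transformer blocks."""

    def __init__(self, patch_size=16, vision_dim=64):
        super().__init__()
        # A strided convolution splits the image into patches and embeds each one
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                   # (B, 3, H, W)
        x = self.patch_embed(images)             # (B, vision_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, vision_dim)


vision_dim, text_dim, vocab_size = 64, 128, 1000   # illustrative sizes
encoder = TinyVisionEncoder(vision_dim=vision_dim)
projector = nn.Linear(vision_dim, text_dim)        # the projection layer
token_embed = nn.Embedding(vocab_size, text_dim)   # the LLM's text embedding table

images = torch.randn(2, 3, 64, 64)                 # 64x64 image -> 4x4 = 16 patches
text_ids = torch.randint(0, vocab_size, (2, 10))

visual_tokens = projector(encoder(images))         # (2, 16, 128)
text_tokens = token_embed(text_ids)                # (2, 10, 128)
# The decoder would consume this combined sequence
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)
```

After projection the visual tokens have the same width as the text embeddings, so the decoder can treat them as ordinary sequence positions.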
Hands-on Tutorial: Implementation in PyTorch
Step 1: Modernizing the Block (RMSNorm & RoPE)
First, we replace standard components with their modern counterparts.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel gain

    def forward(self, x):
        # Mean of squares over the feature dimension (no mean subtraction)
        norm_x = x.pow(2).mean(-1, keepdim=True)
        x_normed = x * torch.rsqrt(norm_x + self.eps)
        return x_normed * self.weight

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq, xk: (batch, seq_len, n_heads, head_dim)
    # freqs_cis: complex tensor of shape (seq_len, head_dim // 2)
    # RoPE logic: view each pair of dimensions as a complex number and rotate it
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Add batch and head axes so the rotations broadcast over them
    freqs_cis = freqs_cis[None, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
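apply_rotary_emb expects a precomputed freqs_cis tensor that the guide does not define. The sketch below shows one common way to build it and checks the relative-position property on single vectors. The names precompute_freqs_cis and rotate, and the base theta=10000.0, are assumptions chosen for illustration.

```python
import torch


def precompute_freqs_cis(head_dim, seq_len, theta=10000.0):
    # One rotation frequency per pair of dimensions
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # (seq_len, head_dim / 2)
    # Unit-magnitude complex numbers e^{i * angle}
    return torch.polar(torch.ones_like(angles), angles)


def rotate(x, freqs_cis_at_pos):
    # x: (head_dim,), treated as head_dim / 2 complex pairs
    x_c = torch.view_as_complex(x.float().reshape(-1, 2))
    return torch.view_as_real(x_c * freqs_cis_at_pos).flatten()


torch.manual_seed(0)
head_dim = 8
freqs_cis = precompute_freqs_cis(head_dim, seq_len=32)
q, k = torch.randn(head_dim), torch.randn(head_dim)

# Same relative offset (3) at two different absolute positions
score_a = torch.dot(rotate(q, freqs_cis[5]), rotate(k, freqs_cis[2]))
score_b = torch.dot(rotate(q, freqs_cis[20]), rotate(k, freqs_cis[17]))
print(torch.allclose(score_a, score_b, atol=1e-4))
```

The two scores agree because rotating the query by position m and the key by position n leaves only the difference m - n in their dot product.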
Step 2: Implementing Mixture-of-Experts (MoE)
We define a Sparse MoE layer to increase the model's capacity without a proportional increase in per-token compute.
Python
class MoELayer(nn.Module):
    def __init__(self, num_experts, hidden_dim):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList([nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.SiLU(),
            nn.Linear(hidden_dim * 4, hidden_dim)
        ) for _ in range(num_experts)])

    def forward(self, x):
        # x: (..., hidden_dim)
        logits = self.router(x)
        weights = torch.softmax(logits, dim=-1)
        # For simplicity, top-1 routing
        top_1_weights, indices = torch.topk(weights, 1, dim=-1)
        # In practice, you would use scattered matmuls for efficiency
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Select the tokens routed to expert i
            mask = (indices == i).squeeze(-1)
            if mask.any():
                out[mask] = expert(x[mask])
        # Scale each token's output by its router probability
        return out * top_1_weights
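The MoELayer above has no mechanism to keep the router from sending most tokens to a single expert. A common remedy is an auxiliary load-balancing term added to the training loss. The sketch below follows the form used in the Switch Transformer paper, N * sum(f_i * P_i). The function name and the flat (num_tokens, num_experts) input shape are assumptions for illustration.

```python
import torch


def load_balancing_loss(router_logits, num_experts):
    """Auxiliary loss N * sum_i(f_i * P_i) for top-1 routing.

    router_logits: (num_tokens, num_experts); flatten batch and sequence first.
    """
    probs = torch.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)                        # chosen expert per token
    # f_i: fraction of tokens routed to expert i
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()
    # P_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)


uniform = load_balancing_loss(torch.zeros(64, 4), num_experts=4)

skewed_logits = torch.zeros(64, 4)
skewed_logits[:, 0] = 5.0                              # every token strongly prefers expert 0
skewed = load_balancing_loss(skewed_logits, num_experts=4)
print(uniform.item(), skewed.item())
```

The term equals 1 when the router probabilities are uniform and grows as routing concentrates on a few experts. It is usually multiplied by a small coefficient before being added to the language-modeling loss.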
Step 3: Supervised Fine-Tuning (SFT)
Once the base model is trained on raw text (pre-training), we perform SFT using a high-quality dataset of instruction-response pairs. This "teaches" the model how to follow a conversation.
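A minimal sketch of the SFT objective, assuming each training example is a prompt followed by a response in one token sequence. The loss is ordinary next-token cross-entropy, but prompt positions are masked so only response tokens are trained on. The name sft_loss and the prompt_lengths argument are hypothetical, and padding is not handled here.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of F.cross_entropy


def sft_loss(logits, input_ids, prompt_lengths):
    # logits: (B, T, V); input_ids: (B, T); prompt_lengths: (B,)
    labels = input_ids.clone()
    positions = torch.arange(input_ids.shape[1]).unsqueeze(0)    # (1, T)
    # Do not train on prompt tokens, only on the response
    labels[positions < prompt_lengths.unsqueeze(1)] = IGNORE_INDEX
    # Position t predicts token t + 1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )


torch.manual_seed(0)
B, T, V = 2, 6, 11
logits = torch.randn(B, T, V)              # stand-in for model output
input_ids = torch.randint(0, V, (B, T))
prompt_lengths = torch.tensor([3, 2])      # first 3 and 2 tokens are the prompt
loss = sft_loss(logits, input_ids, prompt_lengths)
print(loss.item())
```

Masking the prompt keeps the model from spending capacity on reproducing user instructions and focuses the gradient on the assistant's replies.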
Step 4: The Alignment (RLHF with PPO and GRPO)
To make our "Mini Claude" helpful and harmless, we use Reinforcement Learning from Human Feedback (RLHF).
PPO (Proximal Policy Optimization): The classic approach. It uses a Reward Model (to score output) and a Value Model (to estimate future rewards).
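GRPO (Group Relative Policy Optimization): A more recent alternative that eliminates the heavy "Value Model." It samples a group of outputs for a single prompt and scores each one relative to the group's mean reward. Because no critic of comparable size to the policy has to be trained or held in memory, it needs noticeably less VRAM than PPO, and it has been used with good results on reasoning tasks.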
Python
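# Conceptual GRPO loss for one prompt's group of G sampled completions
def compute_grpo_loss(logits, token_ids, rewards, kl_penalty):
    # logits: (G, T, V); token_ids: (G, T) sampled tokens; rewards: (G,)
    # Rewards are normalized within the group to calculate advantages
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Log-probability of each token that was actually sampled: (G, T)
    log_probs = logits.log_softmax(-1).gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    # Standard policy gradient, weighted by each completion's advantage
    loss = -(log_probs * advantages.unsqueeze(-1)).mean()
    return loss + kl_penalty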
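The conceptual loss above computes neither the clipped probability ratio nor the KL term itself. The sketch below adds both, following the general shape of the objective in the DeepSeekMath paper. It assumes per-token log-probabilities have already been gathered for the current, sampling-time, and reference policies. The function name and the default values of clip_eps and beta are illustrative assumptions.

```python
import torch


def grpo_clipped_loss(new_logp, old_logp, ref_logp, rewards, mask,
                      clip_eps=0.2, beta=0.04):
    # new_logp, old_logp, ref_logp, mask: (G, T); rewards: (G,)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                                   # same advantage for every token
    # Probability ratio between the current and the sampling-time policy
    ratio = torch.exp(new_logp - old_logp)
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    # Non-negative per-token estimate of KL(current || reference)
    log_ratio_ref = ref_logp - new_logp
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token = -(surrogate - beta * kl)
    # Average over real (non-padding) tokens only
    return (per_token * mask).sum() / mask.sum().clamp(min=1)


torch.manual_seed(0)
G, T = 4, 5
old_logp = -torch.rand(G, T)                   # stand-in log-probabilities
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
mask = torch.ones(G, T)
loss = grpo_clipped_loss(old_logp.clone(), old_logp, old_logp, rewards, mask)
print(loss.item())
```

When all three policies coincide, as on the first update after sampling, the loss value is approximately zero because the normalized advantages sum to zero. Its gradient is still nonzero, and it pushes probability toward the above-average completions.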
Summary
You have now explored the full stack of modern LLM development. RMSNorm and RoPE give you the normalization and positional scheme used by most current open models. MoE adds capacity without a matching rise in per-token compute. Finally, SFT and GRPO align the model to behave as a useful assistant. This end-to-end understanding is the foundation for building autonomous agents and custom AI systems.
Next Steps
Add Multimodality: Integrate a Vision Transformer (ViT) as an encoder and use a projection layer to map image tokens into the same space as your text tokens.
Experiment with DPO: Implement Direct Preference Optimization (DPO) to see how it compares to the RLHF (PPO/GRPO) pipeline in terms of training stability.
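Deploy with vLLM: Once your PyTorch weights are saved, use the vLLM library to serve the model with high throughput and efficient KV-cache management. vLLM ships implementations for specific architectures, so a custom model like this one first has to be registered with vLLM or exported in a format it already supports.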
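For the DPO experiment, the core of the method is a single loss over preference pairs, with no reward model and no sampling loop. The sketch below assumes you have already computed the summed log-probability of each chosen and rejected response under both the policy and a frozen reference model. The function name and beta=0.1 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Each argument: (B,) summed log-probability of a full response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Reward the policy for preferring the chosen response more than the reference does
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


ref_chosen = torch.tensor([-12.0, -9.5])
ref_rejected = torch.tensor([-11.0, -10.0])

# A policy identical to the reference sits at the starting point, loss = log(2)
start = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)
# A policy that raised the chosen responses' likelihood gets a lower loss
improved = dpo_loss(ref_chosen + 2.0, ref_rejected - 2.0, ref_chosen, ref_rejected)
print(start.item(), improved.item())
```

Comparing this against the GRPO pipeline is mostly a matter of data: DPO needs fixed preference pairs, while GRPO needs a reward signal that can score freshly sampled completions.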




