diff --git a/contrib/models/InternVL3-8B-Instruct/README.md b/contrib/models/InternVL3-8B-Instruct/README.md
new file mode 100644
index 00000000..d12f5698
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/README.md
@@ -0,0 +1,212 @@
+# Contrib Model: InternVL3-8B-Instruct
+
+InternVL3-8B-Instruct is a vision-language model (VLM) running on AWS Trainium2 via NxD Inference. It supports both text-only and multimodal (text + image) inference using the NeuronBaseForImageToText framework.
+
+**Maintainer:** Jim Burtoft ([@jimburtoft](https://github.com/jimburtoft))
+
+## Model Information
+
+- **HuggingFace ID:** [`OpenGVLab/InternVL3-8B-Instruct`](https://huggingface.co/OpenGVLab/InternVL3-8B-Instruct)
+- **Model Type:** Vision-language model (decoder-only transformer with vision encoder)
+- **Parameters:** ~8B total (InternViT-300M vision encoder + Qwen2.5-7B text backbone)
+- **Architecture:** GQA (28 heads / 4 KV heads), RoPE, RMSNorm (text); LayerNorm, GELU, absolute position embeddings (vision); pixel shuffle downsampling + 2-layer MLP projector
+- **License:** MIT (Apache-2.0 for Qwen2.5 component)
+- **Precision:** BF16
+
+### Architecture Overview
+
+| Component | Details |
+|-----------|---------|
+| **Vision encoder** | InternViT-300M-448px-V2.5: 24 layers, hidden=1024, 16 heads, patch_size=14, image_size=448 |
+| **Projector** | Pixel shuffle (downsample_ratio=0.5, 1024→256 tokens) + LayerNorm + Linear(4096, 3584) + GELU + Linear(3584, 3584) |
+| **Text backbone** | Qwen2.5-7B: 28 layers, hidden=3584, intermediate=18944, GQA (28/4), vocab=151674, tie_word_embeddings=False |
+
+## Validation Results
+
+**Validated:** 2026-04-28
+**Instance:** trn2.3xlarge (LNC=2, TP=4)
+**SDK:** Neuron SDK 2.29 (NxDI 0.9.17334, neuronx-cc 2.24.5133.0, PyTorch 2.9)
+
+### Benchmark Results
+
+#### Performance (TP=4, batch_size=1)
+
+| Sequence Length | TTFT (ms) | TKG Throughput (tok/s) |
+|----------------|-----------|------------------------|
+| 2048 | 138 | 75.1 |
+| 4096 | 230 | 58.9 |
+| 8192 | 482 | 40.0 |
+| 16384 | 1019 | 23.6 |
+| 32768 | 2438 | 11.4 |
+
+Vision encoder latency: 34.5 ms per 448x448 tile (batch=1).
+
+#### GPU Comparison (1x NVIDIA L40S, BF16, SDPA)
+
+| Metric | GPU (L40S) | Neuron (trn2.3xlarge TP=4) | Speedup |
+|--------|------------|---------------------------|---------|
+| TTFT (2048 input tokens) | 153.5 ms | 138 ms | 1.11x |
+| Output tok/s (BS=1) | 40.5 | 75.1 | **1.85x** |
+
+### Accuracy Validation
+
+| Test | Status | Metrics |
+|------|--------|---------|
+| CTE logit comparison (vs CPU FP32) | PASS | cosine=0.9984, top-1 match, top-5 5/5, top-10 8/10 |
+| TKG text generation | PASS | Correct, coherent output ("The capital of France is Paris.") |
+| Multimodal generation | PASS | Vision encoder + text pipeline end-to-end working |
+
+## Usage
+
+### Prerequisites
+
+```bash
+# Activate NxDI environment
+source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+
+# Download model
+huggingface-cli download OpenGVLab/InternVL3-8B-Instruct --local-dir /mnt/models/InternVL3-8B-Instruct/
+```
+
+### Compile and Run
+
+```python
+import sys
+import torch
+from pathlib import Path
+from transformers import AutoTokenizer
+
+# Add contrib src to path
+sys.path.insert(0, str(Path("contrib/models/InternVL3-8B-Instruct/src")))
+
+from modeling_internvl3 import NeuronInternVL3ForCausalLM, InternVL3InferenceConfig
+from neuronx_distributed_inference.models.config import NeuronConfig
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_PATH = "/mnt/models/neuron_models/InternVL3-8B-Instruct/"
+
+# Configure
+text_neuron_config = NeuronConfig(
+    tp_degree=4,
+    max_batch_size=1,
+    seq_len=2048,
+    torch_dtype=torch.bfloat16,
+    on_device_sampling_config=None,
+    save_sharded_checkpoint=True,
+)
+vision_neuron_config = NeuronConfig(
+    tp_degree=4,
+    max_batch_size=1,
+    seq_len=256,
+    torch_dtype=torch.bfloat16,
+    on_device_sampling_config=None,
+    buckets=[1],
+    fused_qkv=True,
+    save_sharded_checkpoint=True,
+)
+
+config = InternVL3InferenceConfig.from_pretrained(
+    MODEL_PATH,
+    text_neuron_config=text_neuron_config,
+    vision_neuron_config=vision_neuron_config,
+)
+
+# Compile (first time only)
+model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+model.compile(COMPILED_PATH)
+
+# Load and generate
+model.load(COMPILED_PATH)
+
+tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+prompt = "The capital of France is"
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+
+outputs = model(input_ids=input_ids)
+next_token = outputs.logits[0, -1].argmax().item()
+print(f"{prompt} {tokenizer.decode([next_token])}")
+# Output: The capital of France is Paris
+```
+
+### Multimodal Inference
+
+```python
+# Build input with vision tokens
+IMG_CONTEXT_ID = 151667  # <IMG_CONTEXT>
+IMG_START_ID = 151665    # <img>
+IMG_END_ID = 151666      # </img>
+
+text_ids = tokenizer("Describe this image:", return_tensors="pt").input_ids[0]
+img_tokens = torch.full((256,), IMG_CONTEXT_ID, dtype=torch.long)
+
+input_ids = torch.cat([
+    text_ids,
+    torch.tensor([IMG_START_ID]),
+    img_tokens,
+    torch.tensor([IMG_END_ID]),
+]).unsqueeze(0)
+
+# Pixel values for a single 448x448 tile
+pixel_values = preprocess_image(image)  # [1, 3, 448, 448]
+
+outputs = model(input_ids=input_ids, pixel_values=pixel_values)
+```
+
+## Compatibility Matrix
+
+| Instance | SDK 2.29 | SDK 2.28 |
+|----------|----------|----------|
+| trn2.3xlarge (LNC=2, TP=4) | **VALIDATED** | Not tested |
+
+## Example Checkpoints
+
+* [OpenGVLab/InternVL3-8B-Instruct](https://huggingface.co/OpenGVLab/InternVL3-8B-Instruct)
+
+## Testing Instructions
+
+```bash
+# Ensure model is compiled first (see Usage above), then:
+cd contrib/models/InternVL3-8B-Instruct/
+pytest test/integration/test_model.py -v --tb=short
+```
+
+## Implementation Notes
+
+### Three-File VLM Architecture
+
+The model uses the NxDI `NeuronBaseForImageToText` framework with three files:
+
+- `src/modeling_internvl3.py` — Top-level VLM orchestrating vision + text
+- `src/modeling_internvl3_text.py` — Text model (Qwen2.5-7B) with vision embedding injection via `scatter_by_index_put()`
+- `src/modeling_internvl3_vision.py` — Vision encoder (InternViT-300M) compiled via `torch_neuronx.trace()` with pixel shuffle and MLP projector
+
+### Weight Mapping
+
+| HuggingFace Key | NxDI Key |
+|-----------------|----------|
+| `language_model.model.layers.{i}.*` | `layers.{i}.*` |
+| `language_model.model.embed_tokens.weight` | `embed_tokens.weight` |
+| `language_model.model.norm.weight` | `norm.weight` |
+| `language_model.lm_head.weight` | `lm_head.weight` |
+| `vision_model.*` | Vision encoder (separate NEFF) |
+| `mlp1.*` | Projector (part of vision NEFF) |
+
+### Special Tokens
+
+| Token | ID | Purpose |
+|-------|-----|---------|
+| `<IMG_CONTEXT>` | 151667 | Visual token placeholder in text sequence |
+| `<img>` | 151665 | Image region start marker |
+| `</img>` | 151666 | Image region end marker |
+| `<|im_end|>` | 151645 | EOS token |
+
+## Known Issues
+
+- **V2PE not implemented**: Variable Visual Position Encoding (described in InternVL2.5/3 papers) is not implemented in the HuggingFace model code and is not included here. Standard position IDs are used. Accuracy validation passes without V2PE.
+- **Batch size > 1**: Single-request batch inference (batch_size > 1) has a known issue with sampling_params shape. Use vLLM for multi-request concurrent serving.
+- **trust_remote_code**: The HuggingFace tokenizer requires `trust_remote_code=True`. The NxDI model code reads config.json directly and does not require it.
+- **NKI kernels**: Not applicable for this model. Qwen2.5-7B's `intermediate_size=18944` is incompatible with NxDI NKI `mlp_kernel` and `attn_block_tkg` kernels at tested TP degrees.
+
+## vLLM Integration
+
+This model can be served through vLLM-neuron with patches to the vllm-neuron worker. See the [NxD Inference vLLM User Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) for general vLLM setup. InternVL3 requires modifications to vllm-neuron's model loader and runner to register the custom architecture. Contact the maintainer for patch details.
diff --git a/contrib/models/InternVL3-8B-Instruct/accuracy_test.py b/contrib/models/InternVL3-8B-Instruct/accuracy_test.py
new file mode 100644
index 00000000..a2f28e66
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/accuracy_test.py
@@ -0,0 +1,499 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Apples-to-apples TKG accuracy comparison: CPU reference vs Neuron.
+
+Generates identical tokenized inputs, runs both CPU and Neuron generation,
+and compares tokens position-by-position.
+
+Phase 1: Generate CPU references (slow, ~5-15 min per prompt for 7B in bf16)
+Phase 2: Run Neuron generation with same inputs
+Phase 3: Compare token-by-token
+
+Usage:
+    # Generate CPU references (run once, saves to disk)
+    python accuracy_test.py --cpu-ref
+
+    # Run Neuron comparison against saved references
+    python accuracy_test.py --neuron --skip-compile
+
+    # Run both (full test)
+    python accuracy_test.py --full --skip-compile
+"""
+
+import argparse
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+import torch
+
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_PATH = "/mnt/models/neuron_models/InternVL3-8B-VLM/"
+REF_DIR = "/mnt/models/accuracy_refs/"
+TEST_IMAGE = "/mnt/models/test_image.png"
+
+# Test prompts
+TEST_PROMPTS = [
+    {
+        "name": "france_capital",
+        "prompt": "What is the capital of France?",
+        "type": "text",
+        "max_new_tokens": 32,
+    },
+    {
+        "name": "meaning_of_life",
+        "prompt": "I believe the meaning of life is",
+        "type": "text",
+        "max_new_tokens": 32,
+    },
+    {
+        "name": "simple_math",
+        "prompt": "What is 2 + 2? Answer with just the number:",
+        "type": "text",
+        "max_new_tokens": 10,
+    },
+    {
+        "name": "describe_image",
+        "prompt": "Describe this image briefly.",
+        "type": "multimodal",
+        "max_new_tokens": 32,
+    },
+]
+
+# Special tokens
+IMG_CONTEXT_ID = 151667
+IMG_START_ID = 151665
+IMG_END_ID = 151666
+
+
+def build_inputs(prompt_info, tokenizer, pixel_values=None):
+    """Build identical inputs for both CPU and Neuron using chat template."""
+    prompt = prompt_info["prompt"]
+
+    # Use chat template for proper instruction formatting
+    messages = [{"role": "user", "content": prompt}]
+    templated = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    inputs = tokenizer(templated, return_tensors="pt")
+    input_ids = inputs.input_ids
+
+    if prompt_info["type"] == "multimodal" and pixel_values is not None:
+        # Insert image tokens before the text: <img> <IMG_CONTEXT>*256 </img> <text>
+        # InternVL3 convention: image tokens come in the user message
+        text_ids = input_ids[0]
+        img_tokens = torch.full((256,), IMG_CONTEXT_ID, dtype=torch.long)
+        img_seq = torch.cat(
+            [
+                torch.tensor([IMG_START_ID]),
+                img_tokens,
+                torch.tensor([IMG_END_ID]),
+            ]
+        )
+        # Insert image tokens right after the "user\n" tokens
+        # Find position of user content in the templated string
+        input_ids = torch.cat([text_ids, img_seq]).unsqueeze(0)
+
+    attention_mask = torch.ones_like(input_ids)
+    return input_ids, attention_mask
+
+
+def generate_cpu_references(tokenizer):
+    """Generate reference outputs using HF model on CPU (slow but golden).
+
+    For text-only prompts: uses the inner Qwen2ForCausalLM language model directly.
+    For multimodal: uses InternVLChatModel.chat() which handles vision processing.
+    """
+    print("=" * 60)
+    print("PHASE 1: Generating CPU reference outputs")
+    print("=" * 60)
+    print("WARNING: This is slow (~5-15 min per prompt for 7B bf16 on CPU)")
+
+    from transformers import AutoModelForCausalLM
+
+    print("\nLoading HF InternVLChatModel on CPU (bf16)...")
+    start = time.time()
+    vlm_model = AutoModelForCausalLM.from_pretrained(
+        MODEL_PATH,
+        torch_dtype=torch.bfloat16,
+        trust_remote_code=True,
+        device_map="cpu",
+    )
+    vlm_model.eval()
+    # InternVLChatModel.generate() asserts img_context_token_id is set
+    vlm_model.img_context_token_id = IMG_CONTEXT_ID
+    # Get the inner language model for text-only generation
+    lm = vlm_model.language_model
+    lm.eval()
+    print(f"Model loaded in {time.time() - start:.1f}s")
+
+    # Load test image
+    pixel_values = None
+    if os.path.exists(TEST_IMAGE):
+        from PIL import Image
+        from torchvision import transforms
+
+        img = Image.open(TEST_IMAGE).convert("RGB")
+        transform = transforms.Compose(
+            [
+                transforms.Resize((448, 448)),
+                transforms.ToTensor(),
+                transforms.Normalize(
+                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
+                ),
+            ]
+        )
+        pixel_values = transform(img).unsqueeze(0).to(torch.bfloat16)
+        print(f"Test image loaded: {TEST_IMAGE}")
+
+    os.makedirs(REF_DIR, exist_ok=True)
+    results = {}
+
+    # EOS for InternVL3 is <|im_end|> = 151645
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+    print(f"Using eos_token_id={eos_token_id} (<|im_end|>)")
+
+    for pinfo in TEST_PROMPTS:
+        name = pinfo["name"]
+        if pinfo["type"] == "multimodal" and pixel_values is None:
+            print(f"\nSkipping {name} (no test image)")
+            continue
+
+        # Skip if reference already exists
+        ref_path = os.path.join(REF_DIR, f"{name}.json")
+        if os.path.exists(ref_path):
+            print(f"\n--- {name}: already exists, skipping ---")
+            with open(ref_path) as f:
+                results[name] = json.load(f)
+            continue
+
+        print(f"\n--- {name} ---")
+        pv = pixel_values if pinfo["type"] == "multimodal" else None
+        input_ids, attention_mask = build_inputs(pinfo, tokenizer, pv)
+        prompt_len = input_ids.shape[-1]
+        print(f"Input length: {prompt_len} tokens")
+
+        # Generate on CPU
+        print(f"Generating {pinfo['max_new_tokens']} tokens on CPU...")
+        start = time.time()
+        with torch.no_grad():
+            if pinfo["type"] == "multimodal" and pv is not None:
+                # Use full VLM model for multimodal (handles vision encoder)
+                output_ids = vlm_model.generate(
+                    input_ids=input_ids,
+                    attention_mask=attention_mask,
+                    pixel_values=pv,
+                    max_new_tokens=pinfo["max_new_tokens"],
+                    do_sample=False,
+                    eos_token_id=eos_token_id,
+                )
+            else:
+                # Use inner LM directly for text-only (cleaner, avoids custom generate)
+                output_ids = lm.generate(
+                    input_ids=input_ids,
+                    attention_mask=attention_mask,
+                    max_new_tokens=pinfo["max_new_tokens"],
+                    do_sample=False,
+                    eos_token_id=eos_token_id,
+                )
+        elapsed = time.time() - start
+        print(f"Generated in {elapsed:.1f}s")
+
+        generated_ids = output_ids[0, prompt_len:].tolist()
+        generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
+        print(f"Output: {generated_text!r}")
+
+        ref = {
+            "name": name,
+            "prompt": pinfo["prompt"],
+            "type": pinfo["type"],
+            "input_ids": input_ids[0].tolist(),
+            "prompt_len": prompt_len,
+            "generated_ids": generated_ids,
+            "generated_text": generated_text,
+            "max_new_tokens": pinfo["max_new_tokens"],
+            "cpu_time_s": round(elapsed, 1),
+        }
+
+        ref_path = os.path.join(REF_DIR, f"{name}.json")
+        with open(ref_path, "w") as f:
+            json.dump(ref, f, indent=2)
+        print(f"Saved: {ref_path}")
+        results[name] = ref
+
+    del vlm_model, lm
+    torch.cuda.empty_cache() if torch.cuda.is_available() else None
+    return results
+
+
+def run_neuron_comparison(tokenizer, skip_compile=False):
+    """Run Neuron generation and compare against CPU references."""
+    print("\n" + "=" * 60)
+    print("PHASE 2: Neuron generation + comparison")
+    print("=" * 60)
+
+    from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+    from neuronx_distributed_inference.models.config import NeuronConfig
+    from neuronx_distributed_inference.utils.hf_adapter import (
+        HuggingFaceGenerationAdapter,
+    )
+
+    # Create config and load model
+    text_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=2048,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+    )
+    vision_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,
+        save_sharded_checkpoint=True,
+    )
+
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+
+    model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+    if not skip_compile:
+        model.compile(COMPILED_PATH)
+    model.load(COMPILED_PATH)
+
+    adapter = HuggingFaceGenerationAdapter(model)
+
+    # Load test image
+    pixel_values = None
+    if os.path.exists(TEST_IMAGE):
+        from PIL import Image
+        from torchvision import transforms
+
+        img = Image.open(TEST_IMAGE).convert("RGB")
+        transform = transforms.Compose(
+            [
+                transforms.Resize((448, 448)),
+                transforms.ToTensor(),
+                transforms.Normalize(
+                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
+                ),
+            ]
+        )
+        pixel_values = transform(img).unsqueeze(0)
+
+    all_results = {}
+
+    for pinfo in TEST_PROMPTS:
+        name = pinfo["name"]
+        ref_path = os.path.join(REF_DIR, f"{name}.json")
+        if not os.path.exists(ref_path):
+            print(f"\n--- {name}: SKIP (no CPU reference) ---")
+            continue
+
+        with open(ref_path) as f:
+            ref = json.load(f)
+
+        print(f"\n--- {name} ---")
+        print(f"Prompt: {ref['prompt']!r}")
+
+        # Build IDENTICAL inputs
+        pv = pixel_values if pinfo["type"] == "multimodal" else None
+        input_ids, attention_mask = build_inputs(pinfo, tokenizer, pv)
+        prompt_len = input_ids.shape[-1]
+
+        # Verify inputs match reference
+        ref_input_ids = ref["input_ids"]
+        actual_input_ids = input_ids[0].tolist()
+        if actual_input_ids != ref_input_ids:
+            print(f"WARNING: Input IDs differ!")
+            print(f"  Neuron: {actual_input_ids[:10]}... (len={len(actual_input_ids)})")
+            print(f"  CPU:    {ref_input_ids[:10]}... (len={len(ref_input_ids)})")
+
+        # Generate on Neuron
+        gen_kwargs = dict(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=pinfo["max_new_tokens"],
+            do_sample=False,
+        )
+        if pv is not None:
+            gen_kwargs["pixel_values"] = pv
+
+        start = time.time()
+        output_ids = adapter.generate(**gen_kwargs)
+        neuron_time = time.time() - start
+
+        neuron_generated_ids = output_ids[0, prompt_len:].tolist()
+        neuron_text = tokenizer.decode(neuron_generated_ids, skip_special_tokens=True)
+
+        # Compare token-by-token (trim at first EOS for both sides)
+        cpu_ids = ref["generated_ids"]
+        eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+
+        # Trim at first EOS
+        def trim_at_eos(ids, eos):
+            for i, t in enumerate(ids):
+                if t == eos:
+                    return ids[:i]
+            return ids
+
+        neuron_trimmed = trim_at_eos(neuron_generated_ids, eos_id)
+        cpu_trimmed = trim_at_eos(cpu_ids, eos_id)
+
+        n_compare = min(
+            max(len(neuron_trimmed), len(cpu_trimmed)),
+            pinfo["max_new_tokens"],
+        )
+        # Use max length for comparison (missing tokens count as mismatches)
+        if n_compare == 0:
+            n_compare = max(len(neuron_generated_ids), len(cpu_ids))
+
+        matches = 0
+        first_mismatch = None
+        details = []
+        for i in range(n_compare):
+            n_id = neuron_trimmed[i] if i < len(neuron_trimmed) else -1
+            c_id = cpu_trimmed[i] if i < len(cpu_trimmed) else -1
+            match = n_id == c_id
+            if match:
+                matches += 1
+            elif first_mismatch is None:
+                first_mismatch = i
+
+            n_tok = tokenizer.decode([n_id]) if n_id >= 0 else "<EOS>"
+            c_tok = tokenizer.decode([c_id]) if c_id >= 0 else "<EOS>"
+            details.append(
+                {
+                    "pos": i,
+                    "match": match,
+                    "neuron_id": n_id,
+                    "neuron_token": n_tok,
+                    "cpu_id": c_id,
+                    "cpu_token": c_tok,
+                }
+            )
+
+        match_rate = matches / n_compare if n_compare > 0 else 0
+
+        # Compute prefix match (up to shorter EOS-trimmed sequence)
+        min_len = min(len(neuron_trimmed), len(cpu_trimmed))
+        prefix_matches = 0
+        for i in range(min_len):
+            if neuron_trimmed[i] == cpu_trimmed[i]:
+                prefix_matches += 1
+            else:
+                break  # Stop at first divergence for prefix match
+        prefix_rate = prefix_matches / min_len if min_len > 0 else 1.0
+
+        # Print results
+        print(f"CPU output:    {ref['generated_text']!r}")
+        print(f"Neuron output: {neuron_text!r}")
+        print(
+            f"Tokens (EOS-trimmed): neuron={len(neuron_trimmed)}, cpu={len(cpu_trimmed)}"
+        )
+        print(f"Prefix match:  {prefix_matches}/{min_len} ({prefix_rate:.1%})")
+        print(f"Full match:    {matches}/{n_compare} ({match_rate:.1%})")
+        if first_mismatch is not None:
+            d = details[first_mismatch]
+            print(
+                f"First mismatch at pos {first_mismatch}: "
+                f"neuron={d['neuron_token']!r} vs cpu={d['cpu_token']!r}"
+            )
+
+        # Detailed per-position output
+        for d in details:
+            flag = "MATCH" if d["match"] else "MISS "
+            print(
+                f"  pos {d['pos']:2d}: {flag} | neuron={d['neuron_token']!r:12s} cpu={d['cpu_token']!r:12s}"
+            )
+
+        # Status based on prefix match (more meaningful for EOS-trimmed comparison)
+        status = (
+            "PASS" if prefix_rate >= 0.9 else ("WARN" if prefix_rate >= 0.5 else "FAIL")
+        )
+        print(f"Status: {status}")
+
+        all_results[name] = {
+            "status": status,
+            "prefix_match": f"{prefix_matches}/{min_len} ({prefix_rate:.1%})",
+            "full_match": f"{matches}/{n_compare} ({match_rate:.1%})",
+            "neuron_tokens": len(neuron_trimmed),
+            "cpu_tokens": len(cpu_trimmed),
+            "first_mismatch_pos": first_mismatch,
+            "neuron_text": neuron_text[:200],
+            "cpu_text": ref["generated_text"][:200],
+            "neuron_time_s": round(neuron_time, 3),
+            "cpu_time_s": ref.get("cpu_time_s"),
+            "details": details,
+        }
+
+    # Summary
+    print("\n" + "=" * 60)
+    print("ACCURACY COMPARISON SUMMARY")
+    print("=" * 60)
+    overall_pass = True
+    for name, r in all_results.items():
+        print(
+            f"  {name:25s} {r['status']:5s}  prefix={r['prefix_match']}  full={r['full_match']}"
+        )
+        if r["status"] == "FAIL":
+            overall_pass = False
+
+    result_path = os.path.join(REF_DIR, "accuracy_results.json")
+    with open(result_path, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to {result_path}")
+
+    return overall_pass, all_results
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Apples-to-apples accuracy test")
+    parser.add_argument(
+        "--cpu-ref", action="store_true", help="Generate CPU references only"
+    )
+    parser.add_argument(
+        "--neuron", action="store_true", help="Run Neuron comparison only"
+    )
+    parser.add_argument(
+        "--full", action="store_true", help="Run both CPU ref + Neuron comparison"
+    )
+    parser.add_argument(
+        "--skip-compile", action="store_true", help="Skip Neuron compilation"
+    )
+    args = parser.parse_args()
+
+    if not any([args.cpu_ref, args.neuron, args.full]):
+        parser.print_help()
+        print("\nSpecify --cpu-ref, --neuron, or --full")
+        sys.exit(1)
+
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    if args.cpu_ref or args.full:
+        generate_cpu_references(tokenizer)
+
+    if args.neuron or args.full:
+        passed, results = run_neuron_comparison(tokenizer, args.skip_compile)
+        sys.exit(0 if passed else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/InternVL3-8B-Instruct/compile_internvl3_vlm.py b/contrib/models/InternVL3-8B-Instruct/compile_internvl3_vlm.py
new file mode 100644
index 00000000..4e26b846
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/compile_internvl3_vlm.py
@@ -0,0 +1,193 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Compile InternVL3-8B-Instruct full VLM on Neuron.
+
+This compiles three NEFFs:
+1. Vision encoder: InternViT-300M + pixel shuffle + MLP projector
+2. Text CTE: Qwen2.5-7B context encoding (with vision embedding injection)
+3. Text TKG: Qwen2.5-7B token generation
+
+Usage:
+    python compile_internvl3_vlm.py [--text-only] [--vision-only]
+
+Target: trn2.3xlarge LNC=2 TP=4
+"""
+
+import argparse
+import sys
+import time
+from pathlib import Path
+
+import torch
+
+# Add contrib src to path
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_PATH = "/mnt/models/neuron_models/InternVL3-8B-Instruct/"
+
+
+def create_config():
+    """Create InternVL3 VLM inference config with text + vision NeuronConfigs."""
+
+    # Text NeuronConfig: TP=4 on trn2.3xlarge LNC=2
+    text_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=2048,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+    )
+
+    # Vision NeuronConfig: Must match text TP degree (NxDI requirement)
+    # Vision encoder is small; weights are replicated across TP ranks.
+    # Bucket = [1] (one image at a time)
+    vision_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=256,  # 256 vision tokens after pixel shuffle
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],  # number of images
+        fused_qkv=True,  # vision encoder has fused QKV weights
+        save_sharded_checkpoint=True,
+    )
+
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+
+    return config
+
+
+def compile_and_test():
+    """Compile full VLM and run smoke tests."""
+    print("=" * 60)
+    print("InternVL3-8B-Instruct: Full VLM Compilation")
+    print("=" * 60)
+
+    config = create_config()
+
+    print(f"\nModel path: {MODEL_PATH}")
+    print(f"Compiled path: {COMPILED_PATH}")
+    print(f"\n--- Text Config ---")
+    print(f"  TP degree: {config.text_config.neuron_config.tp_degree}")
+    print(f"  Seq len: {config.text_config.neuron_config.seq_len}")
+    print(f"  Batch size: {config.text_config.neuron_config.max_batch_size}")
+    print(f"  hidden_size: {config.text_config.hidden_size}")
+    print(f"  num_hidden_layers: {config.text_config.num_hidden_layers}")
+    print(f"  vocab_size: {config.text_config.vocab_size}")
+    print(f"\n--- Vision Config ---")
+    print(f"  TP degree: {config.vision_config.neuron_config.tp_degree}")
+    print(f"  Buckets: {config.vision_config.neuron_config.buckets}")
+
+    # Create model
+    print("\n--- Creating model ---")
+    model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+
+    # Compile
+    print("\n--- Compiling (text + vision) ---")
+    start = time.time()
+    model.compile(COMPILED_PATH)
+    elapsed = time.time() - start
+    print(f"\nCompilation completed in {elapsed:.1f}s ({elapsed / 60:.1f} min)")
+
+    # Load
+    print("\n--- Loading compiled model ---")
+    start = time.time()
+    model.load(COMPILED_PATH)
+    elapsed = time.time() - start
+    print(f"Load completed in {elapsed:.1f}s")
+
+    # Smoke test 1: Text-only
+    print("\n--- Smoke test: text-only ---")
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    prompt = "The capital of France is"
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    seq_len = input_ids.shape[-1]
+    position_ids = torch.arange(seq_len, dtype=torch.int32).unsqueeze(0)
+    seq_ids = torch.zeros(1, dtype=torch.int32)
+
+    print(f"Prompt: {prompt}")
+    with torch.no_grad():
+        outputs = model(
+            input_ids=input_ids,
+            position_ids=position_ids,
+            seq_ids=seq_ids,
+        )
+        logits = outputs.logits
+        top5 = torch.topk(logits[0, -1].float(), 5)
+        print("Top-5 next tokens:")
+        for t, v in zip(top5.indices, top5.values):
+            print(
+                f"  {tokenizer.decode([t.item()])!r} (id={t.item()}, logit={v.item():.4f})"
+            )
+
+    # Smoke test 2: Multimodal (synthetic image)
+    print("\n--- Smoke test: multimodal ---")
+    # Build input with 256 <IMG_CONTEXT> tokens
+    IMG_CONTEXT_ID = 151667
+    IMG_START_ID = 151665
+    IMG_END_ID = 151666
+
+    text_before = "Describe this image:"
+    text_before_ids = tokenizer(text_before, return_tensors="pt").input_ids[0]
+
+    # Construct: <text> <img> <IMG_CONTEXT>*256 </img>
+    img_tokens = torch.full((256,), IMG_CONTEXT_ID, dtype=torch.long)
+    full_ids = torch.cat(
+        [
+            text_before_ids,
+            torch.tensor([IMG_START_ID]),
+            img_tokens,
+            torch.tensor([IMG_END_ID]),
+        ]
+    ).unsqueeze(0)
+
+    full_seq_len = full_ids.shape[-1]
+    full_position_ids = torch.arange(full_seq_len, dtype=torch.int32).unsqueeze(0)
+    full_seq_ids = torch.zeros(1, dtype=torch.int32)
+
+    # Synthetic pixel values (random noise as placeholder)
+    pixel_values = torch.randn(1, 3, 448, 448)
+
+    print(f"Input IDs shape: {full_ids.shape}")
+    print(f"Pixel values shape: {pixel_values.shape}")
+    print(
+        f"Number of <IMG_CONTEXT> tokens: {(full_ids == IMG_CONTEXT_ID).sum().item()}"
+    )
+
+    with torch.no_grad():
+        outputs = model(
+            input_ids=full_ids,
+            position_ids=full_position_ids,
+            seq_ids=full_seq_ids,
+            pixel_values=pixel_values,
+        )
+        logits = outputs.logits
+        top5 = torch.topk(logits[0, -1].float(), 5)
+        print("Top-5 next tokens (multimodal):")
+        for t, v in zip(top5.indices, top5.values):
+            print(
+                f"  {tokenizer.decode([t.item()])!r} (id={t.item()}, logit={v.item():.4f})"
+            )
+
+    print("\n" + "=" * 60)
+    print("SUCCESS: InternVL3 full VLM compiled and running on Neuron")
+    print("=" * 60)
+
+
+if __name__ == "__main__":
+    compile_and_test()
diff --git a/contrib/models/InternVL3-8B-Instruct/gpu_benchmark.py b/contrib/models/InternVL3-8B-Instruct/gpu_benchmark.py
new file mode 100644
index 00000000..115eec05
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/gpu_benchmark.py
@@ -0,0 +1,457 @@
+#!/usr/bin/env python3
+"""GPU benchmark for InternVL3-8B-Instruct.
+
+Measures TTFT, output tok/s, E2E latency across batch sizes and optimizations.
+Follows GPU Benchmark Standard from OpencodeDocs/steering/gpu-benchmark-standard.md.
+"""
+
+import argparse
+import json
+import time
+import gc
+import os
+
+import torch
+import numpy as np
+from transformers import AutoTokenizer, AutoModelForCausalLM, PreTrainedModel
+
+# Monkey-patch for transformers 5.7.0 compatibility with InternVL3 custom model.
+# InternVLChatModel.__init__ doesn't call super().__init__ in a way that sets
+# the new all_tied_weights_keys attribute required by transformers >= 5.7.0.
+_orig_ptm_init = PreTrainedModel.__init__
+
+
+def _patched_ptm_init(self, *args, **kwargs):
+    _orig_ptm_init(self, *args, **kwargs)
+    if not hasattr(self, "all_tied_weights_keys"):
+        self.all_tied_weights_keys = {}
+
+
+PreTrainedModel.__init__ = _patched_ptm_init
+
+
+def get_gpu_memory_mb():
+    """Return current GPU memory allocated in MB."""
+    if torch.cuda.is_available():
+        return torch.cuda.memory_allocated() / 1024 / 1024
+    return 0
+
+
+def get_gpu_memory_peak_mb():
+    """Return peak GPU memory allocated in MB."""
+    if torch.cuda.is_available():
+        return torch.cuda.max_memory_allocated() / 1024 / 1024
+    return 0
+
+
+def percentiles(data, pcts=[50, 95, 99]):
+    """Compute percentiles from list of values."""
+    arr = np.array(data)
+    return {f"p{p}": float(np.percentile(arr, p)) for p in pcts}
+
+
+def build_prompt(tokenizer, target_tokens):
+    """Build a prompt of approximately target_tokens length."""
+    base_text = (
+        "The quick brown fox jumps over the lazy dog. "
+        "In a world of artificial intelligence and machine learning, "
+        "we find ourselves at the intersection of technology and creativity. "
+        "The possibilities are endless when we consider the potential applications. "
+    )
+    # Repeat to exceed target, then truncate
+    repeated = base_text * (target_tokens // 10 + 10)
+    ids = tokenizer.encode(repeated, add_special_tokens=False)[:target_tokens]
+    return tokenizer.decode(ids)
+
+
+def benchmark_generate(model, tokenizer, prompt, max_new_tokens, device, batch_size=1):
+    """Run a single generation and measure E2E time.
+
+    Uses model.language_model.generate() to bypass the VLM wrapper's
+    img_context_token_id assertion, since we're benchmarking text-only
+    generation (matching the Neuron TKG benchmark methodology).
+
+    Returns dict with e2e_s, num_output_tokens, num_input_tokens, batch_size.
+    """
+    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to(
+        device
+    )
+    input_len = inputs["input_ids"].shape[1]
+
+    # Use language_model directly to avoid VLM wrapper assertion
+    gen_model = getattr(model, "language_model", model)
+
+    torch.cuda.synchronize()
+    t_start = time.perf_counter()
+
+    with torch.no_grad():
+        outputs = gen_model.generate(
+            **inputs,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+            use_cache=True,
+        )
+
+    torch.cuda.synchronize()
+    t_end = time.perf_counter()
+
+    total_time = t_end - t_start
+    num_output_tokens = outputs.shape[1] - input_len
+
+    return {
+        "e2e_s": total_time,
+        "num_output_tokens": int(num_output_tokens),
+        "num_input_tokens": int(input_len),
+        "batch_size": batch_size,
+    }
+
+
+def measure_ttft(model, tokenizer, prompt, device, batch_size=1):
+    """Measure TTFT by generating exactly 1 token.
+
+    Uses model.language_model.generate() for text-only benchmarking.
+    """
+    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to(
+        device
+    )
+
+    # Use language_model directly to avoid VLM wrapper assertion
+    gen_model = getattr(model, "language_model", model)
+
+    torch.cuda.synchronize()
+    t_start = time.perf_counter()
+
+    with torch.no_grad():
+        gen_model.generate(
+            **inputs,
+            max_new_tokens=1,
+            do_sample=False,
+            use_cache=True,
+        )
+
+    torch.cuda.synchronize()
+    t_end = time.perf_counter()
+
+    return (t_end - t_start) * 1000  # ms
+
+
+def run_workload(
+    model,
+    tokenizer,
+    prompt,
+    max_new_tokens,
+    device,
+    batch_size,
+    num_warmup=5,
+    num_runs=10,
+):
+    """Run a complete workload with warmup and measurement."""
+    print(f"    Warmup ({num_warmup} runs)...")
+    for _ in range(num_warmup):
+        benchmark_generate(model, tokenizer, prompt, max_new_tokens, device, batch_size)
+
+    # Measure TTFT separately
+    print(f"    Measuring TTFT ({num_runs} runs)...")
+    ttft_values = []
+    for _ in range(num_runs):
+        ttft = measure_ttft(model, tokenizer, prompt, device, batch_size)
+        ttft_values.append(ttft)
+
+    # Measure full generation
+    print(f"    Measuring generation ({num_runs} runs)...")
+    results = []
+    for _ in range(num_runs):
+        r = benchmark_generate(
+            model, tokenizer, prompt, max_new_tokens, device, batch_size
+        )
+        results.append(r)
+
+    # Compute derived metrics
+    ttft_stats = percentiles(ttft_values)
+
+    e2e_values = [r["e2e_s"] * 1000 for r in results]  # ms
+    e2e_stats = percentiles(e2e_values)
+
+    # output tok/s = (num_output_tokens - 1) / (e2e - ttft)
+    # Using median TTFT for TPOT calculation
+    median_ttft_s = ttft_stats["p50"] / 1000
+    output_tps_values = []
+    for r in results:
+        decode_time = r["e2e_s"] - median_ttft_s
+        if decode_time > 0 and r["num_output_tokens"] > 1:
+            tps = (r["num_output_tokens"] - 1) / decode_time
+            output_tps_values.append(tps)
+    output_tps_stats = (
+        percentiles(output_tps_values)
+        if output_tps_values
+        else {"p50": 0, "p95": 0, "p99": 0}
+    )
+
+    # Total throughput (batch * per-request tok/s)
+    total_tps = output_tps_stats["p50"] * batch_size
+
+    peak_mem = get_gpu_memory_peak_mb()
+
+    return {
+        "ttft_ms": ttft_stats,
+        "output_tps": output_tps_stats,
+        "total_tps": total_tps,
+        "e2e_ms": e2e_stats,
+        "num_output_tokens": int(np.median([r["num_output_tokens"] for r in results])),
+        "num_input_tokens": results[0]["num_input_tokens"],
+        "batch_size": batch_size,
+        "peak_gpu_mem_mb": peak_mem,
+        "ttft_raw": ttft_values,
+        "e2e_raw": e2e_values,
+        "output_tps_raw": output_tps_values,
+    }
+
+
+WORKLOADS = {
+    "short-short": (128, 128),
+    "short-long": (128, 512),
+    "long-short": (2048, 128),
+    "long-long": (2048, 512),
+}
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="GPU benchmark for InternVL3-8B-Instruct"
+    )
+    parser.add_argument(
+        "--model",
+        default="/home/ubuntu/models/InternVL3-8B-Instruct",
+        help="Model path",
+    )
+    parser.add_argument(
+        "--batch-sizes",
+        nargs="+",
+        type=int,
+        default=[1, 2, 4, 8],
+        help="Batch sizes to test",
+    )
+    parser.add_argument(
+        "--workloads",
+        nargs="+",
+        default=["short-short", "short-long", "long-short", "long-long"],
+        help="Workload IDs to test",
+    )
+    parser.add_argument("--num-warmup", type=int, default=5, help="Warmup runs")
+    parser.add_argument("--num-runs", type=int, default=10, help="Measurement runs")
+    parser.add_argument("--compile", action="store_true", help="Use torch.compile")
+    parser.add_argument(
+        "--flash-attn", action="store_true", help="Use flash_attention_2"
+    )
+    parser.add_argument(
+        "--llm-only",
+        action="store_true",
+        help="Load Qwen2ForCausalLM directly with SDPA (bypasses VLM wrapper eager attention)",
+    )
+    parser.add_argument("--dtype", choices=["bfloat16", "float16"], default="bfloat16")
+    parser.add_argument(
+        "--output", default="gpu_benchmark_results.json", help="Output JSON file"
+    )
+    args = parser.parse_args()
+
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    dtype = torch.bfloat16 if args.dtype == "bfloat16" else torch.float16
+
+    print(f"=== GPU Benchmark: InternVL3-8B-Instruct ===")
+    print(f"Device: {device}")
+    if torch.cuda.is_available():
+        print(f"GPU: {torch.cuda.get_device_name(0)}")
+        print(
+            f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB"
+        )
+    print(f"dtype: {args.dtype}")
+    print(f"torch.compile: {args.compile}")
+    print(f"flash_attention_2: {args.flash_attn}")
+    print(f"LLM-only (SDPA): {args.llm_only}")
+    print(f"Batch sizes: {args.batch_sizes}")
+    print(f"Workloads: {args.workloads}")
+    print()
+
+    # Load model
+    print("Loading tokenizer...")
+    tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    print("Loading model...")
+    t_load_start = time.perf_counter()
+    attn_impl = "flash_attention_2" if args.flash_attn else "sdpa"
+
+    if args.llm_only:
+        # Load Qwen2ForCausalLM directly with specified attn_implementation.
+        # This avoids the InternVLChatModel wrapper which forces eager attention
+        # and doesn't support attn_implementation parameter.
+        from transformers import Qwen2ForCausalLM, AutoConfig
+        from safetensors.torch import load_file
+        import glob
+
+        config = AutoConfig.from_pretrained(args.model, trust_remote_code=True)
+        llm_config = config.llm_config
+        llm_config._attn_implementation = attn_impl
+
+        # Create model on meta device (zero memory), load weights to GPU,
+        # then materialize buffers (rotary_emb.inv_freq) on GPU.
+        with torch.device("meta"):
+            model = Qwen2ForCausalLM(llm_config)
+
+        shard_files = sorted(glob.glob(os.path.join(args.model, "model*.safetensors")))
+        prefix = "language_model."
+        for sf in shard_files:
+            shard = load_file(sf, device="cuda:0")
+            for k, v in shard.items():
+                if k.startswith(prefix):
+                    param_name = k[len(prefix) :]
+                    parts = param_name.split(".")
+                    obj = model
+                    for p in parts[:-1]:
+                        obj = getattr(obj, p)
+                    setattr(
+                        obj,
+                        parts[-1],
+                        torch.nn.Parameter(v.to(dtype), requires_grad=False),
+                    )
+            del shard
+            gc.collect()
+            torch.cuda.empty_cache()
+
+        # Materialize rotary embedding buffers (computed from config, not weights)
+        rope = model.model.rotary_emb
+        dim = llm_config.hidden_size // llm_config.num_attention_heads
+        base = getattr(llm_config, "rope_theta", 10000.0)
+        inv_freq = 1.0 / (
+            base ** (torch.arange(0, dim, 2, dtype=torch.float32, device="cuda") / dim)
+        )
+        rope.inv_freq = inv_freq
+        if hasattr(rope, "original_inv_freq"):
+            rope.original_inv_freq = inv_freq.clone()
+    else:
+        # Load full VLM (uses eager attention by default)
+        model = AutoModelForCausalLM.from_pretrained(
+            args.model,
+            dtype=dtype,
+            device_map="auto",
+            trust_remote_code=True,
+        )
+    model.eval()
+    t_load = time.perf_counter() - t_load_start
+    print(f"Model loaded in {t_load:.1f}s")
+    print(f"GPU memory after load: {get_gpu_memory_mb():.0f} MB")
+
+    if args.compile:
+        print("Compiling model with torch.compile...")
+        t_compile_start = time.perf_counter()
+        model = torch.compile(model, mode="reduce-overhead")
+        # Trigger compilation with a small forward pass
+        dummy = tokenizer("Hello", return_tensors="pt").to(device)
+        with torch.no_grad():
+            model.generate(**dummy, max_new_tokens=2, do_sample=False)
+        t_compile = time.perf_counter() - t_compile_start
+        print(f"Compilation done in {t_compile:.1f}s")
+
+    # Build prompts for each workload
+    prompts = {}
+    for wl_id in args.workloads:
+        input_tokens, _ = WORKLOADS[wl_id]
+        prompts[wl_id] = build_prompt(tokenizer, input_tokens)
+
+    # Run benchmarks
+    all_results = {
+        "metadata": {
+            "model": args.model,
+            "gpu": torch.cuda.get_device_name(0)
+            if torch.cuda.is_available()
+            else "N/A",
+            "vram_gb": torch.cuda.get_device_properties(0).total_memory / 1024**3
+            if torch.cuda.is_available()
+            else 0,
+            "dtype": args.dtype,
+            "torch_compile": args.compile,
+            "flash_attention_2": args.flash_attn,
+            "llm_only": args.llm_only,
+            "attn_implementation": attn_impl,
+            "load_time_s": t_load,
+            "torch_version": torch.__version__,
+            "cuda_version": torch.version.cuda if torch.cuda.is_available() else "N/A",
+        },
+        "results": {},
+    }
+
+    for wl_id in args.workloads:
+        input_tokens, output_tokens = WORKLOADS[wl_id]
+        print(f"\n--- Workload: {wl_id} (in={input_tokens}, out={output_tokens}) ---")
+
+        for bs in args.batch_sizes:
+            print(f"  Batch size: {bs}")
+            torch.cuda.reset_peak_memory_stats()
+
+            try:
+                result = run_workload(
+                    model,
+                    tokenizer,
+                    prompts[wl_id],
+                    output_tokens,
+                    device,
+                    bs,
+                    args.num_warmup,
+                    args.num_runs,
+                )
+
+                key = f"{wl_id}_bs{bs}"
+                all_results["results"][key] = result
+
+                print(f"    TTFT P50: {result['ttft_ms']['p50']:.1f} ms")
+                print(f"    Output tok/s P50: {result['output_tps']['p50']:.1f}")
+                print(f"    Total tok/s: {result['total_tps']:.1f}")
+                print(f"    E2E P50: {result['e2e_ms']['p50']:.1f} ms")
+                print(f"    Peak GPU: {result['peak_gpu_mem_mb']:.0f} MB")
+
+            except torch.cuda.OutOfMemoryError:
+                print(
+                    f"    OOM at batch_size={bs}, skipping larger batches for this workload"
+                )
+                key = f"{wl_id}_bs{bs}"
+                all_results["results"][key] = {"error": "OOM"}
+                gc.collect()
+                torch.cuda.empty_cache()
+                break
+            except Exception as e:
+                print(f"    Error: {e}")
+                key = f"{wl_id}_bs{bs}"
+                all_results["results"][key] = {"error": str(e)}
+
+    # Save results
+    # Remove raw arrays for clean JSON (keep percentiles)
+    for key, val in all_results["results"].items():
+        if isinstance(val, dict):
+            val.pop("ttft_raw", None)
+            val.pop("e2e_raw", None)
+            val.pop("output_tps_raw", None)
+
+    with open(args.output, "w") as f:
+        json.dump(all_results, f, indent=2)
+    print(f"\nResults saved to {args.output}")
+
+    # Print summary table
+    print("\n=== SUMMARY ===")
+    print(
+        f"{'Workload':<15} {'BS':>3} {'TTFT P50':>10} {'tok/s P50':>10} {'Total tok/s':>12} {'E2E P50':>10} {'Peak MB':>8}"
+    )
+    print("-" * 75)
+    for key, val in all_results["results"].items():
+        if "error" in val:
+            print(f"{key:<15} {'':>3} {'ERROR':>10}")
+            continue
+        wl = key.rsplit("_bs", 1)[0]
+        bs = val["batch_size"]
+        print(
+            f"{wl:<15} {bs:>3} {val['ttft_ms']['p50']:>9.1f}ms {val['output_tps']['p50']:>9.1f} {val['total_tps']:>11.1f} {val['e2e_ms']['p50']:>9.1f}ms {val['peak_gpu_mem_mb']:>7.0f}"
+        )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/InternVL3-8B-Instruct/gpu_eager_results.json b/contrib/models/InternVL3-8B-Instruct/gpu_eager_results.json
new file mode 100644
index 00000000..04685720
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/gpu_eager_results.json
@@ -0,0 +1,368 @@
+{
+  "metadata": {
+    "model": "/home/ubuntu/models/InternVL3-8B-Instruct",
+    "gpu": "NVIDIA L40S",
+    "vram_gb": 44.39215087890625,
+    "dtype": "bfloat16",
+    "torch_compile": false,
+    "flash_attention_2": false,
+    "attn_implementation": "eager",
+    "load_time_s": 2.7835791559999734,
+    "torch_version": "2.11.0+cu130",
+    "cuda_version": "13.0"
+  },
+  "results": {
+    "short-short_bs1": {
+      "ttft_ms": {
+        "p50": 32.354057999953056,
+        "p95": 33.15108340003121,
+        "p99": 33.32482948002621
+      },
+      "output_tps": {
+        "p50": 35.77090147724594,
+        "p95": 35.82600723018595,
+        "p99": 35.83171407477799
+      },
+      "total_tps": 35.77090147724594,
+      "e2e_ms": {
+        "p50": 3582.725865000043,
+        "p95": 3590.407341399998,
+        "p99": 3591.927379479987
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 15187.84716796875
+    },
+    "short-short_bs2": {
+      "ttft_ms": {
+        "p50": 36.637766500007274,
+        "p95": 36.82605119992104,
+        "p99": 36.908319839928936
+      },
+      "output_tps": {
+        "p50": 34.5048784058082,
+        "p95": 34.542452965086376,
+        "p99": 34.544448535694606
+      },
+      "total_tps": 69.0097568116164,
+      "e2e_ms": {
+        "p50": 3717.2767779999845,
+        "p95": 3725.8738193000227,
+        "p99": 3726.8450942599975
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 15213.19482421875
+    },
+    "short-short_bs4": {
+      "ttft_ms": {
+        "p50": 49.34426150003901,
+        "p95": 49.5032485500019,
+        "p99": 49.514977709970935
+      },
+      "output_tps": {
+        "p50": 34.43159631420792,
+        "p95": 34.55469835232628,
+        "p99": 34.566173877666714
+      },
+      "total_tps": 137.7263852568317,
+      "e2e_ms": {
+        "p50": 3737.816877,
+        "p95": 3746.363267700036,
+        "p99": 3750.962503140014
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 15263.89013671875
+    },
+    "short-short_bs8": {
+      "ttft_ms": {
+        "p50": 85.9266380000463,
+        "p95": 86.55312020001702,
+        "p99": 86.62800883998216
+      },
+      "output_tps": {
+        "p50": 34.63620634453231,
+        "p95": 34.716941318185604,
+        "p99": 34.722339530463024
+      },
+      "total_tps": 277.0896507562585,
+      "e2e_ms": {
+        "p50": 3752.609955999901,
+        "p95": 3758.9112740999326,
+        "p99": 3759.18464441987
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 15368.28076171875
+    },
+    "short-long_bs1": {
+      "ttft_ms": {
+        "p50": 32.18330999993668,
+        "p95": 32.39394204996415,
+        "p99": 32.40653161000182
+      },
+      "output_tps": {
+        "p50": 35.72113986323237,
+        "p95": 35.777438029608504,
+        "p99": 35.785452946858314
+      },
+      "total_tps": 35.72113986323237,
+      "e2e_ms": {
+        "p50": 14337.438011000017,
+        "p95": 14350.85827805009,
+        "p99": 14351.982720410077
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 15206.97998046875
+    },
+    "short-long_bs2": {
+      "ttft_ms": {
+        "p50": 37.005864000093425,
+        "p95": 37.88415374997385,
+        "p99": 38.18396354991819
+      },
+      "output_tps": {
+        "p50": 34.43897257983457,
+        "p95": 34.48666508664127,
+        "p99": 34.49894384968646
+      },
+      "total_tps": 68.87794515966914,
+      "e2e_ms": {
+        "p50": 14874.846941000102,
+        "p95": 14905.762557699903,
+        "p99": 14908.412011539836
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 15255.763671875
+    },
+    "short-long_bs4": {
+      "ttft_ms": {
+        "p50": 51.75586450002356,
+        "p95": 52.29615260000173,
+        "p99": 52.350610520015834
+      },
+      "output_tps": {
+        "p50": 34.47769264465441,
+        "p95": 34.500958325467394,
+        "p99": 34.50363864660217
+      },
+      "total_tps": 137.91077057861764,
+      "e2e_ms": {
+        "p50": 14872.933870499992,
+        "p95": 14896.790627600114,
+        "p99": 14903.012756720065
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 15341.27783203125
+    },
+    "short-long_bs8": {
+      "ttft_ms": {
+        "p50": 89.50201800007562,
+        "p95": 90.39046830010875,
+        "p99": 90.51499446015214
+      },
+      "output_tps": {
+        "p50": 34.42872006236655,
+        "p95": 34.66731227310019,
+        "p99": 34.706034630149354
+      },
+      "total_tps": 275.4297604989324,
+      "e2e_ms": {
+        "p50": 14931.762931499861,
+        "p95": 14944.882797649983,
+        "p99": 14945.006321930065
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 15518.93603515625
+    },
+    "long-short_bs1": {
+      "ttft_ms": {
+        "p50": 303.1730725000443,
+        "p95": 303.75679499994703,
+        "p99": 303.7627061998887
+      },
+      "output_tps": {
+        "p50": 35.607207299133826,
+        "p95": 35.66252668818413,
+        "p99": 35.672457147256985
+      },
+      "total_tps": 35.607207299133826,
+      "e2e_ms": {
+        "p50": 3869.8667569999543,
+        "p95": 3879.8164300499707,
+        "p99": 3880.5298892099586
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 16487.56201171875
+    },
+    "long-short_bs2": {
+      "ttft_ms": {
+        "p50": 613.3764324997628,
+        "p95": 613.7428632499223,
+        "p99": 613.854900650008
+      },
+      "output_tps": {
+        "p50": 33.65574251589115,
+        "p95": 34.24936526776137,
+        "p99": 34.256545221728224
+      },
+      "total_tps": 67.3114850317823,
+      "e2e_ms": {
+        "p50": 4387.051700500024,
+        "p95": 4422.550250550103,
+        "p99": 4425.375689310295
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 17812.62451171875
+    },
+    "long-short_bs4": {
+      "ttft_ms": {
+        "p50": 1288.1859030001124,
+        "p95": 1289.7521684999674,
+        "p99": 1289.982536099892
+      },
+      "output_tps": {
+        "p50": 28.178850365447133,
+        "p95": 28.195598401580405,
+        "p99": 28.200826007634085
+      },
+      "total_tps": 112.71540146178853,
+      "e2e_ms": {
+        "p50": 5795.112162499663,
+        "p95": 5801.732641050148,
+        "p99": 5803.493642610229
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 20462.74951171875
+    },
+    "long-short_bs8": {
+      "ttft_ms": {
+        "p50": 2569.476913000017,
+        "p95": 2572.339662199738,
+        "p99": 2572.4010364396236
+      },
+      "output_tps": {
+        "p50": 20.25407951072588,
+        "p95": 20.26397109858702,
+        "p99": 20.266539624513456
+      },
+      "total_tps": 162.03263608580704,
+      "e2e_ms": {
+        "p50": 8839.818671000103,
+        "p95": 8845.999159650022,
+        "p99": 8847.621630330092
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 25762.99951171875
+    },
+    "long-long_bs1": {
+      "ttft_ms": {
+        "p50": 304.96403350002765,
+        "p95": 306.02650549981263,
+        "p99": 306.04908829967826
+      },
+      "output_tps": {
+        "p50": 35.3366046536535,
+        "p95": 36.03219428042833,
+        "p99": 36.0577051806044
+      },
+      "total_tps": 35.3366046536535,
+      "e2e_ms": {
+        "p50": 14766.573178500039,
+        "p95": 15155.286746100068,
+        "p99": 15161.88399161993
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 16487.56201171875
+    },
+    "long-long_bs2": {
+      "ttft_ms": {
+        "p50": 616.9882489998599,
+        "p95": 618.1799187000024,
+        "p99": 618.2847917400568
+      },
+      "output_tps": {
+        "p50": 33.205150951658126,
+        "p95": 33.28598728698499,
+        "p99": 33.286820570797914
+      },
+      "total_tps": 66.41030190331625,
+      "e2e_ms": {
+        "p50": 16006.172720999984,
+        "p95": 16067.58630139991,
+        "p99": 16068.676441879794
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 17812.62451171875
+    },
+    "long-long_bs4": {
+      "ttft_ms": {
+        "p50": 1294.58508599987,
+        "p95": 1295.7390099498298,
+        "p99": 1295.7449099898076
+      },
+      "output_tps": {
+        "p50": 27.131085707109804,
+        "p95": 27.178007632544446,
+        "p99": 27.18076301388413
+      },
+      "total_tps": 108.52434282843922,
+      "e2e_ms": {
+        "p50": 20129.069234499868,
+        "p95": 20146.765666599913,
+        "p99": 20149.247863720102
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 20462.74951171875
+    },
+    "long-long_bs8": {
+      "ttft_ms": {
+        "p50": 2595.168072999968,
+        "p95": 2595.9311825999976,
+        "p99": 2595.9620173200165
+      },
+      "output_tps": {
+        "p50": 19.29254072468826,
+        "p95": 19.29867283844891,
+        "p99": 19.29875400117357
+      },
+      "total_tps": 154.34032579750607,
+      "e2e_ms": {
+        "p50": 29082.0893744999,
+        "p95": 29105.822070200087,
+        "p99": 29114.79696524008
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 25762.99951171875
+    }
+  }
+}
\ No newline at end of file
diff --git a/contrib/models/InternVL3-8B-Instruct/gpu_sdpa_results.json b/contrib/models/InternVL3-8B-Instruct/gpu_sdpa_results.json
new file mode 100644
index 00000000..e368afdb
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/gpu_sdpa_results.json
@@ -0,0 +1,369 @@
+{
+  "metadata": {
+    "model": "/home/ubuntu/models/InternVL3-8B-Instruct",
+    "gpu": "NVIDIA L40S",
+    "vram_gb": 44.39215087890625,
+    "dtype": "bfloat16",
+    "torch_compile": false,
+    "flash_attention_2": false,
+    "llm_only": true,
+    "attn_implementation": "sdpa",
+    "load_time_s": 3.0221697830002086,
+    "torch_version": "2.11.0+cu130",
+    "cuda_version": "13.0"
+  },
+  "results": {
+    "short-short_bs1": {
+      "ttft_ms": {
+        "p50": 30.74279850034145,
+        "p95": 31.48099169957277,
+        "p99": 31.76208473953011
+      },
+      "output_tps": {
+        "p50": 42.30778138863015,
+        "p95": 42.379636837129674,
+        "p99": 42.386728257586434
+      },
+      "total_tps": 42.30778138863015,
+      "e2e_ms": {
+        "p50": 3032.5547570005256,
+        "p95": 3040.9743047998745,
+        "p99": 3044.5176537598854
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 14620.37255859375
+    },
+    "short-short_bs2": {
+      "ttft_ms": {
+        "p50": 36.89017650003734,
+        "p95": 37.75985954985117,
+        "p99": 37.991001509772104
+      },
+      "output_tps": {
+        "p50": 38.85500399432938,
+        "p95": 38.99119990638746,
+        "p99": 38.99583190830265
+      },
+      "total_tps": 77.71000798865876,
+      "e2e_ms": {
+        "p50": 3305.453021500398,
+        "p95": 3312.4970062505326,
+        "p99": 3312.869482050455
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 14647.81396484375
+    },
+    "short-short_bs4": {
+      "ttft_ms": {
+        "p50": 52.477689999705035,
+        "p95": 53.20549865009525,
+        "p99": 53.2738997301567
+      },
+      "output_tps": {
+        "p50": 39.93607824520085,
+        "p95": 40.038246950166084,
+        "p99": 40.03862530090416
+      },
+      "total_tps": 159.7443129808034,
+      "e2e_ms": {
+        "p50": 3232.559605000006,
+        "p95": 3242.2205611002482,
+        "p99": 3243.2934482203837
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 14693.69677734375
+    },
+    "short-short_bs8": {
+      "ttft_ms": {
+        "p50": 94.34561149964793,
+        "p95": 95.62430669930109,
+        "p99": 95.90465813924311
+      },
+      "output_tps": {
+        "p50": 40.38092376775974,
+        "p95": 40.66735596958788,
+        "p99": 40.6812387376469
+      },
+      "total_tps": 323.0473901420779,
+      "e2e_ms": {
+        "p50": 3239.4082569999227,
+        "p95": 3251.798099450207,
+        "p99": 3252.7435670900377
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 128,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 14794.46240234375
+    },
+    "short-long_bs1": {
+      "ttft_ms": {
+        "p50": 30.909600000086357,
+        "p95": 31.91501929986771,
+        "p99": 32.45438785977967
+      },
+      "output_tps": {
+        "p50": 41.619623214111066,
+        "p95": 42.26447092515339,
+        "p99": 42.29319360793472
+      },
+      "total_tps": 41.619623214111066,
+      "e2e_ms": {
+        "p50": 12308.809939499952,
+        "p95": 12375.742051450607,
+        "p99": 12377.803886290776
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 14632.34765625
+    },
+    "short-long_bs2": {
+      "ttft_ms": {
+        "p50": 36.9804510000904,
+        "p95": 37.11390899984508,
+        "p99": 37.11395219971564
+      },
+      "output_tps": {
+        "p50": 40.47054818555414,
+        "p95": 40.57791756183197,
+        "p99": 40.5790470427784
+      },
+      "total_tps": 80.94109637110827,
+      "e2e_ms": {
+        "p50": 12663.446808499884,
+        "p95": 12837.0766125498,
+        "p99": 12840.898869710118
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 14682.888671875
+    },
+    "short-long_bs4": {
+      "ttft_ms": {
+        "p50": 54.544400999930076,
+        "p95": 55.010928550018434,
+        "p99": 55.030793709647696
+      },
+      "output_tps": {
+        "p50": 38.93960190731574,
+        "p95": 38.988901332053985,
+        "p99": 38.99245982780393
+      },
+      "total_tps": 155.75840762926296,
+      "e2e_ms": {
+        "p50": 13177.43278350008,
+        "p95": 13214.839803450468,
+        "p99": 13227.184851090515
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 14743.64599609375
+    },
+    "short-long_bs8": {
+      "ttft_ms": {
+        "p50": 95.25800050005273,
+        "p95": 96.70880335024776,
+        "p99": 97.0619550702213
+      },
+      "output_tps": {
+        "p50": 38.84570474053821,
+        "p95": 39.06611881953545,
+        "p99": 39.08596705508407
+      },
+      "total_tps": 310.7656379243057,
+      "e2e_ms": {
+        "p50": 13249.875117000101,
+        "p95": 13316.404957699979,
+        "p99": 13330.64420834
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 128,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 14888.17529296875
+    },
+    "long-short_bs1": {
+      "ttft_ms": {
+        "p50": 153.5174064997591,
+        "p95": 154.7932400996615,
+        "p99": 154.80517121954108
+      },
+      "output_tps": {
+        "p50": 40.53025139830835,
+        "p95": 40.58484314562401,
+        "p99": 40.590673978318726
+      },
+      "total_tps": 40.53025139830835,
+      "e2e_ms": {
+        "p50": 3286.9802619998154,
+        "p95": 3300.1087782500235,
+        "p99": 3302.5685340499967
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 14992.86865234375
+    },
+    "long-short_bs2": {
+      "ttft_ms": {
+        "p50": 302.34673349968944,
+        "p95": 303.66068200032714,
+        "p99": 303.66209320013695
+      },
+      "output_tps": {
+        "p50": 39.728571284136216,
+        "p95": 39.78733084601901,
+        "p99": 39.7913037090361
+      },
+      "total_tps": 79.45714256827243,
+      "e2e_ms": {
+        "p50": 3499.0388960004566,
+        "p95": 3507.1851427002002,
+        "p99": 3507.734680539961
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 15378.43115234375
+    },
+    "long-short_bs4": {
+      "ttft_ms": {
+        "p50": 650.9154179998404,
+        "p95": 656.7162365998684,
+        "p99": 657.8119801202502
+      },
+      "output_tps": {
+        "p50": 38.51483203549644,
+        "p95": 38.558916731545736,
+        "p99": 38.56685707019547
+      },
+      "total_tps": 154.05932814198576,
+      "e2e_ms": {
+        "p50": 3948.346405999928,
+        "p95": 3952.6882943496275,
+        "p99": 3953.0145756695674
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 16161.55615234375
+    },
+    "long-short_bs8": {
+      "ttft_ms": {
+        "p50": 1320.8750485000564,
+        "p95": 1339.1258676504094,
+        "p99": 1340.368192730466
+      },
+      "output_tps": {
+        "p50": 36.06295936198886,
+        "p95": 36.14949760831809,
+        "p99": 36.15333958816673
+      },
+      "total_tps": 288.5036748959109,
+      "e2e_ms": {
+        "p50": 4842.49440900021,
+        "p95": 4848.218032050045,
+        "p99": 4848.652426410317
+      },
+      "num_output_tokens": 128,
+      "num_input_tokens": 2048,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 17724.80615234375
+    },
+    "long-long_bs1": {
+      "ttft_ms": {
+        "p50": 160.33866150019094,
+        "p95": 166.07691539979896,
+        "p99": 168.71254307975505
+      },
+      "output_tps": {
+        "p50": 42.23309606359,
+        "p95": 42.30755131546222,
+        "p99": 42.30923809368367
+      },
+      "total_tps": 42.23309606359,
+      "e2e_ms": {
+        "p50": 12259.857311499673,
+        "p95": 12473.311311049429,
+        "p99": 12478.515446209485
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 1,
+      "peak_gpu_mem_mb": 14992.86865234375
+    },
+    "long-long_bs2": {
+      "ttft_ms": {
+        "p50": 350.13424400040094,
+        "p95": 363.21154429965645,
+        "p99": 364.2686288592722
+      },
+      "output_tps": {
+        "p50": 39.66795834426845,
+        "p95": 39.76727403069557,
+        "p99": 39.77517209438742
+      },
+      "total_tps": 79.3359166885369,
+      "e2e_ms": {
+        "p50": 13232.075230000191,
+        "p95": 13285.502536349895,
+        "p99": 13298.716810470114
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 2,
+      "peak_gpu_mem_mb": 15378.43115234375
+    },
+    "long-long_bs4": {
+      "ttft_ms": {
+        "p50": 667.7430545000789,
+        "p95": 678.1529508004041,
+        "p99": 681.4914525602308
+      },
+      "output_tps": {
+        "p50": 38.25393678850787,
+        "p95": 38.28818350222392,
+        "p99": 38.290061948362975
+      },
+      "total_tps": 153.0157471540315,
+      "e2e_ms": {
+        "p50": 14025.845337499959,
+        "p95": 14048.56425299954,
+        "p99": 14051.948389799372
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 4,
+      "peak_gpu_mem_mb": 16161.55615234375
+    },
+    "long-long_bs8": {
+      "ttft_ms": {
+        "p50": 1317.572557499716,
+        "p95": 1323.4350110004925,
+        "p99": 1323.876147800347
+      },
+      "output_tps": {
+        "p50": 35.87654712657549,
+        "p95": 36.01936773697751,
+        "p99": 36.02395023602966
+      },
+      "total_tps": 287.0123770126039,
+      "e2e_ms": {
+        "p50": 15560.860764499921,
+        "p95": 15619.46175729986,
+        "p99": 15624.53301465971
+      },
+      "num_output_tokens": 512,
+      "num_input_tokens": 2048,
+      "batch_size": 8,
+      "peak_gpu_mem_mb": 17724.80615234375
+    }
+  }
+}
\ No newline at end of file
diff --git a/contrib/models/InternVL3-8B-Instruct/nki_benchmark.py b/contrib/models/InternVL3-8B-Instruct/nki_benchmark.py
new file mode 100644
index 00000000..4baf570c
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/nki_benchmark.py
@@ -0,0 +1,350 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+NKI kernel benchmark for InternVL3-8B-Instruct.
+
+Compiles with opt-in NKI kernels enabled, then benchmarks TTFT and TKG
+throughput. Compares against baseline (non-NKI) numbers.
+
+NKI kernels enabled:
+  - qkv_kernel_enabled: Fused QKV projection with NKI RMSNorm (CTE + TKG)
+  - mlp_kernel_enabled: Fused MLP (gate/up/down) kernel with NKI RMSNorm (CTE + TKG)
+
+Note: attn_block_tkg_nki_kernel_enabled triggers NCC_ISTP902 compiler bug in SDK 2.29
+(StaticProfiler error in O-proj all-reduce). Omitted until SDK fix.
+
+Usage:
+    # Full compile + benchmark
+    python nki_benchmark.py
+
+    # Skip compile, just benchmark
+    python nki_benchmark.py --skip-compile
+
+    # Accuracy validation after compile
+    python nki_benchmark.py --accuracy-only --skip-compile
+
+Target: trn2.3xlarge LNC=2 TP=4
+"""
+
+import argparse
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+import torch
+
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_NKI_PATH = "/mnt/models/neuron_models/InternVL3-8B-VLM-NKI/"
+COMPILED_BASELINE_PATH = "/mnt/models/neuron_models/InternVL3-8B-VLM/"
+RESULTS_PATH = "/mnt/models/nki_benchmark_results.json"
+
+# Baseline numbers (from Task 015 scale_test at seq_len=2048 batch_size=1)
+BASELINE = {
+    "ttft_ms": 138,
+    "tkg_tok_s": 75.1,
+}
+
+
+def create_nki_config(seq_len=2048, batch_size=1):
+    """Create config with NKI kernels enabled for the text backbone."""
+    text_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=batch_size,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+        fused_qkv=True,  # Required for NKI QKV kernel
+        # NKI kernel flags
+        qkv_kernel_enabled=True,  # Fused QKV projection with NKI RMSNorm
+        mlp_kernel_enabled=True,  # Fused MLP with NKI RMSNorm
+        # NOTE: attn_block_tkg_nki_kernel_enabled=True triggers NCC_ISTP902
+        # compiler bug in SDK 2.29 (StaticProfiler error in O-proj all-reduce).
+        # Omitted until SDK fix is available.
+    )
+
+    # Vision NeuronConfig: same as baseline (no NKI for vision encoder)
+    vision_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,
+        save_sharded_checkpoint=True,
+    )
+
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+
+    return config
+
+
+def compile_nki(config, compiled_path):
+    """Compile model with NKI kernels."""
+    print("=" * 60)
+    print("COMPILING WITH NKI KERNELS")
+    print("=" * 60)
+    nc = config.text_config.neuron_config
+    print(f"  qkv_kernel_enabled:                   {nc.qkv_kernel_enabled}")
+    print(f"  qkv_nki_kernel_enabled:               {nc.qkv_nki_kernel_enabled}")
+    print(
+        f"  attn_block_tkg_nki_kernel_enabled:     {nc.attn_block_tkg_nki_kernel_enabled}"
+    )
+    print(f"  mlp_kernel_enabled:                    {nc.mlp_kernel_enabled}")
+    print(f"  fused_qkv:                             {nc.fused_qkv}")
+    print(f"  seq_len: {nc.seq_len}, batch_size: {nc.max_batch_size}")
+    print(f"  Output: {compiled_path}")
+
+    model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+
+    start = time.time()
+    model.compile(compiled_path)
+    compile_time = time.time() - start
+    print(
+        f"\nCompilation completed in {compile_time:.1f}s ({compile_time / 60:.1f} min)"
+    )
+
+    return model, compile_time
+
+
+def load_model(config, compiled_path):
+    """Load already-compiled model."""
+    print(f"\nLoading compiled model from {compiled_path}")
+    model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+    start = time.time()
+    model.load(compiled_path)
+    load_time = time.time() - start
+    print(f"Loaded in {load_time:.1f}s")
+    return model
+
+
+def measure_ttft(model, tokenizer, n_runs=5):
+    """Measure TTFT for text-only CTE."""
+    messages = [{"role": "user", "content": "What is the capital of France?"}]
+    templated = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    inputs = tokenizer(templated, return_tensors="pt")
+    input_ids = inputs.input_ids
+
+    seq_len = input_ids.shape[-1]
+    position_ids = torch.arange(seq_len, dtype=torch.int32).unsqueeze(0)
+    seq_ids = torch.zeros(1, dtype=torch.int32)
+
+    # Warmup
+    model.reset()
+    with torch.no_grad():
+        model(input_ids=input_ids, position_ids=position_ids, seq_ids=seq_ids)
+
+    # Measure
+    ttft_times = []
+    for _ in range(n_runs):
+        model.reset()
+        start = time.perf_counter()
+        with torch.no_grad():
+            model(input_ids=input_ids, position_ids=position_ids, seq_ids=seq_ids)
+        ttft_times.append(time.perf_counter() - start)
+
+    avg = sum(ttft_times) / len(ttft_times)
+    return avg * 1000  # ms
+
+
+def measure_tkg(model, tokenizer, num_tokens=64, n_runs=3):
+    """Measure TKG throughput using HuggingFaceGenerationAdapter."""
+    adapter = HuggingFaceGenerationAdapter(model)
+
+    messages = [
+        {"role": "user", "content": "Tell me a detailed story about a brave knight."}
+    ]
+    templated = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    inputs = tokenizer(templated, return_tensors="pt")
+    input_ids = inputs.input_ids
+    attention_mask = torch.ones_like(input_ids)
+
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+
+    # Warmup
+    with torch.no_grad():
+        adapter.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=num_tokens,
+            do_sample=False,
+            eos_token_id=eos_token_id,
+        )
+
+    # Measure
+    tkg_results = []
+    for _ in range(n_runs):
+        start = time.perf_counter()
+        with torch.no_grad():
+            output_ids = adapter.generate(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                max_new_tokens=num_tokens,
+                do_sample=False,
+                eos_token_id=eos_token_id,
+            )
+        total_time = time.perf_counter() - start
+        generated = output_ids.shape[-1] - input_ids.shape[-1]
+        tkg_results.append((generated, total_time))
+
+    # Average tokens/sec (excluding TTFT — approximate by subtracting first-token overhead)
+    avg_tokens = sum(r[0] for r in tkg_results) / len(tkg_results)
+    avg_time = sum(r[1] for r in tkg_results) / len(tkg_results)
+    tok_s = avg_tokens / avg_time if avg_time > 0 else 0.0
+
+    return tok_s, avg_tokens, avg_time
+
+
+def run_accuracy_check(adapter, tokenizer):
+    """Quick accuracy sanity check."""
+    prompts = [
+        ("france", "What is the capital of France?"),
+        ("math", "What is 2 + 2? Answer with just the number:"),
+    ]
+
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+    results = {}
+
+    for name, prompt_text in prompts:
+        messages = [{"role": "user", "content": prompt_text}]
+        templated = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        inputs = tokenizer(templated, return_tensors="pt")
+        input_ids = inputs.input_ids
+        attention_mask = torch.ones_like(input_ids)
+
+        with torch.no_grad():
+            output_ids = adapter.generate(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                max_new_tokens=32,
+                do_sample=False,
+                eos_token_id=eos_token_id,
+            )
+
+        generated_ids = output_ids[0, input_ids.shape[-1] :].tolist()
+        text = tokenizer.decode(generated_ids, skip_special_tokens=True)
+        results[name] = text
+        print(f"  {name}: {text!r}")
+
+    # Basic sanity
+    france_ok = "paris" in results.get("france", "").lower()
+    math_ok = "4" in results.get("math", "")
+    return france_ok and math_ok
+
+
+def main():
+    parser = argparse.ArgumentParser(description="NKI kernel benchmark")
+    parser.add_argument("--skip-compile", action="store_true", help="Skip compilation")
+    parser.add_argument(
+        "--accuracy-only", action="store_true", help="Only run accuracy check"
+    )
+    parser.add_argument("--seq-len", type=int, default=2048, help="Sequence length")
+    parser.add_argument("--batch-size", type=int, default=1, help="Batch size")
+    args = parser.parse_args()
+
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    config = create_nki_config(seq_len=args.seq_len, batch_size=args.batch_size)
+    compile_time = None
+
+    if not args.skip_compile:
+        model, compile_time = compile_nki(config, COMPILED_NKI_PATH)
+        # After compile(), model already has weights loaded in memory.
+        # Call load() to load the compiled NEFFs onto Neuron devices
+        # and save sharded weight checkpoints for future --skip-compile runs.
+        model.load(COMPILED_NKI_PATH)
+    else:
+        model = load_model(config, COMPILED_NKI_PATH)
+
+    # Accuracy check
+    print("\n--- Accuracy Sanity Check ---")
+    adapter = HuggingFaceGenerationAdapter(model)
+    accuracy_ok = run_accuracy_check(adapter, tokenizer)
+    print(f"Accuracy: {'PASS' if accuracy_ok else 'FAIL'}")
+
+    if args.accuracy_only:
+        return
+
+    # TTFT benchmark
+    print("\n--- TTFT Benchmark ---")
+    ttft_ms = measure_ttft(model, tokenizer, n_runs=5)
+    ttft_delta = ((ttft_ms - BASELINE["ttft_ms"]) / BASELINE["ttft_ms"]) * 100
+    print(
+        f"  TTFT:     {ttft_ms:.1f} ms (baseline: {BASELINE['ttft_ms']} ms, delta: {ttft_delta:+.1f}%)"
+    )
+
+    # TKG benchmark
+    print("\n--- TKG Throughput Benchmark ---")
+    tok_s, avg_tokens, avg_time = measure_tkg(model, tokenizer, num_tokens=64, n_runs=3)
+    tkg_delta = ((tok_s - BASELINE["tkg_tok_s"]) / BASELINE["tkg_tok_s"]) * 100
+    print(
+        f"  TKG:      {tok_s:.1f} tok/s (baseline: {BASELINE['tkg_tok_s']} tok/s, delta: {tkg_delta:+.1f}%)"
+    )
+    print(f"  Avg tokens generated: {avg_tokens:.0f}")
+    print(f"  Avg total time: {avg_time:.3f}s")
+
+    # Summary
+    print("\n" + "=" * 60)
+    print("NKI BENCHMARK SUMMARY")
+    print("=" * 60)
+    print(f"  Config:       seq_len={args.seq_len}, batch_size={args.batch_size}")
+    print(f"  NKI kernels:  qkv + mlp (fused RMSNorm)")
+    print(f"  Accuracy:     {'PASS' if accuracy_ok else 'FAIL'}")
+    print(
+        f"  TTFT:         {ttft_ms:.1f} ms  (baseline: {BASELINE['ttft_ms']} ms, {ttft_delta:+.1f}%)"
+    )
+    print(
+        f"  TKG:          {tok_s:.1f} tok/s (baseline: {BASELINE['tkg_tok_s']} tok/s, {tkg_delta:+.1f}%)"
+    )
+    if compile_time:
+        print(f"  Compile time: {compile_time:.1f}s ({compile_time / 60:.1f} min)")
+
+    # Save results
+    results = {
+        "config": {
+            "seq_len": args.seq_len,
+            "batch_size": args.batch_size,
+            "nki_kernels": [
+                "qkv_kernel_enabled",
+                "mlp_kernel_enabled",
+            ],
+        },
+        "accuracy": "PASS" if accuracy_ok else "FAIL",
+        "ttft_ms": round(ttft_ms, 1),
+        "tkg_tok_s": round(tok_s, 1),
+        "ttft_delta_pct": round(ttft_delta, 1),
+        "tkg_delta_pct": round(tkg_delta, 1),
+        "baseline": BASELINE,
+        "compile_time_s": round(compile_time, 1) if compile_time else None,
+    }
+
+    with open(RESULTS_PATH, "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"\nResults saved to {RESULTS_PATH}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/InternVL3-8B-Instruct/scale_test.py b/contrib/models/InternVL3-8B-Instruct/scale_test.py
new file mode 100644
index 00000000..c8fbed36
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/scale_test.py
@@ -0,0 +1,474 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Task 015-016: Sequence length and batch size scaling tests for InternVL3-8B-Instruct.
+
+Tests compilation and inference at various seq_len and batch_size configurations.
+Records compilation time, TTFT, TKG throughput, and HBM usage.
+
+Usage:
+    # Run all scaling tests (compiles each configuration)
+    python scale_test.py
+
+    # Test specific seq_len
+    python scale_test.py --seq-len 4096
+
+    # Test specific batch size at given seq_len
+    python scale_test.py --seq-len 4096 --batch-size 4
+
+    # Skip compilation (use existing compiled models)
+    python scale_test.py --skip-compile --seq-len 2048
+
+Target: trn2.3xlarge LNC=2 TP=4
+"""
+
+import argparse
+import gc
+import json
+import os
+import sys
+import time
+import traceback
+from pathlib import Path
+
+import torch
+
+# Add contrib src to path
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_BASE = "/mnt/models/neuron_models/"
+RESULTS_PATH = "/mnt/models/scale_test_results.json"
+
+# Special token IDs
+IMG_CONTEXT_ID = 151667
+IMG_START_ID = 151665
+IMG_END_ID = 151666
+
+
+def get_compiled_path(seq_len, batch_size):
+    """Get compiled model path for a specific configuration."""
+    return os.path.join(COMPILED_BASE, f"InternVL3-8B-s{seq_len}-b{batch_size}")
+
+
+def create_config(seq_len, batch_size):
+    """Create InternVL3 config for given seq_len and batch_size."""
+    text_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=batch_size,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+    )
+
+    vision_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,
+        save_sharded_checkpoint=True,
+    )
+
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+
+    return config
+
+
+def get_hbm_usage():
+    """Get current HBM usage from neuron-monitor."""
+    try:
+        import subprocess
+
+        result = subprocess.run(
+            [
+                "neuron-monitor",
+                "-c",
+                '{"period":"1s","neuron_runtimes":[{"tag_filter":".*","metrics":[{"type":"memory_info"}]}]}',
+            ],
+            capture_output=True,
+            text=True,
+            timeout=5,
+        )
+        # Simple parsing - just get overall usage
+        if result.returncode == 0:
+            import re
+
+            matches = re.findall(r'"hbm_total_bytes":\s*(\d+)', result.stdout)
+            if matches:
+                return int(matches[0]) / (1024**3)  # GB
+    except Exception:
+        pass
+    return None
+
+
+def measure_ttft(model, tokenizer, prompt, seq_len, batch_size, pixel_values=None):
+    """Measure time to first token (TTFT) for a given prompt."""
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+
+    if pixel_values is not None:
+        # Build multimodal input
+        text_ids = input_ids[0]
+        img_tokens = torch.full((256,), IMG_CONTEXT_ID, dtype=torch.long)
+        input_ids = torch.cat(
+            [
+                text_ids,
+                torch.tensor([IMG_START_ID]),
+                img_tokens,
+                torch.tensor([IMG_END_ID]),
+            ]
+        ).unsqueeze(0)
+
+    # Repeat for batch
+    if batch_size > 1:
+        input_ids = input_ids.repeat(batch_size, 1)
+
+    actual_seq_len = input_ids.shape[-1]
+    position_ids = torch.arange(actual_seq_len, dtype=torch.int32).unsqueeze(0)
+    if batch_size > 1:
+        position_ids = position_ids.repeat(batch_size, 1)
+    seq_ids = torch.arange(batch_size, dtype=torch.int32)
+
+    # Warmup
+    model.reset()
+    with torch.no_grad():
+        model(
+            input_ids=input_ids,
+            position_ids=position_ids,
+            seq_ids=seq_ids,
+            pixel_values=pixel_values,
+        )
+
+    # Measure TTFT (average of 3 runs)
+    ttft_times = []
+    for _ in range(3):
+        model.reset()
+        start = time.perf_counter()
+        with torch.no_grad():
+            outputs = model(
+                input_ids=input_ids,
+                position_ids=position_ids,
+                seq_ids=seq_ids,
+                pixel_values=pixel_values,
+            )
+        ttft = time.perf_counter() - start
+        ttft_times.append(ttft)
+
+    avg_ttft = sum(ttft_times) / len(ttft_times)
+    return avg_ttft, actual_seq_len
+
+
+def measure_tkg_throughput(model, tokenizer, prompt, num_tokens=32, batch_size=1):
+    """Measure TKG throughput (tokens/sec) for autoregressive generation."""
+    from neuronx_distributed_inference.utils.hf_adapter import (
+        HuggingFaceGenerationAdapter,
+    )
+
+    adapter = HuggingFaceGenerationAdapter(model)
+
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    if batch_size > 1:
+        input_ids = input_ids.repeat(batch_size, 1)
+
+    attention_mask = torch.ones_like(input_ids)
+
+    # Warmup
+    with torch.no_grad():
+        adapter.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=4,
+            do_sample=False,
+        )
+
+    # Measure (average of 3 runs)
+    tkg_times = []
+    for _ in range(3):
+        start = time.perf_counter()
+        with torch.no_grad():
+            out = adapter.generate(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                max_new_tokens=num_tokens,
+                do_sample=False,
+            )
+        elapsed = time.perf_counter() - start
+        tkg_times.append(elapsed)
+
+    avg_time = sum(tkg_times) / len(tkg_times)
+    # Total tokens generated = num_tokens * batch_size
+    total_tokens = num_tokens * batch_size
+    throughput = total_tokens / avg_time
+
+    return throughput, avg_time, num_tokens
+
+
+def run_test(seq_len, batch_size, skip_compile=False):
+    """Run a single scaling test configuration."""
+    compiled_path = get_compiled_path(seq_len, batch_size)
+    result = {
+        "seq_len": seq_len,
+        "batch_size": batch_size,
+        "compiled_path": compiled_path,
+        "status": "pending",
+    }
+
+    print(f"\n{'=' * 60}")
+    print(f"Testing: seq_len={seq_len}, batch_size={batch_size}")
+    print(f"{'=' * 60}")
+
+    try:
+        # Create config
+        config = create_config(seq_len, batch_size)
+        model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+
+        # Compile
+        if not skip_compile:
+            print(f"\nCompiling to {compiled_path}...")
+            start = time.time()
+            model.compile(compiled_path)
+            compile_time = time.time() - start
+            result["compile_time_s"] = round(compile_time, 1)
+            print(f"Compilation: {compile_time:.1f}s ({compile_time / 60:.1f} min)")
+        else:
+            print(f"\nSkipping compile, loading from {compiled_path}")
+            result["compile_time_s"] = "skipped"
+
+        # Load
+        print("Loading compiled model...")
+        start = time.time()
+        model.load(compiled_path)
+        load_time = time.time() - start
+        result["load_time_s"] = round(load_time, 1)
+        print(f"Load: {load_time:.1f}s")
+
+        # HBM usage
+        hbm = get_hbm_usage()
+        if hbm:
+            result["hbm_usage_gb"] = round(hbm, 2)
+            print(f"HBM usage: {hbm:.2f} GB")
+
+        # Tokenizer
+        from transformers import AutoTokenizer
+
+        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+        # Test 1: Text-only TTFT
+        print("\n--- Text-only TTFT ---")
+        prompt = "What is the capital of France?"
+        ttft, actual_seq = measure_ttft(model, tokenizer, prompt, seq_len, batch_size)
+        result["text_ttft_ms"] = round(ttft * 1000, 1)
+        result["text_ttft_input_len"] = actual_seq
+        print(
+            f"TTFT: {ttft * 1000:.1f} ms (input_len={actual_seq}, batch={batch_size})"
+        )
+
+        # Test 2: TKG throughput
+        print("\n--- TKG throughput ---")
+        throughput, gen_time, num_tok = measure_tkg_throughput(
+            model, tokenizer, prompt, num_tokens=32, batch_size=batch_size
+        )
+        result["tkg_throughput_tok_s"] = round(throughput, 1)
+        result["tkg_gen_time_s"] = round(gen_time, 3)
+        result["tkg_num_tokens"] = num_tok
+        print(
+            f"Throughput: {throughput:.1f} tok/s ({gen_time:.3f}s for {num_tok} tokens, batch={batch_size})"
+        )
+
+        # Test 3: Multimodal TTFT (only for batch_size=1)
+        if batch_size == 1:
+            print("\n--- Multimodal TTFT ---")
+            pixel_values = torch.randn(1, 3, 448, 448)
+            mm_ttft, mm_seq = measure_ttft(
+                model,
+                tokenizer,
+                "Describe this image:",
+                seq_len,
+                batch_size,
+                pixel_values=pixel_values,
+            )
+            result["multimodal_ttft_ms"] = round(mm_ttft * 1000, 1)
+            result["multimodal_input_len"] = mm_seq
+            print(f"Multimodal TTFT: {mm_ttft * 1000:.1f} ms (input_len={mm_seq})")
+
+        # Test 4: Long prompt TTFT (fill most of seq_len)
+        if seq_len >= 4096:
+            print(f"\n--- Long prompt TTFT (target ~{seq_len // 2} tokens) ---")
+            # Generate a long prompt by repeating text
+            base_text = "The quick brown fox jumps over the lazy dog. " * 100
+            long_inputs = tokenizer(
+                base_text, return_tensors="pt", max_length=seq_len // 2, truncation=True
+            )
+            long_prompt_len = long_inputs.input_ids.shape[-1]
+            long_ids = long_inputs.input_ids
+            if batch_size > 1:
+                long_ids = long_ids.repeat(batch_size, 1)
+            long_pos = torch.arange(long_prompt_len, dtype=torch.int32).unsqueeze(0)
+            if batch_size > 1:
+                long_pos = long_pos.repeat(batch_size, 1)
+            long_seq_ids = torch.arange(batch_size, dtype=torch.int32)
+
+            model.reset()
+            long_ttft_times = []
+            for _ in range(3):
+                model.reset()
+                start = time.perf_counter()
+                with torch.no_grad():
+                    model(
+                        input_ids=long_ids, position_ids=long_pos, seq_ids=long_seq_ids
+                    )
+                long_ttft_times.append(time.perf_counter() - start)
+
+            long_ttft = sum(long_ttft_times) / len(long_ttft_times)
+            result["long_ttft_ms"] = round(long_ttft * 1000, 1)
+            result["long_ttft_input_len"] = long_prompt_len
+            print(
+                f"Long TTFT: {long_ttft * 1000:.1f} ms (input_len={long_prompt_len}, batch={batch_size})"
+            )
+
+        result["status"] = "PASS"
+        print(f"\nResult: PASS")
+
+    except Exception as e:
+        result["status"] = "FAIL"
+        result["error"] = str(e)
+        result["traceback"] = traceback.format_exc()
+        print(f"\nResult: FAIL - {e}")
+
+    finally:
+        # Cleanup to free HBM
+        if "model" in locals():
+            del model
+        gc.collect()
+
+    return result
+
+
+def main():
+    parser = argparse.ArgumentParser(description="InternVL3 scaling tests")
+    parser.add_argument(
+        "--seq-len",
+        type=int,
+        default=None,
+        help="Test specific seq_len (default: test 2048, 4096, 8192)",
+    )
+    parser.add_argument(
+        "--batch-size",
+        type=int,
+        default=None,
+        help="Test specific batch_size (default: test 1, 2, 4)",
+    )
+    parser.add_argument(
+        "--skip-compile",
+        action="store_true",
+        help="Skip compilation, use existing compiled models",
+    )
+    args = parser.parse_args()
+
+    print("=" * 60)
+    print("InternVL3-8B-Instruct: Scaling Tests (Tasks 015-016)")
+    print("=" * 60)
+    print(f"Instance: trn2.3xlarge LNC=2 TP=4")
+    print(f"Model: {MODEL_PATH}")
+
+    all_results = []
+
+    if args.seq_len and args.batch_size:
+        # Single specific test
+        result = run_test(args.seq_len, args.batch_size, args.skip_compile)
+        all_results.append(result)
+    elif args.seq_len:
+        # Test specific seq_len with batch_size sweep
+        for bs in [1, 2, 4]:
+            result = run_test(args.seq_len, bs, args.skip_compile)
+            all_results.append(result)
+            if result["status"] == "FAIL":
+                print(f"\nStopping batch sweep at batch_size={bs} (failed)")
+                break
+    elif args.batch_size:
+        # Test specific batch_size with seq_len sweep
+        for sl in [2048, 4096, 8192]:
+            result = run_test(sl, args.batch_size, args.skip_compile)
+            all_results.append(result)
+            if result["status"] == "FAIL":
+                print(f"\nStopping seq_len sweep at seq_len={sl} (failed)")
+                break
+    else:
+        # Full scaling matrix
+        # Phase 1: seq_len sweep at batch_size=1
+        print("\n\n*** PHASE 1: Sequence Length Sweep (batch_size=1) ***")
+        max_passing_seq_len = 2048
+        for sl in [2048, 4096, 8192]:
+            result = run_test(sl, 1, args.skip_compile)
+            all_results.append(result)
+            if result["status"] == "PASS":
+                max_passing_seq_len = sl
+            else:
+                print(
+                    f"\nSeq_len ceiling: {sl} fails. Max passing: {max_passing_seq_len}"
+                )
+                break
+
+        # Phase 2: batch_size sweep at recommended seq_len
+        recommended_seq_len = min(max_passing_seq_len, 4096)
+        print(f"\n\n*** PHASE 2: Batch Size Sweep (seq_len={recommended_seq_len}) ***")
+        for bs in [2, 4]:
+            result = run_test(recommended_seq_len, bs, args.skip_compile)
+            all_results.append(result)
+            if result["status"] == "FAIL":
+                print(
+                    f"\nBatch size ceiling at seq_len={recommended_seq_len}: bs={bs} fails"
+                )
+                break
+
+    # Summary
+    print("\n\n" + "=" * 60)
+    print("SCALING TEST SUMMARY")
+    print("=" * 60)
+
+    print(
+        f"\n{'seq_len':<10} {'batch':<8} {'status':<8} {'compile':<12} {'TTFT(ms)':<12} {'tok/s':<10} {'mm_TTFT':<12}"
+    )
+    print("-" * 72)
+    for r in all_results:
+        compile_t = r.get("compile_time_s", "?")
+        if isinstance(compile_t, (int, float)):
+            compile_str = f"{compile_t:.0f}s"
+        else:
+            compile_str = str(compile_t)
+
+        ttft = r.get("text_ttft_ms", "?")
+        ttft_str = f"{ttft}" if isinstance(ttft, str) else f"{ttft:.1f}"
+
+        tps = r.get("tkg_throughput_tok_s", "?")
+        tps_str = f"{tps}" if isinstance(tps, str) else f"{tps:.1f}"
+
+        mm = r.get("multimodal_ttft_ms", "-")
+        mm_str = f"{mm}" if isinstance(mm, str) else f"{mm:.1f}"
+
+        print(
+            f"{r['seq_len']:<10} {r['batch_size']:<8} {r['status']:<8} {compile_str:<12} {ttft_str:<12} {tps_str:<10} {mm_str:<12}"
+        )
+
+    # Save results
+    with open(RESULTS_PATH, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to {RESULTS_PATH}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/InternVL3-8B-Instruct/src/__init__.py b/contrib/models/InternVL3-8B-Instruct/src/__init__.py
new file mode 100644
index 00000000..88cea97a
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/src/__init__.py
@@ -0,0 +1,28 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from modeling_internvl3 import (
+    InternVL3InferenceConfig,
+    NeuronInternVL3ForCausalLM,
+)
+from modeling_internvl3_text import (
+    InternVL3TextModelWrapper,
+    NeuronInternVL3TextForCausalLM,
+    NeuronInternVL3TextModel,
+)
+from modeling_internvl3_vision import (
+    InternVL3VisionModelWrapper,
+    NeuronInternVL3VisionModel,
+    convert_vision_hf_to_neuron_state_dict,
+)
+
+__all__ = [
+    "InternVL3InferenceConfig",
+    "NeuronInternVL3ForCausalLM",
+    "InternVL3TextModelWrapper",
+    "NeuronInternVL3TextForCausalLM",
+    "NeuronInternVL3TextModel",
+    "InternVL3VisionModelWrapper",
+    "NeuronInternVL3VisionModel",
+    "convert_vision_hf_to_neuron_state_dict",
+]
diff --git a/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3.py b/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3.py
new file mode 100644
index 00000000..763bcd33
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3.py
@@ -0,0 +1,585 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+NxDI contrib model for InternVL3-8B-Instruct (OpenGVLab).
+
+Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone.
+
+This is the top-level VLM class that inherits from NeuronBaseForImageToText,
+orchestrating both the vision encoder (separate NEFF) and text decoder (CTE + TKG NEFFs).
+
+Vision pipeline:
+  pixel_values [B,3,448,448] -> InternViT -> strip CLS -> pixel_shuffle -> MLP projector
+  -> [1, seq_len, 3584] padded vision embeddings
+
+Text pipeline:
+  input_ids -> embed_tokens -> scatter vision embeddings at <IMG_CONTEXT> positions
+  -> 28 Qwen2.5 decoder layers -> lm_head -> logits
+
+Special tokens:
+  <IMG_CONTEXT> = 151667 (image placeholder token in input_ids)
+  <img> = 151665 (image start)
+  </img> = 151666 (image end)
+"""
+
+import copy
+import json
+import logging
+import os
+from typing import Callable, Dict, List, Optional, Tuple, Type, Union
+
+import torch
+from transformers.modeling_outputs import CausalLMOutputWithPast
+
+from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig
+from neuronx_distributed_inference.models.image_to_text_model_base import (
+    ImageToTextInferenceConfig,
+    NeuronBaseForImageToText,
+)
+from neuronx_distributed_inference.models.llama4.utils.encoder_utils import (
+    generate_positions_from_mask,
+    pad_positions,
+)
+from neuronx_distributed_inference.models.model_wrapper import VISION_ENCODER_MODEL_TAG
+
+from modeling_internvl3_text import (
+    InternVL3TextModelWrapper,
+    NeuronInternVL3TextForCausalLM,
+    NeuronInternVL3TextModel,
+)
+from modeling_internvl3_vision import (
+    InternVL3VisionModelWrapper,
+    NeuronInternVL3VisionModel,
+    convert_vision_hf_to_neuron_state_dict,
+)
+
+logger = logging.getLogger("Neuron")
+
+# InternVL3 special token ID for image context placeholder
+IMG_CONTEXT_TOKEN_ID = 151667
+
+# Keys from top-level config that must be copied to text_config
+INTERNVL3_TEXT_CONFIG_KEYS = [
+    "hidden_size",
+    "num_attention_heads",
+    "num_hidden_layers",
+    "num_key_value_heads",
+    "pad_token_id",
+    "vocab_size",
+    "intermediate_size",
+    "max_position_embeddings",
+    "rms_norm_eps",
+    "rope_theta",
+    "hidden_act",
+    "bos_token_id",
+    "eos_token_id",
+    "tie_word_embeddings",
+]
+
+
+# ---------------------------------------------------------------------------
+# Config
+# ---------------------------------------------------------------------------
+
+
+class InternVL3InferenceConfig(ImageToTextInferenceConfig):
+    """
+    Inference configuration for InternVL3 on Neuron.
+
+    Requires two NeuronConfig objects:
+    - text_neuron_config: for the Qwen2.5-7B text decoder (CTE + TKG)
+    - vision_neuron_config: for the InternViT-300M vision encoder
+
+    The HF config.json has text params under "llm_config" and vision params
+    under "vision_config". This class handles the mapping.
+    """
+
+    def __init__(
+        self,
+        text_neuron_config,
+        vision_neuron_config,
+        fused_spec_config=None,
+        load_config=None,
+        metadata: Optional[Dict] = None,
+        **kwargs,
+    ):
+        # Wrap load_config to map InternVL's llm_config -> text_config
+        # The NxDI base class calls load_config(self) which sets attributes from
+        # config.to_dict(). InternVL HF config has "llm_config" not "text_config",
+        # so we intercept to create text_config from llm_config before validation.
+        if load_config is not None:
+            original_load_config = load_config
+
+            def _patched_load_config(config_obj):
+                original_load_config(config_obj)
+                # Map llm_config -> text_config if needed
+                if (
+                    not hasattr(config_obj, "text_config")
+                    or config_obj.text_config is None
+                ):
+                    llm_cfg = getattr(config_obj, "llm_config", None)
+                    if llm_cfg is not None:
+                        # llm_config can be a dict, PretrainedConfig, or SimpleNamespace
+                        if isinstance(llm_cfg, dict):
+                            text_config_dict = dict(llm_cfg)
+                        elif hasattr(llm_cfg, "to_dict"):
+                            text_config_dict = llm_cfg.to_dict()
+                        elif hasattr(llm_cfg, "__dict__"):
+                            text_config_dict = dict(vars(llm_cfg))
+                        else:
+                            text_config_dict = {
+                                k: getattr(llm_cfg, k)
+                                for k in dir(llm_cfg)
+                                if not k.startswith("_")
+                            }
+                        # Set HF defaults needed by NxDI
+                        text_config_dict.setdefault("output_attentions", False)
+                        text_config_dict.setdefault("output_hidden_states", False)
+                        config_obj.text_config = text_config_dict
+                        # Also propagate text config keys to top level
+                        for key in INTERNVL3_TEXT_CONFIG_KEYS:
+                            if key in text_config_dict and not hasattr(config_obj, key):
+                                setattr(config_obj, key, text_config_dict[key])
+
+            load_config = _patched_load_config
+
+        super().__init__(
+            text_neuron_config=text_neuron_config,
+            vision_neuron_config=vision_neuron_config,
+            fused_spec_config=fused_spec_config,
+            load_config=load_config,
+            metadata=metadata,
+            **kwargs,
+        )
+        self.add_special_config()
+        self.validate_model_supported_configs()
+
+    def add_special_config(self):
+        """Set InternVL3-specific config defaults."""
+        self.num_cores_per_group = 1
+
+        # Qwen2.5 text backbone: QKV bias, no O bias
+        self.qkv_bias = True
+        self.o_bias = False
+
+        # Image token ID for vision mask generation
+        self.image_token_id = IMG_CONTEXT_TOKEN_ID
+
+        # Copy text keys from top-level config to text_config
+        for key in INTERNVL3_TEXT_CONFIG_KEYS:
+            if hasattr(self, key):
+                setattr(self.text_config, key, getattr(self, key))
+
+        self.pad_token_id = getattr(self.text_config, "pad_token_id", 0)
+
+    def validate_model_supported_configs(self):
+        """Validate and disable unsupported NeuronConfig options."""
+        # Validate text config keys match
+        for key in INTERNVL3_TEXT_CONFIG_KEYS:
+            if hasattr(self, key) and hasattr(self.text_config, key):
+                top_val = getattr(self, key)
+                text_val = getattr(self.text_config, key)
+                if top_val != text_val:
+                    logger.warning(
+                        f"Config mismatch: {key} top={top_val} vs text={text_val}, using top"
+                    )
+                    setattr(self.text_config, key, top_val)
+
+        # Disable unsupported text model features
+        TEXT_UNSUPPORTED = [
+            "is_block_kv_layout",
+            "is_prefix_caching",
+            "is_chunked_prefill",
+            "is_medusa",
+            "enable_fused_speculation",
+        ]
+        for cfg_name in TEXT_UNSUPPORTED:
+            if getattr(self.text_config.neuron_config, cfg_name, False) is not False:
+                setattr(self.text_config.neuron_config, cfg_name, False)
+                logger.warning(
+                    f"InternVL3 text model: '{cfg_name}' unsupported, disabled."
+                )
+
+        # Disable unsupported vision model features
+        VISION_UNSUPPORTED = [
+            "sequence_parallel_enabled",
+            "flash_decoding_enabled",
+            "qkv_kernel_enabled",
+            "attn_block_tkg_nki_kernel_cache_update",
+            "attn_block_tkg_nki_kernel_enabled",
+        ]
+        for cfg_name in VISION_UNSUPPORTED:
+            if getattr(self.vision_config.neuron_config, cfg_name, False) is not False:
+                setattr(self.vision_config.neuron_config, cfg_name, False)
+                logger.warning(
+                    f"InternVL3 vision model: '{cfg_name}' unsupported, disabled."
+                )
+
+    def get_required_attributes(self) -> List[str]:
+        return [
+            "text_config",
+            "vision_config",
+            "text_config.hidden_size",
+            "text_config.num_attention_heads",
+            "text_config.num_hidden_layers",
+            "text_config.num_key_value_heads",
+            "text_config.pad_token_id",
+            "text_config.vocab_size",
+            "text_config.max_position_embeddings",
+            "text_config.rope_theta",
+            "text_config.rms_norm_eps",
+            "text_config.hidden_act",
+            "vision_config.hidden_size",
+            "vision_config.num_attention_heads",
+            "vision_config.num_hidden_layers",
+            "vision_config.image_size",
+            "vision_config.patch_size",
+        ]
+
+    @classmethod
+    def get_neuron_config_cls(cls) -> Type[NeuronConfig]:
+        return NeuronConfig
+
+    @classmethod
+    def from_pretrained(
+        cls,
+        model_path: str,
+        text_neuron_config=None,
+        vision_neuron_config=None,
+        **kwargs,
+    ) -> "InternVL3InferenceConfig":
+        """
+        Load configuration from a pretrained InternVL3 model directory.
+
+        InternVL3 config.json structure:
+          - Top-level: model_type, downsample_ratio, force_image_size, etc.
+          - llm_config: Qwen2.5-7B text params (model_type=qwen2)
+          - vision_config: InternViT params (model_type=intern_vit_6b)
+
+        The ImageToTextInferenceConfig parent expects "text_config" and
+        "vision_config" keys in kwargs. We map llm_config -> text_config.
+        """
+        config_path = os.path.join(model_path, "config.json")
+        if not os.path.exists(config_path):
+            raise FileNotFoundError(f"config.json not found at {config_path}")
+
+        with open(config_path, "r") as f:
+            config_dict = json.load(f)
+
+        # Extract text config (InternVL uses "llm_config", not "text_config")
+        llm_config = config_dict.get("llm_config", {})
+        vision_config = config_dict.get("vision_config", {})
+
+        # Build the config dict that ImageToTextInferenceConfig expects
+        # Must have "text_config" and "vision_config" at top level
+        inference_kwargs = {}
+
+        # Copy top-level InternVL params
+        for key in [
+            "downsample_ratio",
+            "force_image_size",
+            "select_layer",
+            "model_type",
+            "architectures",
+            "tie_word_embeddings",
+        ]:
+            if key in config_dict:
+                inference_kwargs[key] = config_dict[key]
+
+        # Copy text params from llm_config -> text_config
+        text_config_dict = {}
+        for key in [
+            "hidden_size",
+            "num_attention_heads",
+            "num_hidden_layers",
+            "num_key_value_heads",
+            "vocab_size",
+            "max_position_embeddings",
+            "rope_theta",
+            "rms_norm_eps",
+            "hidden_act",
+            "intermediate_size",
+            "pad_token_id",
+            "bos_token_id",
+            "eos_token_id",
+            "tie_word_embeddings",
+        ]:
+            if key in llm_config:
+                text_config_dict[key] = llm_config[key]
+
+        # HF PretrainedConfig defaults required by NxDI model_base._setup_func_config()
+        # These are normally set by HF's from_pretrained() but we build config manually
+        text_config_dict.setdefault("output_attentions", False)
+        text_config_dict.setdefault("output_hidden_states", False)
+        # Note: do NOT set use_return_dict here — it's a read-only computed attribute
+        # in PretrainedConfig and will raise AttributeError when HuggingFaceGenerationAdapter
+        # converts our config via to_pretrained_config() → PretrainedConfig(**text_config_dict)
+
+        # Also set at top level (required by ImageToTextInferenceConfig)
+        for key, value in text_config_dict.items():
+            inference_kwargs[key] = value
+
+        inference_kwargs["text_config"] = text_config_dict
+
+        # Copy vision config as-is
+        inference_kwargs["vision_config"] = vision_config
+
+        # Set image_token_id
+        inference_kwargs["image_token_id"] = IMG_CONTEXT_TOKEN_ID
+
+        # Set _name_or_path
+        inference_kwargs["_name_or_path"] = model_path
+
+        # Merge user kwargs
+        inference_kwargs.update(kwargs)
+
+        return cls(
+            text_neuron_config=text_neuron_config,
+            vision_neuron_config=vision_neuron_config,
+            **inference_kwargs,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Top-level VLM class
+# ---------------------------------------------------------------------------
+
+
+class NeuronInternVL3ForCausalLM(NeuronBaseForImageToText):
+    """
+    InternVL3 vision-language model for Neuron inference.
+
+    Orchestrates:
+    - Vision encoder NEFF (InternViT-300M + pixel shuffle + projector)
+    - Text decoder CTE NEFF (Qwen2.5-7B context encoding with vision embedding injection)
+    - Text decoder TKG NEFF (Qwen2.5-7B token generation)
+
+    Usage:
+        text_nc = NeuronConfig(tp_degree=4, max_batch_size=1, seq_len=4096, ...)
+        vision_nc = NeuronConfig(tp_degree=1, max_batch_size=1, buckets=[1], ...)
+        config = InternVL3InferenceConfig.from_pretrained(
+            model_path, text_neuron_config=text_nc, vision_neuron_config=vision_nc
+        )
+        model = NeuronInternVL3ForCausalLM(config)
+        model.compile(compiled_path)
+        model.load(compiled_path)
+        output = model(input_ids, attention_mask, position_ids, seq_ids,
+                       sampling_params, pixel_values=pixel_values)
+    """
+
+    text_model_cls = NeuronInternVL3TextModel
+    vision_model_cls = NeuronInternVL3VisionModel
+    text_model_wrapper = InternVL3TextModelWrapper
+    vision_model_wrapper = InternVL3VisionModelWrapper
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(
+            self.text_model_cls,
+            self.vision_model_cls,
+            self.text_model_wrapper,
+            self.vision_model_wrapper,
+            *args,
+            **kwargs,
+        )
+
+    def get_vision_compiler_args(self) -> str:
+        """Compiler args for vision encoder NEFF."""
+        return "--auto-cast=matmult --model-type=transformer -O1"
+
+    def get_compiler_args(self) -> str:
+        """Compiler args for text model NEFFs (CTE + TKG)."""
+        return "--auto-cast=matmult --model-type=transformer -O1"
+
+    def get_required_kwargs(self) -> List[str]:
+        """Additional input args for HuggingFaceGenerationAdapter."""
+        return ["pixel_values", "vision_mask"]
+
+    def enable_vision_encoder(
+        self, enable_wlt_optimization: bool = True, **model_init_kwargs
+    ):
+        """Create the vision encoder model wrapper."""
+        new_config = copy.deepcopy(self.config)
+        self.vision_encoder_model = self.vision_model_wrapper(
+            config=new_config,
+            model_cls=self.vision_model_cls,
+            tag=VISION_ENCODER_MODEL_TAG,
+            compiler_args=self.get_vision_compiler_args(),
+            model_init_kwargs=model_init_kwargs,
+            priority_model_idx=(0 if enable_wlt_optimization else None),
+            pipeline_execution=True,
+            return_ranked_to_cpu=False,
+        )
+        self.vision_models.append(self.vision_encoder_model)
+
+    def get_padding_length(self, input_ids):
+        """Get the CTE bucket size for the given input length."""
+        buckets = self.context_encoding_model.config.neuron_config.buckets
+        for val in buckets:
+            if val >= input_ids.shape[1]:
+                return val
+        raise RuntimeError(
+            f"No bucket found for input_ids length {input_ids.shape[1]}. "
+            f"Available buckets: {buckets}"
+        )
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        seq_ids: Optional[torch.LongTensor] = None,
+        sampling_params: Optional[torch.FloatTensor] = None,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        vision_mask: Optional[torch.FloatTensor] = None,
+        adapter_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        use_cache: Optional[bool] = None,
+        medusa_args=None,
+        input_capture_hook: Optional[Callable] = None,
+        tensor_capture_hook: Optional[Callable] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        """
+        Forward pass orchestrating vision encoder and text decoder.
+
+        For context encoding with images:
+          1. Identify <IMG_CONTEXT> token positions in input_ids
+          2. Run vision encoder NEFF -> padded vision embeddings
+          3. Pass vision_embeddings + vision_mask to text decoder CTE
+          4. Inside CTE NEFF: embed_tokens -> scatter vision -> decoder layers -> logits
+
+        For token generation or text-only:
+          - Pass dummy (zero) vision tensors to text decoder TKG
+        """
+        # Work around NxDI issue: NeuronBaseForImageToText.forward() doesn't
+        # capture preprocess_inputs() return values, so sampling_params=None
+        # flows through to the compiled NEFF which expects [batch, 3].
+        # Provide default sampling_params here if not supplied by caller.
+        if sampling_params is None:
+            sampling_params = self.default_sampling_params
+
+        pad_limit = self.get_padding_length(input_ids)
+
+        if (
+            pixel_values is not None
+            and input_ids.shape[-1] > 1
+            and pixel_values.sum() != 0
+        ):
+            # Context encoding with images
+            # Build vision_mask: find <IMG_CONTEXT> positions in input_ids
+            vision_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
+            vision_mask = vision_mask.to(torch.bool)
+            vision_mask = generate_positions_from_mask(vision_mask.squeeze())
+            vision_mask = pad_positions(vision_mask, pad_limit, (pad_limit - 1))
+
+            # Run vision encoder NEFF
+            # The NEFF is compiled for batch=1 (single image). For multi-image
+            # requests, loop over images and concatenate vision embeddings.
+            num_images = pixel_values.shape[0]
+            pv_dtype = self.vision_config.neuron_config.torch_dtype
+            if num_images == 1:
+                vision_embeddings = self.vision_encoder_model(pixel_values.to(pv_dtype))
+            else:
+                # Each NEFF call returns [1, text_seq_len, hidden] with the first
+                # tokens_per_image positions containing real embeddings (rest are
+                # zero-padded by pad_to_text_seq_len inside the NEFF).
+                # Extract real tokens from each call, concatenate, re-pad.
+                tokens_per_image = 256  # (448/14)^2 * downsample_ratio^2
+                emb_list = []
+                for i in range(num_images):
+                    emb_i = self.vision_encoder_model(
+                        pixel_values[i : i + 1].to(pv_dtype)
+                    )
+                    # Extract real tokens only: [1, tokens_per_image, hidden]
+                    emb_list.append(emb_i[:, :tokens_per_image, :])
+                # Concatenate: [1, num_images * tokens_per_image, hidden]
+                vision_embeddings = torch.cat(emb_list, dim=1)
+                # Pad to CTE bucket size (pad_limit)
+                total_vis_tokens = vision_embeddings.shape[1]
+                if total_vis_tokens < pad_limit:
+                    pad_zeros = torch.zeros(
+                        1,
+                        pad_limit - total_vis_tokens,
+                        vision_embeddings.shape[2],
+                        dtype=vision_embeddings.dtype,
+                        device=vision_embeddings.device,
+                    )
+                    vision_embeddings = torch.cat([vision_embeddings, pad_zeros], dim=1)
+                elif total_vis_tokens > pad_limit:
+                    vision_embeddings = vision_embeddings[:, :pad_limit, :]
+        else:
+            # Token generation or text-only: use dummy zeros
+            vision_embeddings, vision_mask = (
+                self.text_model_wrapper.get_dummy_vision_inputs(
+                    config=self.text_config,
+                    input_ids=input_ids,
+                    n_active_tokens=pad_limit,
+                    fill_value=(pad_limit - 1),
+                )
+            )
+
+        output_token = super().forward(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            seq_ids=seq_ids,
+            sampling_params=sampling_params,
+            input_capture_hook=input_capture_hook,
+            tensor_capture_hook=tensor_capture_hook,
+            vision_embeddings=vision_embeddings,
+            vision_mask=vision_mask,
+        )
+        return output_token
+
+    @classmethod
+    def get_config_cls(cls):
+        return InternVL3InferenceConfig
+
+    @staticmethod
+    def load_hf_model(model_path, **kwargs):
+        """Load original HF InternVL3 model for CPU reference inference.
+
+        Uses FP32 to avoid -inf logits that occur with BF16,
+        which cause false failures in logit_validation.
+        """
+        from transformers import AutoModel
+
+        model = AutoModel.from_pretrained(
+            model_path, trust_remote_code=True, torch_dtype=torch.float32
+        ).eval()
+        # HF generate() requires img_context_token_id to be set
+        model.img_context_token_id = IMG_CONTEXT_TOKEN_ID
+        return model
+
+    @staticmethod
+    def convert_hf_to_neuron_state_dict(
+        state_dict: dict, inference_config: InternVL3InferenceConfig
+    ) -> dict:
+        """
+        Convert full InternVL3 HF state dict to Neuron format.
+
+        Delegates to:
+        - convert_vision_hf_to_neuron_state_dict() for vision + projector weights
+        - NeuronInternVL3TextForCausalLM.convert_hf_to_neuron_state_dict() for text weights
+        """
+        # Vision weights (encoder + projector)
+        vision_state_dict = convert_vision_hf_to_neuron_state_dict(state_dict)
+
+        # Text weights
+        text_state_dict = (
+            NeuronInternVL3TextForCausalLM.convert_hf_to_neuron_state_dict(
+                state_dict, inference_config.text_config
+            )
+        )
+
+        # Merge (vision and text keys should not overlap)
+        merged = {}
+        merged.update(vision_state_dict)
+        merged.update(text_state_dict)
+        return merged
+
+    @staticmethod
+    def update_state_dict_for_tied_weights(state_dict):
+        """Handle tied embeddings."""
+        NeuronInternVL3TextForCausalLM.update_state_dict_for_tied_weights(state_dict)
diff --git a/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3_text.py b/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3_text.py
new file mode 100644
index 00000000..be8f2094
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3_text.py
@@ -0,0 +1,445 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+InternVL3-8B-Instruct: Text backbone (Qwen2.5-7B) for NxDI.
+
+This module contains the text decoder model, attention, decoder layers,
+text model wrapper (ImageToTextModelWrapper), and weight conversion.
+
+Text backbone: Qwen2.5-7B
+- 28 layers, hidden_size=3584, 28 Q heads, 4 KV heads (GQA 7:1)
+- Standard RoPE (rope_theta=1e6), RMSNorm, SiLU gated MLP
+- QKV bias=True, O bias=False
+- vocab_size=151674, tie_word_embeddings=False
+
+Weight key mapping (HF -> NxDI):
+  language_model.model.layers.{i}.* -> layers.{i}.*
+  language_model.model.embed_tokens.weight -> embed_tokens.weight
+  language_model.model.norm.weight -> norm.weight
+  language_model.lm_head.weight -> lm_head.weight
+"""
+
+import torch
+from torch import nn
+
+from transformers.models.qwen2.modeling_qwen2 import Qwen2RMSNorm
+
+from neuronx_distributed.parallel_layers.layers import (
+    ColumnParallelLinear,
+    ParallelEmbedding,
+)
+from neuronx_distributed.utils import cpu_mode
+
+from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig
+from neuronx_distributed_inference.models.image_to_text_model_wrapper import (
+    ImageToTextModelWrapper,
+)
+from neuronx_distributed_inference.models.llama.modeling_llama import NeuronLlamaMLP
+from neuronx_distributed_inference.models.llama4.utils.encoder_utils import (
+    scatter_by_index_put,
+)
+from neuronx_distributed_inference.models.model_base import (
+    NeuronBaseForCausalLM,
+    NeuronBaseModel,
+)
+from neuronx_distributed_inference.modules.attention.attention_base import (
+    NeuronAttentionBase,
+)
+from neuronx_distributed_inference.modules.attention.utils import RotaryEmbedding
+from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm
+
+
+def get_rmsnorm_cls():
+    """Get RMSNorm implementation: HF for CPU, CustomRMSNorm for Neuron."""
+    return Qwen2RMSNorm if cpu_mode() else CustomRMSNorm
+
+
+# ---------------------------------------------------------------------------
+# Attention
+# ---------------------------------------------------------------------------
+
+
+class NeuronInternVL3Attention(NeuronAttentionBase):
+    """
+    InternVL3 text attention: GQA with standard RoPE and QKV bias.
+
+    - 28 Q heads, 4 KV heads (7:1 GQA ratio)
+    - head_dim = 128
+    - Q/K/V have bias, O does not
+    - Standard RoPE (not M-RoPE)
+    - No Q-K normalization (unlike Qwen3)
+    """
+
+    def __init__(self, config):
+        head_dim = getattr(
+            config,
+            "head_dim",
+            config.hidden_size // config.num_attention_heads,
+        )
+        rotary_emb = RotaryEmbedding(
+            dim=head_dim,
+            max_position_embeddings=config.max_position_embeddings,
+            base=config.rope_theta,
+        )
+        super().__init__(
+            config=config,
+            hidden_size=config.hidden_size,
+            num_attention_heads=config.num_attention_heads,
+            num_key_value_heads=config.num_key_value_heads,
+            head_dim=head_dim,
+            rotary_emb=rotary_emb,
+            qkv_bias=True,
+            o_bias=False,
+            rms_norm_eps=config.rms_norm_eps,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Decoder layer
+# ---------------------------------------------------------------------------
+
+
+class NeuronInternVL3DecoderLayer(nn.Module):
+    """
+    InternVL3 text decoder layer: pre-norm RMSNorm + GQA attention + SwiGLU MLP.
+
+    Supports NKI kernel fused RMSNorm when qkv_kernel_enabled or mlp_kernel_enabled.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = NeuronInternVL3Attention(config)
+        self.mlp = NeuronLlamaMLP(config)
+        self.input_layernorm = get_rmsnorm_cls()(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+        self.post_attention_layernorm = get_rmsnorm_cls()(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+
+        # NKI kernel flags — fuse RMSNorm into kernels when enabled
+        neuron_config = config.neuron_config
+        self.qkv_kernel_enabled = neuron_config.qkv_kernel_enabled
+        self.mlp_kernel_enabled = neuron_config.mlp_kernel_enabled
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask=None,
+        position_ids=None,
+        past_key_value=None,
+        **kwargs,
+    ):
+        residual = hidden_states
+
+        # When QKV kernel is enabled, pass the RMSNorm module to be fused
+        # into the NKI kernel instead of applying it separately
+        if self.qkv_kernel_enabled:
+            qkv_fused_rmsnorm = self.input_layernorm
+        else:
+            hidden_states = self.input_layernorm(hidden_states)
+            qkv_fused_rmsnorm = None
+
+        hidden_states, present_key_value, cos_cache, sin_cache = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            rmsnorm=qkv_fused_rmsnorm,
+            **kwargs,
+        )
+
+        hidden_states = residual + hidden_states
+
+        residual = hidden_states
+
+        # When MLP kernel is enabled, pass the RMSNorm module to be fused
+        if self.mlp_kernel_enabled:
+            hidden_states, _ = self.mlp(
+                hidden_states,
+                rmsnorm=self.post_attention_layernorm,
+            )
+        else:
+            hidden_states = self.post_attention_layernorm(hidden_states)
+            hidden_states = self.mlp(hidden_states)[0]
+
+        hidden_states = residual + hidden_states
+
+        return (hidden_states, present_key_value, cos_cache, sin_cache, None)
+
+
+# ---------------------------------------------------------------------------
+# Text model wrapper (ImageToTextModelWrapper)
+# ---------------------------------------------------------------------------
+
+
+class InternVL3TextModelWrapper(ImageToTextModelWrapper):
+    """
+    Text model wrapper for InternVL3 that includes vision embedding inputs
+    in the compiled NEFF trace signature.
+
+    Inherits ImageToTextModelWrapper which generates 24-argument input tuples
+    with vision_embeddings (arg 22) and vision_mask (arg 23).
+    """
+
+    def __init__(
+        self,
+        config,
+        model_cls,
+        tag="",
+        compiler_args=None,
+        priority_model_idx=None,
+        pipeline_execution=True,
+        return_ranked_to_cpu=True,
+        model_init_kwargs={},
+    ) -> None:
+        super().__init__(
+            config,
+            model_cls,
+            tag,
+            compiler_args,
+            priority_model_idx,
+            pipeline_execution,
+            return_ranked_to_cpu,
+            model_init_kwargs,
+        )
+
+    @staticmethod
+    def get_dummy_vision_inputs(config, input_ids, n_active_tokens, fill_value):
+        """
+        Create dummy vision tensors for tracing and text-only / token-gen passes.
+
+        For context encoding (seq_len > 1):
+          - vision_embeddings: [batch, seq_len, hidden_size] zeros
+          - vision_mask: [batch, n_active_tokens, 1] filled with fill_value (int32 positions)
+        For token generation (seq_len == 1):
+          - Both are empty tensors
+        """
+        input_batch_size, input_sequence_len = input_ids.shape[0], input_ids.shape[-1]
+        if input_sequence_len > 1:
+            vision_embeddings = torch.zeros(
+                input_batch_size,
+                config.neuron_config.seq_len,
+                config.hidden_size,
+                dtype=config.neuron_config.torch_dtype,
+            )
+            vision_mask = torch.full(
+                size=(input_batch_size, n_active_tokens, 1),
+                fill_value=fill_value,
+                dtype=torch.int32,
+            )
+        else:
+            vision_embeddings = torch.zeros((0), dtype=config.neuron_config.torch_dtype)
+            vision_mask = torch.zeros((0), dtype=torch.bool)
+        return vision_embeddings, vision_mask
+
+
+# ---------------------------------------------------------------------------
+# Text model (traced on Neuron)
+# ---------------------------------------------------------------------------
+
+
+class NeuronInternVL3TextModel(NeuronBaseModel):
+    """
+    InternVL3 text model (Qwen2.5-7B backbone) for NxDI.
+
+    Components:
+    - ParallelEmbedding (vocab=151674, hidden=3584)
+    - 28x NeuronInternVL3DecoderLayer
+    - RMSNorm
+    - ColumnParallelLinear lm_head
+
+    Implements encode_vision_to_input() for merging vision embeddings
+    into text embeddings during context encoding.
+    """
+
+    def setup_attr_for_model(self, config):
+        self.on_device_sampling = (
+            config.neuron_config.on_device_sampling_config is not None
+        )
+        self.tp_degree = config.neuron_config.tp_degree
+        self.hidden_size = config.hidden_size
+        self.num_attention_heads = config.num_attention_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.max_batch_size = config.neuron_config.max_batch_size
+        self.buckets = config.neuron_config.buckets
+
+    def init_model(self, config):
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = ParallelEmbedding(
+            config.vocab_size,
+            config.hidden_size,
+            self.padding_idx,
+            dtype=config.neuron_config.torch_dtype,
+            shard_across_embedding=True,
+            pad=True,
+            sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled,
+        )
+
+        self.layers = nn.ModuleList(
+            [
+                NeuronInternVL3DecoderLayer(config)
+                for _ in range(config.num_hidden_layers)
+            ]
+        )
+
+        self.norm = get_rmsnorm_cls()(config.hidden_size, eps=config.rms_norm_eps)
+
+        self.lm_head = ColumnParallelLinear(
+            config.hidden_size,
+            config.vocab_size,
+            bias=False,
+            pad=True,
+            gather_output=not self.on_device_sampling,
+            dtype=config.neuron_config.torch_dtype,
+        )
+
+    def encode_vision_to_input(
+        self,
+        inputs_embeds: torch.Tensor,
+        vision_embeddings: torch.Tensor,
+        vision_mask: torch.Tensor,
+    ) -> torch.Tensor:
+        """
+        Scatter vision embeddings into text embeddings at image token positions.
+
+        Called by NeuronBaseModel.get_model_output() during context encoding only.
+        Runs ON-DEVICE inside the compiled NEFF.
+
+        Args:
+            inputs_embeds: [batch, seq_len, hidden_size] -- text token embeddings
+            vision_embeddings: [batch, seq_len, hidden_size] -- padded vision embeddings
+            vision_mask: [batch, seq_len, 1] -- int32 position indices
+
+        Returns:
+            inputs_embeds with vision positions replaced by vision embeddings
+        """
+        return scatter_by_index_put(inputs_embeds, vision_embeddings, vision_mask)
+
+
+# ---------------------------------------------------------------------------
+# Weight conversion helper (text-only, used by top-level model)
+# ---------------------------------------------------------------------------
+
+
+class NeuronInternVL3TextForCausalLM(NeuronBaseForCausalLM):
+    """
+    Helper class for text weight conversion only.
+    Not used directly for inference -- the top-level NeuronBaseForImageToText
+    class handles that via NeuronInternVL3TextModel.
+    """
+
+    _model_cls = NeuronInternVL3TextModel
+
+    @staticmethod
+    def load_hf_model(model_path, **kwargs):
+        return None
+
+    @staticmethod
+    def convert_hf_to_neuron_state_dict(
+        state_dict: dict, config: InferenceConfig
+    ) -> dict:
+        """
+        Convert InternVL3 text weights from HuggingFace to Neuron format.
+
+        HF layout:
+          language_model.model.embed_tokens.weight
+          language_model.model.layers.{i}.self_attn.{q,k,v}_proj.{weight,bias}
+          language_model.model.layers.{i}.self_attn.o_proj.weight
+          language_model.model.layers.{i}.mlp.{gate,up,down}_proj.weight
+          language_model.model.layers.{i}.{input,post_attention}_layernorm.weight
+          language_model.model.norm.weight
+          language_model.lm_head.weight
+
+        NxDI layout (fused_qkv=False):
+          embed_tokens.weight
+          layers.{i}.self_attn.{q,k,v}_proj.{weight,bias}
+          layers.{i}.self_attn.o_proj.weight
+          layers.{i}.mlp.{gate,up,down}_proj.weight
+          layers.{i}.{input,post_attention}_layernorm.weight
+          norm.weight
+          lm_head.weight
+
+        NxDI layout (fused_qkv=True):
+          layers.{i}.self_attn.qkv_proj.Wqkv.{weight,bias}
+          (q/k/v fused into single Wqkv tensor)
+
+        When fused_qkv=False, separate q/k/v_proj are kept and the GQA
+        preshard_hook fuses them during weight sharding.
+
+        When fused_qkv=True (required for NKI QKV kernel), we pre-fuse
+        q/k/v into Wqkv here because the preshard_hook expects it.
+        """
+        neuron_config = config.neuron_config
+        neuron_state_dict = {}
+
+        # Add rank tensors for tensor parallelism
+        if neuron_config.vocab_parallel:
+            neuron_state_dict["embed_tokens.rank_util.rank"] = torch.arange(
+                0, neuron_config.local_ranks_size
+            )
+
+        for key, value in state_dict.items():
+            # Only process text weights (language_model.*)
+            if not key.startswith("language_model."):
+                continue
+
+            # Strip the language_model.model. prefix
+            new_key = key
+            if key.startswith("language_model.model."):
+                new_key = key[len("language_model.model.") :]
+            elif key.startswith("language_model."):
+                new_key = key[len("language_model.") :]
+
+            neuron_state_dict[new_key] = value.detach().clone()
+
+        # When fused_qkv=True, fuse separate q/k/v weights into Wqkv
+        if neuron_config.fused_qkv:
+            for i in range(config.num_hidden_layers):
+                prefix = f"layers.{i}.self_attn"
+                # Fuse weights: [q_size, hidden] + [kv_size, hidden] + [kv_size, hidden]
+                q_w = neuron_state_dict.pop(f"{prefix}.q_proj.weight")
+                k_w = neuron_state_dict.pop(f"{prefix}.k_proj.weight")
+                v_w = neuron_state_dict.pop(f"{prefix}.v_proj.weight")
+                neuron_state_dict[f"{prefix}.qkv_proj.Wqkv.weight"] = torch.cat(
+                    [q_w, k_w, v_w], dim=0
+                )
+                # Fuse biases (Q/K/V all have bias in InternVL3)
+                q_b = neuron_state_dict.pop(f"{prefix}.q_proj.bias", None)
+                k_b = neuron_state_dict.pop(f"{prefix}.k_proj.bias", None)
+                v_b = neuron_state_dict.pop(f"{prefix}.v_proj.bias", None)
+                if q_b is not None and k_b is not None and v_b is not None:
+                    neuron_state_dict[f"{prefix}.qkv_proj.Wqkv.bias"] = torch.cat(
+                        [q_b, k_b, v_b], dim=0
+                    )
+
+        # Add per-layer rank tensors for attention TP sharding
+        tp_degree = neuron_config.tp_degree
+        for i in range(config.num_hidden_layers):
+            neuron_state_dict[f"layers.{i}.self_attn.rank_util.rank"] = torch.arange(
+                0, tp_degree, dtype=torch.int32
+            )
+
+        # Add base model rank tensor
+        neuron_state_dict["rank_util.rank"] = torch.arange(
+            0, tp_degree, dtype=torch.int32
+        )
+
+        return neuron_state_dict
+
+    @staticmethod
+    def update_state_dict_for_tied_weights(state_dict):
+        """Handle tied embeddings (InternVL3: tie_word_embeddings=False, but support both)."""
+        if "lm_head.weight" not in state_dict and "embed_tokens.weight" in state_dict:
+            state_dict["lm_head.weight"] = state_dict["embed_tokens.weight"].clone()
+
+    @classmethod
+    def get_config_cls(cls):
+        # This helper class doesn't need its own config -- used through top-level
+        from neuronx_distributed_inference.models.config import InferenceConfig
+
+        return InferenceConfig
diff --git a/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3_vision.py b/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3_vision.py
new file mode 100644
index 00000000..0d5ee256
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/src/modeling_internvl3_vision.py
@@ -0,0 +1,447 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+InternVL3-8B-Instruct: Vision encoder (InternViT-300M) for NxDI.
+
+Architecture:
+  - InternViT-300M-448px-V2.5
+  - 24 layers, hidden_size=1024, 16 heads, head_dim=64
+  - Patch size 14x14, image 448x448 -> 1024 patches + CLS = 1025 tokens
+  - LayerNorm, GELU, LayerScale (ls1, ls2)
+  - Fused QKV: attn.qkv [3072, 1024]
+  - Position: absolute learned embeddings [1, 1025, 1024]
+
+Full vision pipeline:
+  1. Patch embedding + CLS + position -> 24 transformer layers -> [B, 1025, 1024]
+  2. Strip CLS token: [B, 1024, 1024]
+  3. Pixel shuffle (0.5x): [B, 256, 4096]
+  4. Projector MLP: LayerNorm(4096) -> Linear(4096,3584) -> GELU -> Linear(3584,3584)
+  5. Pad to text seq_len: [1, seq_len, 3584]
+
+Weight key mapping (HF -> NxDI vision):
+  vision_model.embeddings.class_embedding -> encoder.class_embedding
+  vision_model.embeddings.patch_embedding.{weight,bias} -> encoder.patch_embedding.{weight,bias}
+  vision_model.embeddings.position_embedding -> encoder.position_embedding
+  vision_model.encoder.layers.{i}.* -> encoder.layers.{i}.*
+  mlp1.0.{weight,bias} -> proj_norm.{weight,bias}
+  mlp1.1.{weight,bias} -> proj_linear1.{weight,bias}
+  mlp1.3.{weight,bias} -> proj_linear2.{weight,bias}
+"""
+
+import math
+from typing import List, Tuple
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from neuronx_distributed_inference.models.config import InferenceConfig
+from neuronx_distributed_inference.models.model_wrapper import (
+    EncoderModelInstance,
+    ModelWrapper,
+)
+
+# InternVL3 vision constants
+VISION_HIDDEN_SIZE = 1024
+VISION_NUM_HEADS = 16
+VISION_NUM_LAYERS = 24
+VISION_INTERMEDIATE_SIZE = 4096
+VISION_PATCH_SIZE = 14
+VISION_IMAGE_SIZE = 448
+VISION_NUM_PATCHES = (VISION_IMAGE_SIZE // VISION_PATCH_SIZE) ** 2  # 1024
+VISION_NUM_OUTPUT_TOKENS = 256  # after pixel shuffle (0.5x): 1024 / 4
+TEXT_HIDDEN_SIZE = 3584
+DOWNSAMPLE_RATIO = 0.5
+
+
+# ---------------------------------------------------------------------------
+# Vision encoder components (pure PyTorch, for tracing)
+# ---------------------------------------------------------------------------
+
+
+class InternVisionAttention(nn.Module):
+    """InternViT attention with fused QKV projection."""
+
+    def __init__(self, hidden_size=VISION_HIDDEN_SIZE, num_heads=VISION_NUM_HEADS):
+        super().__init__()
+        self.num_heads = num_heads
+        self.head_dim = hidden_size // num_heads
+        self.scale = self.head_dim**-0.5
+
+        self.qkv = nn.Linear(hidden_size, 3 * hidden_size, bias=True)
+        self.proj = nn.Linear(hidden_size, hidden_size, bias=True)
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        batch, seq_len, _ = hidden_states.shape
+
+        qkv = self.qkv(hidden_states)
+        qkv = qkv.reshape(batch, seq_len, 3, self.num_heads, self.head_dim)
+        qkv = qkv.permute(2, 0, 3, 1, 4)  # [3, B, heads, seq, head_dim]
+        q, k, v = qkv.unbind(0)
+
+        # Scaled dot-product attention (no causal mask for vision)
+        attn = (q @ k.transpose(-2, -1)) * self.scale
+        attn = attn.softmax(dim=-1)
+        out = attn @ v  # [B, heads, seq, head_dim]
+
+        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
+        return self.proj(out)
+
+
+class InternVisionMLP(nn.Module):
+    """InternViT MLP: fc1 -> GELU -> fc2."""
+
+    def __init__(
+        self, hidden_size=VISION_HIDDEN_SIZE, intermediate_size=VISION_INTERMEDIATE_SIZE
+    ):
+        super().__init__()
+        self.fc1 = nn.Linear(hidden_size, intermediate_size, bias=True)
+        self.fc2 = nn.Linear(intermediate_size, hidden_size, bias=True)
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return self.fc2(F.gelu(self.fc1(hidden_states)))
+
+
+class InternVisionLayer(nn.Module):
+    """InternViT transformer layer with pre-norm and LayerScale."""
+
+    def __init__(
+        self,
+        hidden_size=VISION_HIDDEN_SIZE,
+        num_heads=VISION_NUM_HEADS,
+        intermediate_size=VISION_INTERMEDIATE_SIZE,
+    ):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(hidden_size)
+        self.attn = InternVisionAttention(hidden_size, num_heads)
+        self.ls1 = nn.Parameter(torch.ones(hidden_size) * 0.1)
+
+        self.norm2 = nn.LayerNorm(hidden_size)
+        self.mlp = InternVisionMLP(hidden_size, intermediate_size)
+        self.ls2 = nn.Parameter(torch.ones(hidden_size) * 0.1)
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = hidden_states + self.ls1 * self.attn(self.norm1(hidden_states))
+        hidden_states = hidden_states + self.ls2 * self.mlp(self.norm2(hidden_states))
+        return hidden_states
+
+
+class InternVisionEncoder(nn.Module):
+    """
+    InternViT-300M vision encoder.
+
+    Takes pixel_values [B, 3, 448, 448], returns [B, 1025, 1024] (with CLS).
+    """
+
+    def __init__(
+        self,
+        hidden_size=VISION_HIDDEN_SIZE,
+        num_heads=VISION_NUM_HEADS,
+        num_layers=VISION_NUM_LAYERS,
+        intermediate_size=VISION_INTERMEDIATE_SIZE,
+        patch_size=VISION_PATCH_SIZE,
+        image_size=VISION_IMAGE_SIZE,
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.num_patches = (image_size // patch_size) ** 2
+
+        self.patch_embedding = nn.Conv2d(
+            3, hidden_size, kernel_size=patch_size, stride=patch_size, bias=True
+        )
+        self.class_embedding = nn.Parameter(torch.zeros(1, 1, hidden_size))
+        self.position_embedding = nn.Parameter(
+            torch.zeros(1, self.num_patches + 1, hidden_size)
+        )
+
+        self.layers = nn.ModuleList(
+            [
+                InternVisionLayer(hidden_size, num_heads, intermediate_size)
+                for _ in range(num_layers)
+            ]
+        )
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        batch = pixel_values.shape[0]
+        # Cast pixel_values to match Conv2d weight dtype to avoid
+        # "Input type (BFloat16) and bias type (float) should be the same" error
+        # when HF weights are loaded in float32 but input arrives in bf16.
+        # The compiler's --auto-cast=matmult handles the actual mixed-precision math.
+        pixel_values = pixel_values.to(dtype=self.patch_embedding.weight.dtype)
+        patches = self.patch_embedding(pixel_values)
+        patches = patches.flatten(2).transpose(1, 2)
+        cls_tokens = self.class_embedding.expand(batch, -1, -1)
+        hidden_states = torch.cat([cls_tokens, patches], dim=1)
+        hidden_states = hidden_states + self.position_embedding
+        for layer in self.layers:
+            hidden_states = layer(hidden_states)
+        return hidden_states
+
+
+# ---------------------------------------------------------------------------
+# Full vision pipeline (encoder + pixel shuffle + projector + padding)
+# ---------------------------------------------------------------------------
+
+
+class NeuronInternVL3VisionModel(nn.Module):
+    """
+    Full InternVL3 vision pipeline for NxDI tracing.
+
+    Input: pixel_values [B, 3, 448, 448]
+    Output: [1, text_seq_len, text_hidden_size] (padded for scatter_by_index_put)
+
+    Pipeline:
+      1. InternViT encoder -> [B, 1025, 1024]
+      2. Strip CLS token -> [B, 1024, 1024]
+      3. Pixel shuffle (0.5x) -> [B, 256, 4096]
+      4. Projector MLP: LayerNorm -> Linear -> GELU -> Linear -> [B, 256, 3584]
+      5. Pad to text seq_len -> [1, seq_len, 3584]
+    """
+
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.config = config
+
+        # Extract text seq_len for output padding
+        self.text_seq_len = config.text_config.neuron_config.seq_len
+        self.text_hidden_size = config.text_config.hidden_size
+        self.text_dtype = config.text_config.neuron_config.torch_dtype
+
+        # Vision encoder
+        self.encoder = InternVisionEncoder(
+            hidden_size=VISION_HIDDEN_SIZE,
+            num_heads=VISION_NUM_HEADS,
+            num_layers=VISION_NUM_LAYERS,
+            intermediate_size=VISION_INTERMEDIATE_SIZE,
+            patch_size=VISION_PATCH_SIZE,
+            image_size=VISION_IMAGE_SIZE,
+        )
+
+        # Projector MLP (pixel_shuffle output -> text hidden size)
+        proj_input_dim = int(VISION_HIDDEN_SIZE * (1.0 / DOWNSAMPLE_RATIO) ** 2)  # 4096
+        self.proj_norm = nn.LayerNorm(proj_input_dim)  # mlp1.0
+        self.proj_linear1 = nn.Linear(
+            proj_input_dim, TEXT_HIDDEN_SIZE, bias=True
+        )  # mlp1.1
+        # mlp1.2 is GELU (no weights)
+        self.proj_linear2 = nn.Linear(
+            TEXT_HIDDEN_SIZE, TEXT_HIDDEN_SIZE, bias=True
+        )  # mlp1.3
+
+    def pixel_shuffle(self, x: torch.Tensor) -> torch.Tensor:
+        """
+        Pixel shuffle downsampling (downsample_ratio=0.5).
+
+        Input: [B, H*W, C] where H=W=32, C=1024
+        Output: [B, (H/2)*(W/2), C*4] = [B, 256, 4096]
+        """
+        batch, n_patches, channels = x.shape
+        h = w = int(math.sqrt(n_patches))
+
+        x = x.reshape(batch, h, w, channels)
+
+        ratio = DOWNSAMPLE_RATIO
+        x = x.reshape(batch, h, int(w * ratio), int(channels / ratio))
+        x = x.permute(0, 2, 1, 3).contiguous()
+        x = x.reshape(
+            batch, int(h * ratio), int(w * ratio), int(channels / (ratio * ratio))
+        )
+        x = x.permute(0, 2, 1, 3).contiguous()
+
+        return x.reshape(batch, -1, x.shape[-1])
+
+    def pad_to_text_seq_len(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        """
+        Pad vision embeddings to text model's sequence length.
+
+        Input: [B, 256, 3584]
+        Output: [1, text_seq_len, 3584] (zero-padded, batch=1)
+        """
+        hidden_states = hidden_states.to(self.text_dtype)
+        batch, n_tokens, hidden_size = hidden_states.shape
+
+        # Pad sequence dimension to text seq_len
+        if n_tokens < self.text_seq_len:
+            pad = torch.zeros(
+                batch,
+                self.text_seq_len - n_tokens,
+                hidden_size,
+                dtype=hidden_states.dtype,
+                device=hidden_states.device,
+            )
+            hidden_states = torch.cat([hidden_states, pad], dim=1)
+
+        # Reshape to [1, seq_len, hidden] (batch=1 for scatter)
+        hidden_states = hidden_states.view(-1, hidden_size).unsqueeze(0)
+        return hidden_states
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            pixel_values: [batch, 3, 448, 448]
+
+        Returns:
+            vision_embeddings: [1, text_seq_len, text_hidden_size] padded
+        """
+        # Encoder: [B, 1025, 1024]
+        hidden_states = self.encoder(pixel_values)
+
+        # Strip CLS: [B, 1024, 1024]
+        hidden_states = hidden_states[:, 1:, :]
+
+        # Pixel shuffle: [B, 256, 4096]
+        hidden_states = self.pixel_shuffle(hidden_states)
+
+        # Projector MLP: [B, 256, 3584]
+        hidden_states = self.proj_norm(hidden_states)
+        hidden_states = F.gelu(self.proj_linear1(hidden_states))
+        hidden_states = self.proj_linear2(hidden_states)
+
+        # Pad to text seq_len: [1, seq_len, 3584]
+        hidden_states = self.pad_to_text_seq_len(hidden_states)
+
+        return hidden_states
+
+
+# ---------------------------------------------------------------------------
+# Vision model wrapper (for NxDI tracing)
+# ---------------------------------------------------------------------------
+
+
+class InternVL3VisionModelWrapper(ModelWrapper):
+    """
+    Wrapper for tracing the InternVL3 vision encoder on Neuron.
+
+    Uses EncoderModelInstance (no KV cache).
+    Vision buckets represent number of images (always 1 for InternVL3).
+    """
+
+    def __init__(
+        self,
+        config: InferenceConfig,
+        model_cls,
+        tag="",
+        compiler_args=None,
+        priority_model_idx=None,
+        pipeline_execution=True,
+        return_ranked_to_cpu=False,
+        model_init_kwargs={},
+    ) -> None:
+        super().__init__(
+            config,
+            model_cls,
+            tag,
+            compiler_args,
+            priority_model_idx,
+            pipeline_execution,
+            return_ranked_to_cpu,
+            model_init_kwargs,
+        )
+
+    def input_generator(self) -> List[Tuple[torch.Tensor]]:
+        """
+        Generate example inputs for vision encoder tracing.
+
+        InternVL3 processes one 448x448 image at a time (no dynamic patching
+        at this stage). Single bucket with batch=1.
+        """
+        inputs = []
+        # Vision buckets = [1] (single image)
+        for bucket in self.config.vision_config.neuron_config.buckets:
+            pixel_values = torch.ones(
+                [bucket, 3, VISION_IMAGE_SIZE, VISION_IMAGE_SIZE],
+                dtype=self.config.vision_config.neuron_config.torch_dtype,
+            )
+            inputs.append((pixel_values,))
+        return inputs
+
+    def get_model_instance(self):
+        return EncoderModelInstance(model_cls=self.model_cls, config=self.config)
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """
+        Run vision encoder on Neuron.
+
+        Args:
+            pixel_values: [batch, 3, 448, 448]
+
+        Returns:
+            vision_embeddings: [1, text_seq_len, text_hidden_size]
+        """
+        if self.model is None:
+            raise RuntimeError(
+                "Forward called before load. Run load() or load_state_dict() first."
+            )
+        output = self._forward(pixel_values)
+        return output
+
+
+# ---------------------------------------------------------------------------
+# Vision weight conversion
+# ---------------------------------------------------------------------------
+
+
+def convert_vision_hf_to_neuron_state_dict(state_dict: dict) -> dict:
+    """
+    Convert InternVL3 vision + projector weights from HF to NxDI format.
+
+    HF keys:
+      vision_model.embeddings.class_embedding
+      vision_model.embeddings.patch_embedding.{weight,bias}
+      vision_model.embeddings.position_embedding
+      vision_model.encoder.layers.{i}.attn.qkv.{weight,bias}
+      vision_model.encoder.layers.{i}.attn.proj.{weight,bias}
+      vision_model.encoder.layers.{i}.ls1
+      vision_model.encoder.layers.{i}.ls2
+      vision_model.encoder.layers.{i}.norm1.{weight,bias}
+      vision_model.encoder.layers.{i}.norm2.{weight,bias}
+      vision_model.encoder.layers.{i}.mlp.fc1.{weight,bias}
+      vision_model.encoder.layers.{i}.mlp.fc2.{weight,bias}
+      mlp1.0.{weight,bias}  -> proj_norm
+      mlp1.1.{weight,bias}  -> proj_linear1
+      mlp1.3.{weight,bias}  -> proj_linear2
+
+    NxDI keys:
+      encoder.class_embedding
+      encoder.patch_embedding.{weight,bias}
+      encoder.position_embedding
+      encoder.layers.{i}.*
+      proj_norm.{weight,bias}
+      proj_linear1.{weight,bias}
+      proj_linear2.{weight,bias}
+    """
+    neuron_state_dict = {}
+
+    # Projector key mapping
+    PROJECTOR_MAP = {
+        "mlp1.0.weight": "proj_norm.weight",
+        "mlp1.0.bias": "proj_norm.bias",
+        "mlp1.1.weight": "proj_linear1.weight",
+        "mlp1.1.bias": "proj_linear1.bias",
+        "mlp1.3.weight": "proj_linear2.weight",
+        "mlp1.3.bias": "proj_linear2.bias",
+    }
+
+    for key, tensor in state_dict.items():
+        # Projector weights
+        if key in PROJECTOR_MAP:
+            neuron_state_dict[PROJECTOR_MAP[key]] = tensor.detach().clone()
+            continue
+
+        # Vision encoder embeddings
+        if key.startswith("vision_model.embeddings."):
+            suffix = key[len("vision_model.embeddings.") :]
+            neuron_state_dict[f"encoder.{suffix}"] = tensor.detach().clone()
+            continue
+
+        # Vision encoder layers
+        if key.startswith("vision_model.encoder.layers."):
+            suffix = key[len("vision_model.encoder.") :]
+            neuron_state_dict[f"encoder.{suffix}"] = tensor.detach().clone()
+            continue
+
+        # Skip other vision_model keys and all non-vision keys
+        # (text weights handled by text conversion)
+
+    return neuron_state_dict
diff --git a/contrib/models/InternVL3-8B-Instruct/test/__init__.py b/contrib/models/InternVL3-8B-Instruct/test/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/InternVL3-8B-Instruct/test/integration/__init__.py b/contrib/models/InternVL3-8B-Instruct/test/integration/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/InternVL3-8B-Instruct/test/integration/test_model.py b/contrib/models/InternVL3-8B-Instruct/test/integration/test_model.py
new file mode 100644
index 00000000..eafb79ad
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/test/integration/test_model.py
@@ -0,0 +1,242 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Integration tests for InternVL3-8B-Instruct NxDI contrib model.
+
+Validates Neuron model accuracy against CPU reference using logit_validation.
+
+Usage:
+    pytest test_model.py -v --tb=short
+
+Prerequisites:
+    - Model downloaded to MODEL_PATH
+    - Compiled model at COMPILED_MODEL_PATH (run compile_internvl3_vlm.py first)
+    - Neuron runtime available (trn2.3xlarge, LNC=2, TP=4)
+"""
+
+import json
+import math
+import os
+import sys
+from pathlib import Path
+
+import pytest
+import torch
+from torch_neuronx.testing.validation import logit_validation
+from transformers import AutoTokenizer, GenerationConfig
+
+from neuronx_distributed_inference.models.config import NeuronConfig
+from neuronx_distributed_inference.utils.accuracy import generate_expected_logits
+from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter
+
+# Import from src directory
+sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src"))
+from modeling_internvl3 import (
+    NeuronInternVL3ForCausalLM,
+    InternVL3InferenceConfig,
+)
+
+
+# Test configuration — override via environment variables if needed
+MODEL_PATH = os.environ.get(
+    "INTERNVL3_MODEL_PATH", "/mnt/models/InternVL3-8B-Instruct/"
+)
+COMPILED_MODEL_PATH = os.environ.get(
+    "INTERNVL3_COMPILED_PATH", "/mnt/models/neuron_models/InternVL3-8B-Instruct/"
+)
+NUM_TOKENS_TO_CHECK = 16
+TOKEN_DIVERGENCE_ATOL = 0.02
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def load_compiled_model(model_path: str, compiled_path: str):
+    """Load pre-compiled InternVL3 model for inference."""
+    config_path = Path(compiled_path) / "neuron_config.json"
+    if not config_path.exists():
+        raise FileNotFoundError(f"neuron_config.json not found at {config_path}")
+
+    with open(config_path) as f:
+        config_data = json.load(f)
+    nc_dict = config_data.get("neuron_config", config_data)
+
+    dtype_str = nc_dict.get("torch_dtype", "torch.bfloat16")
+    if isinstance(dtype_str, str):
+        dtype = (
+            getattr(torch, dtype_str.split(".")[-1])
+            if "torch" in dtype_str
+            else torch.bfloat16
+        )
+    else:
+        dtype = dtype_str
+
+    text_neuron_config = NeuronConfig(
+        tp_degree=nc_dict.get("tp_degree", 4),
+        max_batch_size=nc_dict.get("batch_size", nc_dict.get("max_batch_size", 1)),
+        seq_len=nc_dict.get("seq_len", 2048),
+        torch_dtype=dtype,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+    )
+
+    vision_neuron_config = NeuronConfig(
+        tp_degree=nc_dict.get("tp_degree", 4),
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=dtype,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,
+        save_sharded_checkpoint=True,
+    )
+
+    config = InternVL3InferenceConfig.from_pretrained(
+        model_path,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+
+    model = NeuronInternVL3ForCausalLM(model_path, config=config)
+    model.load(compiled_path)
+    return model
+
+
+def get_tp_aligned_vocab_size(vocab_size: int, tp_degree: int) -> int:
+    """Get the largest vocab index that's valid (not TP padding).
+
+    When vocab_size is not divisible by tp_degree, lm_head pads up to the
+    next multiple of tp_degree. The padded positions contain -FLT_MAX in
+    the Neuron output. Truncate to this boundary to avoid false failures.
+    """
+    return (vocab_size // tp_degree) * tp_degree
+
+
+# ---------------------------------------------------------------------------
+# Integration tests (requires pre-compiled Neuron model)
+# ---------------------------------------------------------------------------
+
+
+class TestInternVL3Integration:
+    """Integration tests for InternVL3-8B-Instruct on Neuron."""
+
+    @pytest.fixture(scope="class")
+    def neuron_model(self):
+        """Load pre-compiled Neuron model (shared across tests in this class)."""
+        compiled_path = Path(COMPILED_MODEL_PATH)
+        if not compiled_path.exists():
+            pytest.skip(f"Compiled model not found at {compiled_path}")
+        return load_compiled_model(MODEL_PATH, COMPILED_MODEL_PATH)
+
+    @pytest.fixture(scope="class")
+    def tokenizer(self):
+        """Load tokenizer."""
+        return AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    def test_config(self, neuron_model):
+        """Validate InternVL3 config matches expected Qwen2.5-7B architecture."""
+        config = neuron_model.config
+        assert config.hidden_size == 3584
+        assert config.num_attention_heads == 28
+        assert config.num_key_value_heads == 4
+        assert config.num_hidden_layers == 28
+        assert config.intermediate_size == 18944
+        assert config.vocab_size == 151674
+        assert config.rope_theta == 1000000.0
+
+    def test_text_logit_validation(self, neuron_model, tokenizer):
+        """
+        Validate text-only logit accuracy: Neuron vs CPU reference.
+
+        Uses generate_expected_logits() for CPU golden logits and
+        logit_validation() for multi-tier logit comparison.
+
+        InternVL3 vocab_size (151674) is not divisible by TP degree (4),
+        so lm_head pads to the next multiple (151676). The 2 padding
+        positions get -FLT_MAX in Neuron output. We truncate both CPU
+        and Neuron logits to the TP-aligned boundary (151672) to avoid
+        false failures from these padding artifacts.
+        """
+        prompt = "The capital of France is"
+        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+        attention_mask = torch.ones_like(input_ids, dtype=torch.int32)
+
+        generation_config = GenerationConfig(
+            do_sample=False,
+            max_new_tokens=NUM_TOKENS_TO_CHECK,
+            eos_token_id=tokenizer.eos_token_id,
+            pad_token_id=tokenizer.pad_token_id or 0,
+        )
+
+        # 1. Generate CPU reference logits
+        expected_logits = generate_expected_logits(
+            neuron_model=neuron_model,
+            input_ids=input_ids,
+            inputs_attention_mask=attention_mask,
+            generation_config=generation_config,
+            num_tokens=NUM_TOKENS_TO_CHECK,
+        )
+
+        # 2. Truncate expected logits to TP-aligned vocab boundary
+        tp_degree = neuron_model.config.neuron_config.tp_degree
+        vocab_size = neuron_model.config.vocab_size
+        aligned_vocab = get_tp_aligned_vocab_size(vocab_size, tp_degree)
+        expected_logits = expected_logits[..., :aligned_vocab]
+
+        # 3. Build generate_fn that returns Neuron logits (also truncated)
+        adapter = HuggingFaceGenerationAdapter(neuron_model)
+
+        expected_attention_mask = torch.ones(
+            (attention_mask.shape[0], expected_logits.shape[0]),
+            dtype=torch.int32,
+        )
+        extrapolated_attention_mask = torch.cat(
+            (attention_mask, expected_attention_mask), dim=1
+        )
+
+        def generate_fn(input_ids_tensor):
+            neuron_model.reset()
+            input_length = input_ids_tensor.shape[1]
+            attn_mask = extrapolated_attention_mask[:, :input_length]
+            new_tokens = NUM_TOKENS_TO_CHECK + input_ids.shape[1] - input_length
+
+            outputs = adapter.generate(
+                input_ids=input_ids_tensor,
+                attention_mask=attn_mask,
+                max_new_tokens=new_tokens,
+                min_new_tokens=new_tokens,
+                do_sample=False,
+                return_dict_in_generate=True,
+                output_scores=True,
+                generation_config=generation_config,
+            )
+            actual_logits = torch.stack(outputs.scores)
+            # Truncate to TP-aligned vocab boundary
+            actual_logits = actual_logits[..., :aligned_vocab]
+            return actual_logits
+
+        # 4. Run logit_validation with BF16-appropriate tolerances.
+        # Default tolerances (top-5: 0.01, top-50: 0.02) are calibrated for
+        # FP16/FP32 models. BF16 has lower mantissa precision (7 bits vs 10),
+        # so normalized logit errors of 0.03-0.06 are expected.
+        bf16_tol_map = {
+            5: (1e-5, 0.05),
+            50: (1e-5, 0.06),
+            1000: (1e-5, 0.06),
+            None: (1e-5, 0.08),
+        }
+
+        passed, results, status_msg = logit_validation(
+            input_ids=input_ids,
+            generate_fn=generate_fn,
+            expected_logits=expected_logits,
+            tol_map=bf16_tol_map,
+            divergence_difference_tol=TOKEN_DIVERGENCE_ATOL,
+        )
+
+        print(f"\n{status_msg}")
+        assert passed, f"Logit validation failed:\n{status_msg}"
diff --git a/contrib/models/InternVL3-8B-Instruct/test/unit/__init__.py b/contrib/models/InternVL3-8B-Instruct/test/unit/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/InternVL3-8B-Instruct/tp2_nki_sweep.py b/contrib/models/InternVL3-8B-Instruct/tp2_nki_sweep.py
new file mode 100644
index 00000000..c388ed58
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/tp2_nki_sweep.py
@@ -0,0 +1,407 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+TP=2 NKI kernel sweep for InternVL3-8B-Instruct.
+
+Tests all NKI kernel combinations at TP=2 on trn2.3xlarge LNC=2:
+  A) baseline        — no NKI kernels
+  B) qkv+mlp         — qkv_kernel_enabled + mlp_kernel_enabled
+  C) attn_block_tkg   — attn_block_tkg_nki_kernel_enabled (was NCC_ISTP902 at TP=4)
+  D) all_kernels      — qkv + mlp + attn_block_tkg
+
+Each config: compile → load → accuracy check → TTFT (5 runs) → TKG (3 runs, 64 tok).
+
+Usage:
+    python tp2_nki_sweep.py                   # Run all configs
+    python tp2_nki_sweep.py --configs baseline qkv_mlp   # Run specific configs
+    python tp2_nki_sweep.py --skip-compile    # Skip compilation (load from disk)
+
+Output: /mnt/models/tp2_nki_sweep_results.json
+"""
+
+import argparse
+import gc
+import json
+import os
+import sys
+import time
+import traceback
+from pathlib import Path
+
+import torch
+
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+BASE_COMPILED_DIR = "/mnt/models/neuron_models"
+RESULTS_PATH = "/mnt/models/tp2_nki_sweep_results.json"
+
+TP_DEGREE = 2
+SEQ_LEN = 2048
+BATCH_SIZE = 1
+
+# NKI kernel configurations to test
+CONFIGS = {
+    "baseline": {
+        "label": "Baseline (no NKI)",
+        "fused_qkv": False,
+        "qkv_kernel_enabled": False,
+        "mlp_kernel_enabled": False,
+        "attn_block_tkg_nki_kernel_enabled": False,
+    },
+    "qkv_mlp": {
+        "label": "QKV + MLP kernels",
+        "fused_qkv": True,
+        "qkv_kernel_enabled": True,
+        "mlp_kernel_enabled": True,
+        "attn_block_tkg_nki_kernel_enabled": False,
+    },
+    "attn_block": {
+        "label": "Attn Block TKG kernel",
+        "fused_qkv": True,
+        "qkv_kernel_enabled": True,  # Required by attn_block_tkg
+        "mlp_kernel_enabled": False,
+        "attn_block_tkg_nki_kernel_enabled": True,
+    },
+    "all_kernels": {
+        "label": "All NKI kernels",
+        "fused_qkv": True,
+        "qkv_kernel_enabled": True,
+        "mlp_kernel_enabled": True,
+        "attn_block_tkg_nki_kernel_enabled": True,
+    },
+}
+
+
+def compiled_path_for(config_name):
+    return f"{BASE_COMPILED_DIR}/InternVL3-8B-tp2-{config_name}"
+
+
+def create_config(config_name):
+    """Create inference config for a given NKI kernel combination."""
+    cfg = CONFIGS[config_name]
+
+    text_neuron_config = NeuronConfig(
+        tp_degree=TP_DEGREE,
+        max_batch_size=BATCH_SIZE,
+        seq_len=SEQ_LEN,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+        fused_qkv=cfg["fused_qkv"],
+        qkv_kernel_enabled=cfg["qkv_kernel_enabled"],
+        mlp_kernel_enabled=cfg["mlp_kernel_enabled"],
+        attn_block_tkg_nki_kernel_enabled=cfg["attn_block_tkg_nki_kernel_enabled"],
+    )
+
+    vision_neuron_config = NeuronConfig(
+        tp_degree=TP_DEGREE,
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,  # Vision encoder always has fused QKV
+        save_sharded_checkpoint=True,
+    )
+
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+
+    return config
+
+
+def compile_and_load(config_name, config, skip_compile=False):
+    """Compile (or skip) and load model. Returns (model, compile_time) or raises."""
+    cpath = compiled_path_for(config_name)
+    compile_time = None
+
+    if not skip_compile:
+        print(f"\n--- Compiling: {CONFIGS[config_name]['label']} ---")
+        nc = config.text_config.neuron_config
+        print(f"  tp_degree={nc.tp_degree}, fused_qkv={nc.fused_qkv}")
+        print(
+            f"  qkv_kernel={nc.qkv_kernel_enabled}, mlp_kernel={nc.mlp_kernel_enabled}"
+        )
+        print(f"  attn_block_tkg={nc.attn_block_tkg_nki_kernel_enabled}")
+        print(f"  Output: {cpath}")
+
+        model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+        start = time.time()
+        model.compile(cpath)
+        compile_time = time.time() - start
+        print(f"  Compilation: {compile_time:.1f}s ({compile_time / 60:.1f} min)")
+
+        # Load compiled NEFFs and save sharded weights
+        model.load(cpath)
+    else:
+        print(f"\n--- Loading (skip-compile): {CONFIGS[config_name]['label']} ---")
+        model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+        model.load(cpath)
+
+    return model, compile_time
+
+
+def measure_ttft(model, tokenizer, n_runs=5):
+    """Measure TTFT for text-only CTE."""
+    messages = [{"role": "user", "content": "What is the capital of France?"}]
+    templated = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    inputs = tokenizer(templated, return_tensors="pt")
+    input_ids = inputs.input_ids
+    seq_len = input_ids.shape[-1]
+    position_ids = torch.arange(seq_len, dtype=torch.int32).unsqueeze(0)
+    seq_ids = torch.zeros(1, dtype=torch.int32)
+
+    # Warmup
+    model.reset()
+    with torch.no_grad():
+        model(input_ids=input_ids, position_ids=position_ids, seq_ids=seq_ids)
+
+    # Measure
+    times = []
+    for _ in range(n_runs):
+        model.reset()
+        start = time.perf_counter()
+        with torch.no_grad():
+            model(input_ids=input_ids, position_ids=position_ids, seq_ids=seq_ids)
+        times.append(time.perf_counter() - start)
+
+    return sum(times) / len(times) * 1000  # ms
+
+
+def measure_tkg(model, tokenizer, num_tokens=64, n_runs=3):
+    """Measure TKG throughput via HuggingFaceGenerationAdapter."""
+    adapter = HuggingFaceGenerationAdapter(model)
+    messages = [
+        {"role": "user", "content": "Tell me a detailed story about a brave knight."}
+    ]
+    templated = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    inputs = tokenizer(templated, return_tensors="pt")
+    input_ids = inputs.input_ids
+    attention_mask = torch.ones_like(input_ids)
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+
+    # Warmup
+    with torch.no_grad():
+        adapter.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=num_tokens,
+            do_sample=False,
+            eos_token_id=eos_token_id,
+        )
+
+    # Measure
+    results = []
+    for _ in range(n_runs):
+        start = time.perf_counter()
+        with torch.no_grad():
+            out = adapter.generate(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                max_new_tokens=num_tokens,
+                do_sample=False,
+                eos_token_id=eos_token_id,
+            )
+        elapsed = time.perf_counter() - start
+        generated = out.shape[-1] - input_ids.shape[-1]
+        results.append((generated, elapsed))
+
+    avg_tok = sum(r[0] for r in results) / len(results)
+    avg_time = sum(r[1] for r in results) / len(results)
+    tok_s = avg_tok / avg_time if avg_time > 0 else 0.0
+    return tok_s, avg_tok, avg_time
+
+
+def run_accuracy_check(model, tokenizer):
+    """Quick accuracy sanity check — returns (pass, details)."""
+    adapter = HuggingFaceGenerationAdapter(model)
+    prompts = [
+        ("france", "What is the capital of France?"),
+        ("math", "What is 2 + 2? Answer with just the number:"),
+    ]
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+    details = {}
+    for name, text in prompts:
+        messages = [{"role": "user", "content": text}]
+        templated = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        ids = tokenizer(templated, return_tensors="pt")
+        input_ids = ids.input_ids
+        with torch.no_grad():
+            out = adapter.generate(
+                input_ids=input_ids,
+                attention_mask=torch.ones_like(input_ids),
+                max_new_tokens=32,
+                do_sample=False,
+                eos_token_id=eos_token_id,
+            )
+        gen = tokenizer.decode(out[0, input_ids.shape[-1] :], skip_special_tokens=True)
+        details[name] = gen
+        print(f"    {name}: {gen!r}")
+
+    ok = "paris" in details.get("france", "").lower() and "4" in details.get("math", "")
+    return ok, details
+
+
+def unload_model(model):
+    """Best-effort cleanup of a loaded Neuron model."""
+    try:
+        del model
+    except Exception:
+        pass
+    gc.collect()
+    # Short sleep to let the runtime release NC resources
+    time.sleep(3)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="TP=2 NKI kernel sweep")
+    parser.add_argument(
+        "--configs",
+        nargs="+",
+        choices=list(CONFIGS.keys()),
+        default=list(CONFIGS.keys()),
+        help="Which configs to test (default: all)",
+    )
+    parser.add_argument("--skip-compile", action="store_true", help="Skip compilation")
+    args = parser.parse_args()
+
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    all_results = {}
+
+    for config_name in args.configs:
+        cfg_info = CONFIGS[config_name]
+        print("\n" + "=" * 70)
+        print(f"CONFIG: {config_name} — {cfg_info['label']}")
+        print(f"  TP={TP_DEGREE}, seq_len={SEQ_LEN}, batch={BATCH_SIZE}")
+        print("=" * 70)
+
+        result = {
+            "label": cfg_info["label"],
+            "tp_degree": TP_DEGREE,
+            "seq_len": SEQ_LEN,
+            "batch_size": BATCH_SIZE,
+            "fused_qkv": cfg_info["fused_qkv"],
+            "qkv_kernel_enabled": cfg_info["qkv_kernel_enabled"],
+            "mlp_kernel_enabled": cfg_info["mlp_kernel_enabled"],
+            "attn_block_tkg_nki_kernel_enabled": cfg_info[
+                "attn_block_tkg_nki_kernel_enabled"
+            ],
+        }
+
+        try:
+            config = create_config(config_name)
+            model, compile_time = compile_and_load(
+                config_name, config, args.skip_compile
+            )
+            result["compile_time_s"] = round(compile_time, 1) if compile_time else None
+            result["compile_status"] = "PASS"
+
+            # Accuracy
+            print(f"\n  Accuracy check:")
+            acc_ok, acc_details = run_accuracy_check(model, tokenizer)
+            result["accuracy"] = "PASS" if acc_ok else "FAIL"
+            result["accuracy_details"] = acc_details
+            print(f"  Accuracy: {result['accuracy']}")
+
+            # TTFT
+            print(f"\n  TTFT benchmark (5 runs):")
+            ttft = measure_ttft(model, tokenizer, n_runs=5)
+            result["ttft_ms"] = round(ttft, 1)
+            print(f"  TTFT: {ttft:.1f} ms")
+
+            # TKG
+            print(f"\n  TKG benchmark (3 runs, 64 tok):")
+            tok_s, avg_tok, avg_time = measure_tkg(
+                model, tokenizer, num_tokens=64, n_runs=3
+            )
+            result["tkg_tok_s"] = round(tok_s, 1)
+            result["tkg_avg_tokens"] = round(avg_tok, 1)
+            result["tkg_avg_time_s"] = round(avg_time, 3)
+            print(f"  TKG: {tok_s:.1f} tok/s ({avg_tok:.0f} tokens in {avg_time:.3f}s)")
+
+            result["status"] = "PASS"
+
+        except Exception as e:
+            result["status"] = "FAIL"
+            result["error"] = str(e)
+            result["traceback"] = traceback.format_exc()
+            print(f"\n  FAILED: {e}")
+            # Print full traceback for debugging
+            traceback.print_exc()
+
+        all_results[config_name] = result
+
+        # Attempt to unload before next config
+        try:
+            unload_model(model)
+        except Exception:
+            pass
+
+        # Save intermediate results after each config
+        with open(RESULTS_PATH, "w") as f:
+            json.dump(all_results, f, indent=2)
+        print(f"\n  Intermediate results saved to {RESULTS_PATH}")
+
+    # Final summary
+    print("\n\n" + "=" * 70)
+    print(f"TP={TP_DEGREE} NKI KERNEL SWEEP — FINAL SUMMARY")
+    print("=" * 70)
+
+    # Find baseline for delta computation
+    baseline_ttft = all_results.get("baseline", {}).get("ttft_ms")
+    baseline_tkg = all_results.get("baseline", {}).get("tkg_tok_s")
+
+    header = f"{'Config':<20} {'Status':<8} {'Accuracy':<8} {'TTFT(ms)':<10} {'TKG(tok/s)':<12} {'TTFT Δ':<10} {'TKG Δ':<10}"
+    print(header)
+    print("-" * len(header))
+
+    for name in args.configs:
+        r = all_results.get(name, {})
+        status = r.get("status", "N/A")
+        acc = r.get("accuracy", "N/A")
+        ttft = r.get("ttft_ms", None)
+        tkg = r.get("tkg_tok_s", None)
+
+        ttft_str = f"{ttft:.1f}" if ttft else "N/A"
+        tkg_str = f"{tkg:.1f}" if tkg else "N/A"
+
+        if baseline_ttft and ttft and name != "baseline":
+            ttft_d = ((ttft - baseline_ttft) / baseline_ttft) * 100
+            ttft_delta = f"{ttft_d:+.1f}%"
+        else:
+            ttft_delta = "—"
+
+        if baseline_tkg and tkg and name != "baseline":
+            tkg_d = ((tkg - baseline_tkg) / baseline_tkg) * 100
+            tkg_delta = f"{tkg_d:+.1f}%"
+        else:
+            tkg_delta = "—"
+
+        print(
+            f"{name:<20} {status:<8} {acc:<8} {ttft_str:<10} {tkg_str:<12} {ttft_delta:<10} {tkg_delta:<10}"
+        )
+
+    print(f"\nFull results: {RESULTS_PATH}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/InternVL3-8B-Instruct/validate_cte.py b/contrib/models/InternVL3-8B-Instruct/validate_cte.py
new file mode 100644
index 00000000..5feae62e
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/validate_cte.py
@@ -0,0 +1,617 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Task 009: CTE (Prefill) Validation for InternVL3-8B-Instruct on Neuron.
+
+Compares Neuron VLM CTE logits against CPU reference outputs.
+
+Tests:
+1. Full logit tensor comparison: "I believe the meaning of life is"
+   - Per-position top-1 match
+   - Cosine similarity per position
+   - Max absolute error
+2. Text-only top-1 match: "What is the capital of France?"
+3. Multimodal: "<image>\nDescribe this image briefly." with test_image.png
+
+Usage:
+    python validate_cte.py [--skip-compile]
+"""
+
+import json
+import sys
+import time
+from pathlib import Path
+
+import torch
+
+# Add contrib src to path
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_PATH = "/mnt/models/neuron_models/InternVL3-8B-VLM/"
+
+# Reference files from Task 006
+REF_LOGITS_PT = "/mnt/models/reference_logits.pt"
+REF_LOGITS_JSON = "/mnt/models/reference_logits.json"
+REF_TEXT_ONLY_JSON = "/mnt/models/reference_text_only.json"
+REF_IMAGE_JSON = "/mnt/models/reference_image.json"
+TEST_IMAGE = "/mnt/models/test_image.png"
+
+
+def create_config():
+    """Create InternVL3 VLM inference config."""
+    text_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=2048,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+    )
+    vision_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,
+        save_sharded_checkpoint=True,
+    )
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+    return config
+
+
+def load_model(skip_compile=False):
+    """Load compiled VLM model."""
+    config = create_config()
+    model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+
+    if not skip_compile:
+        print("Compiling model...")
+        model.compile(COMPILED_PATH)
+
+    print("Loading model...")
+    start = time.time()
+    model.load(COMPILED_PATH)
+    print(f"Load completed in {time.time() - start:.1f}s")
+    return model
+
+
+def run_cte(model, input_ids, pixel_values=None):
+    """Run CTE (prefill) and return raw logits from the NEFF output."""
+    seq_len = input_ids.shape[-1]
+    position_ids = torch.arange(seq_len, dtype=torch.int32).unsqueeze(0)
+    seq_ids = torch.zeros(1, dtype=torch.int32)
+
+    with torch.no_grad():
+        outputs = model(
+            input_ids=input_ids,
+            position_ids=position_ids,
+            seq_ids=seq_ids,
+            pixel_values=pixel_values,
+        )
+    return outputs
+
+
+def validate_logits(test_name, neuron_logits, ref_logits, input_ids, tokenizer):
+    """
+    Compare Neuron CTE logits against CPU reference.
+
+    Returns dict of metrics and pass/fail status.
+    """
+    results = {"test": test_name, "pass": True, "details": []}
+    n_positions = input_ids.shape[-1]
+
+    # Neuron output may only have last-position logits (on-device sampling disabled)
+    # or full sequence logits depending on CTE implementation
+    neuron_f = neuron_logits.float()
+    ref_f = ref_logits.float()
+
+    # Check shapes
+    results["neuron_shape"] = list(neuron_f.shape)
+    results["ref_shape"] = list(ref_f.shape)
+
+    # NxDI CTE typically returns [batch, 1, vocab] (last position only)
+    # Reference has [batch, seq_len, vocab] (all positions)
+    # Detect this case and compare last positions
+    neuron_seq_len = neuron_f.shape[1] if neuron_f.dim() == 3 else 1
+    ref_seq_len = ref_f.shape[1] if ref_f.dim() == 3 else 1
+
+    if neuron_seq_len < ref_seq_len:
+        # Neuron returns fewer positions (typically just the last one)
+        # Compare against the last position of reference
+        print(
+            f"  Note: Neuron returns {neuron_seq_len} positions, reference has {ref_seq_len}. Comparing last position only."
+        )
+
+        neuron_last = neuron_f[0, -1] if neuron_f.dim() == 3 else neuron_f[-1]
+        ref_last = ref_f[0, -1] if ref_f.dim() == 3 else ref_f[-1]
+
+        # Neuron CTE may pad unused logit positions with -FLT_MAX (~-3.4e38).
+        # Mask these out for cosine similarity computation.
+        valid_mask = neuron_last > -1e30
+        neuron_valid = neuron_last[valid_mask]
+        ref_valid = ref_last[valid_mask]
+        n_valid = valid_mask.sum().item()
+        n_total = neuron_last.numel()
+        results["n_valid_logits"] = n_valid
+        results["n_total_logits"] = n_total
+        if n_valid < n_total:
+            print(
+                f"  Note: {n_total - n_valid} logit positions masked (padding -FLT_MAX)"
+            )
+
+        # Top-1 match
+        neuron_top1 = neuron_last.argmax().item()
+        ref_top1 = ref_last.argmax().item()
+        match = neuron_top1 == ref_top1
+        results["last_pos_top1_match"] = match
+        results["neuron_top1"] = neuron_top1
+        results["ref_top1"] = ref_top1
+        results["neuron_top1_token"] = tokenizer.decode([neuron_top1])
+        results["ref_top1_token"] = tokenizer.decode([ref_top1])
+        if not match:
+            results["pass"] = False
+            results["details"].append(
+                f"Top-1 mismatch at last pos: neuron={neuron_top1} "
+                f"({tokenizer.decode([neuron_top1])!r}) vs "
+                f"ref={ref_top1} ({tokenizer.decode([ref_top1])!r})"
+            )
+
+        # Cosine similarity at last position (using valid logits only)
+        if n_valid > 0:
+            cos_sim = torch.nn.functional.cosine_similarity(
+                neuron_valid.unsqueeze(0), ref_valid.unsqueeze(0)
+            ).item()
+        else:
+            cos_sim = 0.0
+        results["last_pos_cosine_sim"] = cos_sim
+        if cos_sim < 0.95:
+            results["pass"] = False
+            results["details"].append(f"Cosine sim too low at last pos: {cos_sim:.6f}")
+
+        # Top-5 overlap
+        neuron_top5 = set(torch.topk(neuron_last, 5).indices.tolist())
+        ref_top5 = set(torch.topk(ref_last, 5).indices.tolist())
+        overlap = len(neuron_top5 & ref_top5)
+        results["last_pos_top5_overlap"] = f"{overlap}/5"
+
+        # Max abs error on valid (non-padding) logits
+        if n_valid > 0:
+            max_abs_err = (neuron_valid - ref_valid).abs().max().item()
+            results["max_abs_error_valid"] = max_abs_err
+
+        # Top-10 overlap
+        neuron_top10 = set(torch.topk(neuron_last, 10).indices.tolist())
+        ref_top10 = set(torch.topk(ref_last, 10).indices.tolist())
+        overlap10 = len(neuron_top10 & ref_top10)
+        results["last_pos_top10_overlap"] = f"{overlap10}/10"
+
+        print(
+            f"  [Last pos] top1_match={match}, cosine={cos_sim:.6f}, "
+            f"top5_overlap={overlap}/5, top10_overlap={overlap10}/10"
+        )
+        if n_valid > 0:
+            print(f"  max_abs_error (valid logits): {max_abs_err:.4f}")
+
+    elif neuron_seq_len == ref_seq_len and neuron_f.dim() == 3 and ref_f.dim() == 3:
+        # Full sequence logits available — compare per-position
+        n_compare = min(neuron_f.shape[1], ref_f.shape[1])
+        top1_matches = 0
+        cos_sims = []
+
+        for pos in range(n_compare):
+            n_pos = neuron_f[0, pos]
+            r_pos = ref_f[0, pos]
+
+            # Top-1 match
+            n_top1 = n_pos.argmax().item()
+            r_top1 = r_pos.argmax().item()
+            if n_top1 == r_top1:
+                top1_matches += 1
+
+            # Cosine similarity
+            cos = torch.nn.functional.cosine_similarity(
+                n_pos.unsqueeze(0), r_pos.unsqueeze(0)
+            ).item()
+            cos_sims.append(cos)
+
+            # Detailed per-position log
+            n_token = tokenizer.decode([n_top1])
+            r_token = tokenizer.decode([r_top1])
+            match_str = "MATCH" if n_top1 == r_top1 else "MISMATCH"
+            print(
+                f"  pos {pos}: {match_str} | neuron={n_top1} ({n_token!r}) "
+                f"ref={r_top1} ({r_token!r}) | cosine={cos:.6f}"
+            )
+
+        top1_rate = top1_matches / n_compare
+        avg_cos = sum(cos_sims) / len(cos_sims)
+        min_cos = min(cos_sims)
+
+        results["top1_match_rate"] = f"{top1_matches}/{n_compare} ({top1_rate:.1%})"
+        results["avg_cosine_sim"] = avg_cos
+        results["min_cosine_sim"] = min_cos
+
+        # Max absolute error
+        max_abs_err = (
+            (neuron_f[0, :n_compare] - ref_f[0, :n_compare]).abs().max().item()
+        )
+        results["max_abs_error"] = max_abs_err
+
+        print(
+            f"\n  Summary: top1={top1_matches}/{n_compare} ({top1_rate:.1%}), "
+            f"avg_cos={avg_cos:.6f}, min_cos={min_cos:.6f}, max_abs_err={max_abs_err:.4f}"
+        )
+
+        if top1_rate < 0.7:
+            results["pass"] = False
+            results["details"].append(f"Top-1 match rate too low: {top1_rate:.1%}")
+        if min_cos < 0.90:
+            results["pass"] = False
+            results["details"].append(f"Min cosine sim too low: {min_cos:.6f}")
+    else:
+        # Shape mismatch — just compare what we can
+        results["details"].append(
+            f"Shape mismatch: neuron={neuron_f.shape} vs ref={ref_f.shape}"
+        )
+        # Fall back to comparing last-position argmax
+        if neuron_f.dim() >= 2:
+            neuron_last = neuron_f[0, -1] if neuron_f.dim() == 3 else neuron_f[-1]
+        else:
+            neuron_last = neuron_f
+        ref_last = ref_f[0, -1] if ref_f.dim() == 3 else ref_f[-1]
+
+        neuron_top1 = neuron_last.argmax().item()
+        ref_top1 = ref_last.argmax().item()
+        match = neuron_top1 == ref_top1
+        results["last_pos_top1_match"] = match
+        results["neuron_top1"] = neuron_top1
+        results["ref_top1"] = ref_top1
+        print(f"  Shape mismatch fallback: top1_match={match}")
+
+    return results
+
+
+def test_1_full_logits(model, tokenizer):
+    """Test 1: Full logit tensor comparison."""
+    print("\n" + "=" * 60)
+    print("TEST 1: Full logit tensor comparison")
+    print("=" * 60)
+
+    # Load reference
+    with open(REF_LOGITS_JSON) as f:
+        ref_meta = json.load(f)
+    ref_logits = torch.load(REF_LOGITS_PT, weights_only=True)
+
+    prompt = ref_meta["prompt"]
+    ref_input_ids = ref_meta["input_ids"]
+    print(f"Prompt: {prompt!r}")
+    print(f"Reference input_ids: {ref_input_ids}")
+    print(f"Reference logits shape: {ref_logits.shape}")
+
+    # Tokenize and verify
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    print(f"Neuron input_ids: {input_ids[0].tolist()}")
+
+    if input_ids[0].tolist() != ref_input_ids:
+        print("WARNING: Input IDs differ from reference!")
+        print(f"  Neuron: {input_ids[0].tolist()}")
+        print(f"  Ref:    {ref_input_ids}")
+
+    # Run CTE
+    outputs = run_cte(model, input_ids)
+
+    # Extract logits
+    if hasattr(outputs, "logits"):
+        neuron_logits = outputs.logits
+    elif isinstance(outputs, tuple):
+        neuron_logits = outputs[0]
+    else:
+        neuron_logits = outputs
+
+    print(f"Neuron output shape: {neuron_logits.shape}")
+
+    return validate_logits(
+        "full_logits", neuron_logits, ref_logits, input_ids, tokenizer
+    )
+
+
+def test_2_text_only(model, tokenizer):
+    """Test 2: Text-only top-1 validation."""
+    print("\n" + "=" * 60)
+    print("TEST 2: Text-only top-1 validation")
+    print("=" * 60)
+
+    with open(REF_TEXT_ONLY_JSON) as f:
+        ref = json.load(f)
+
+    prompt = ref["prompt"]
+    ref_top5 = ref["top5_tokens"]  # [[id, logit], ...]
+    ref_top1_id = ref_top5[0][0]
+    ref_top1_logit = ref_top5[0][1]
+
+    print(f"Prompt: {prompt!r}")
+    print(
+        f"Reference top-1: id={ref_top1_id} ({tokenizer.decode([ref_top1_id])!r}), logit={ref_top1_logit}"
+    )
+
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    print(f"Input IDs shape: {input_ids.shape}")
+
+    outputs = run_cte(model, input_ids)
+
+    if hasattr(outputs, "logits"):
+        logits = outputs.logits
+    elif isinstance(outputs, tuple):
+        logits = outputs[0]
+    else:
+        logits = outputs
+
+    # Get last position logits
+    if logits.dim() == 3:
+        last_logits = logits[0, -1].float()
+    elif logits.dim() == 2:
+        last_logits = logits[-1].float()
+    else:
+        last_logits = logits.float()
+
+    neuron_top5 = torch.topk(last_logits, 5)
+    neuron_top1_id = neuron_top5.indices[0].item()
+    neuron_top1_logit = neuron_top5.values[0].item()
+
+    print(f"\nNeuron top-5:")
+    for i, (tid, tval) in enumerate(zip(neuron_top5.indices, neuron_top5.values)):
+        token = tokenizer.decode([tid.item()])
+        ref_match = " ← MATCH" if tid.item() == ref_top5[i][0] else ""
+        print(f"  {i}: id={tid.item()} ({token!r}), logit={tval.item():.4f}{ref_match}")
+
+    match = neuron_top1_id == ref_top1_id
+    result = {
+        "test": "text_only_top1",
+        "pass": match,
+        "prompt": prompt,
+        "neuron_top1": neuron_top1_id,
+        "neuron_top1_token": tokenizer.decode([neuron_top1_id]),
+        "neuron_top1_logit": neuron_top1_logit,
+        "ref_top1": ref_top1_id,
+        "ref_top1_token": tokenizer.decode([ref_top1_id]),
+        "ref_top1_logit": ref_top1_logit,
+        "details": [],
+    }
+
+    if not match:
+        result["details"].append(
+            f"Top-1 mismatch: neuron={neuron_top1_id} ({tokenizer.decode([neuron_top1_id])!r}) "
+            f"vs ref={ref_top1_id} ({tokenizer.decode([ref_top1_id])!r})"
+        )
+
+    # Top-5 overlap
+    neuron_top5_set = set(neuron_top5.indices.tolist())
+    ref_top5_set = set(t[0] for t in ref_top5)
+    overlap = len(neuron_top5_set & ref_top5_set)
+    result["top5_overlap"] = f"{overlap}/5"
+
+    status = "PASS" if match else "FAIL"
+    print(f"\n  Result: {status} | top1_match={match}, top5_overlap={overlap}/5")
+
+    return result
+
+
+def test_3_multimodal(model, tokenizer):
+    """Test 3: Multimodal CTE validation."""
+    print("\n" + "=" * 60)
+    print("TEST 3: Multimodal CTE validation")
+    print("=" * 60)
+
+    with open(REF_IMAGE_JSON) as f:
+        ref = json.load(f)
+
+    # Check for test image
+    import os
+
+    if not os.path.exists(TEST_IMAGE):
+        print(
+            f"WARNING: Test image not found at {TEST_IMAGE}, using random pixel values"
+        )
+        pixel_values = torch.randn(1, 3, 448, 448)
+        has_real_image = False
+    else:
+        # Load the test image through the proper preprocessing pipeline
+        try:
+            from PIL import Image
+            from torchvision import transforms
+
+            img = Image.open(TEST_IMAGE).convert("RGB")
+            # InternVL3 uses 448x448 with ImageNet normalization
+            transform = transforms.Compose(
+                [
+                    transforms.Resize((448, 448)),
+                    transforms.ToTensor(),
+                    transforms.Normalize(
+                        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
+                    ),
+                ]
+            )
+            pixel_values = transform(img).unsqueeze(0)
+            has_real_image = True
+            print(f"Loaded test image: {TEST_IMAGE}")
+        except Exception as e:
+            print(f"WARNING: Failed to load image ({e}), using random pixel values")
+            pixel_values = torch.randn(1, 3, 448, 448)
+            has_real_image = False
+
+    print(f"Pixel values shape: {pixel_values.shape}")
+
+    # Build input with image tokens
+    IMG_CONTEXT_ID = 151667
+    IMG_START_ID = 151665
+    IMG_END_ID = 151666
+
+    text_before = "Describe this image briefly."
+    text_ids = tokenizer(text_before, return_tensors="pt").input_ids[0]
+
+    img_tokens = torch.full((256,), IMG_CONTEXT_ID, dtype=torch.long)
+    full_ids = torch.cat(
+        [
+            text_ids,
+            torch.tensor([IMG_START_ID]),
+            img_tokens,
+            torch.tensor([IMG_END_ID]),
+        ]
+    ).unsqueeze(0)
+
+    print(f"Input IDs shape: {full_ids.shape}")
+    print(f"IMG_CONTEXT tokens: {(full_ids == IMG_CONTEXT_ID).sum().item()}")
+
+    outputs = run_cte(model, full_ids, pixel_values=pixel_values)
+
+    if hasattr(outputs, "logits"):
+        logits = outputs.logits
+    elif isinstance(outputs, tuple):
+        logits = outputs[0]
+    else:
+        logits = outputs
+
+    # Get last position
+    if logits.dim() == 3:
+        last_logits = logits[0, -1].float()
+    elif logits.dim() == 2:
+        last_logits = logits[-1].float()
+    else:
+        last_logits = logits.float()
+
+    top5 = torch.topk(last_logits, 5)
+    print(f"\nNeuron multimodal top-5:")
+    for i, (tid, tval) in enumerate(zip(top5.indices, top5.values)):
+        token = tokenizer.decode([tid.item()])
+        print(f"  {i}: id={tid.item()} ({token!r}), logit={tval.item():.4f}")
+
+    top1_id = top5.indices[0].item()
+    top1_token = tokenizer.decode([top1_id])
+
+    result = {
+        "test": "multimodal_cte",
+        "pass": True,  # No exact reference for multimodal -- check sanity
+        "has_real_image": has_real_image,
+        "neuron_top1": top1_id,
+        "neuron_top1_token": top1_token,
+        "pixel_values_shape": list(pixel_values.shape),
+        "input_ids_shape": list(full_ids.shape),
+        "details": [],
+    }
+
+    # Sanity checks
+    # 1. Output should not be degenerate (all same logit)
+    # Filter out padding values (-FLT_MAX) before computing std
+    valid_logits = last_logits[last_logits > -1e30]
+    logit_std = valid_logits.std().item() if valid_logits.numel() > 0 else 0.0
+    result["logit_std"] = logit_std
+    if logit_std < 0.1:
+        result["pass"] = False
+        result["details"].append(f"Degenerate logits: std={logit_std:.6f}")
+
+    # 2. Top-1 should be a reasonable token (not padding or special)
+    if top1_id in [0, 1, 2]:
+        result["pass"] = False
+        result["details"].append(f"Top-1 is special token: {top1_id}")
+
+    # 3. If real image, compare with reference response start
+    if has_real_image:
+        ref_response = ref.get("response", "")
+        ref_first_word = ref_response.split()[0] if ref_response else ""
+        result["ref_first_word"] = ref_first_word
+        # Check if top-1 token is consistent with reference response start
+        if ref_first_word and top1_token.strip().lower() == ref_first_word.lower():
+            result["first_word_match"] = True
+        else:
+            result["first_word_match"] = False
+            result["details"].append(
+                f"First word mismatch: neuron={top1_token!r} vs ref={ref_first_word!r} "
+                f"(may be OK due to pixel value preprocessing differences)"
+            )
+
+    status = "PASS" if result["pass"] else "FAIL"
+    print(f"\n  Result: {status} | top1={top1_token!r}, logit_std={logit_std:.4f}")
+
+    return result
+
+
+def main():
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--skip-compile",
+        action="store_true",
+        help="Skip compilation (use existing compiled model)",
+    )
+    args = parser.parse_args()
+
+    print("=" * 60)
+    print("Task 009: CTE Validation - InternVL3-8B-Instruct")
+    print("=" * 60)
+
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    model = load_model(skip_compile=args.skip_compile)
+
+    all_results = []
+
+    # Test 1: Full logit tensor comparison
+    r1 = test_1_full_logits(model, tokenizer)
+    all_results.append(r1)
+
+    # Test 2: Text-only top-1
+    r2 = test_2_text_only(model, tokenizer)
+    all_results.append(r2)
+
+    # Test 3: Multimodal
+    r3 = test_3_multimodal(model, tokenizer)
+    all_results.append(r3)
+
+    # Summary
+    print("\n" + "=" * 60)
+    print("VALIDATION SUMMARY")
+    print("=" * 60)
+    all_pass = True
+    for r in all_results:
+        status = "PASS" if r["pass"] else "FAIL"
+        print(f"  {r['test']}: {status}")
+        if r["details"]:
+            for d in r["details"]:
+                print(f"    - {d}")
+        if not r["pass"]:
+            all_pass = False
+
+    print()
+    if all_pass:
+        print("OVERALL: ALL TESTS PASSED")
+    else:
+        print("OVERALL: SOME TESTS FAILED")
+
+    # Save results
+    results_path = "/mnt/models/validation_cte_results.json"
+    with open(results_path, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to {results_path}")
+
+    return 0 if all_pass else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/contrib/models/InternVL3-8B-Instruct/validate_tkg.py b/contrib/models/InternVL3-8B-Instruct/validate_tkg.py
new file mode 100644
index 00000000..a3fb46db
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/validate_tkg.py
@@ -0,0 +1,501 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Task 010: TKG (Token Generation) Validation for InternVL3-8B-Instruct on Neuron.
+
+Validates the full CTE→TKG autoregressive generation loop using
+HuggingFaceGenerationAdapter for clean multi-token decoding.
+
+Tests:
+1. Text-only generation: "What is the capital of France?" → compare first 20 tokens
+2. Text-only generation: "I believe the meaning of life is" → compare first 20 tokens
+3. Multimodal generation: image + "Describe this image briefly." → coherence check
+4. State reset: consecutive generations produce correct output
+5. EOS handling: generation stops at EOS token
+
+Usage:
+    python validate_tkg.py [--skip-compile] [--max-new-tokens 32]
+"""
+
+import json
+import sys
+import time
+from pathlib import Path
+
+import torch
+
+# Add contrib src to path
+sys.path.insert(0, str(Path(__file__).parent / "src"))
+
+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"
+COMPILED_PATH = "/mnt/models/neuron_models/InternVL3-8B-VLM/"
+
+# Reference files
+REF_TEXT_ONLY_JSON = "/mnt/models/reference_text_only.json"
+REF_LOGITS_JSON = "/mnt/models/reference_logits.json"
+REF_IMAGE_JSON = "/mnt/models/reference_image.json"
+TEST_IMAGE = "/mnt/models/test_image.png"
+
+
+def create_config():
+    """Create InternVL3 VLM inference config."""
+    text_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=2048,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        save_sharded_checkpoint=True,
+    )
+    vision_neuron_config = NeuronConfig(
+        tp_degree=4,
+        max_batch_size=1,
+        seq_len=256,
+        torch_dtype=torch.bfloat16,
+        on_device_sampling_config=None,
+        buckets=[1],
+        fused_qkv=True,
+        save_sharded_checkpoint=True,
+    )
+    config = InternVL3InferenceConfig.from_pretrained(
+        MODEL_PATH,
+        text_neuron_config=text_neuron_config,
+        vision_neuron_config=vision_neuron_config,
+    )
+    return config
+
+
+def load_model(skip_compile=False):
+    """Load compiled VLM model."""
+    config = create_config()
+    model = NeuronInternVL3ForCausalLM(MODEL_PATH, config=config)
+
+    if not skip_compile:
+        print("Compiling model...")
+        model.compile(COMPILED_PATH)
+
+    print("Loading model...")
+    start = time.time()
+    model.load(COMPILED_PATH)
+    print(f"Load completed in {time.time() - start:.1f}s")
+    return model
+
+
+def generate_text(
+    adapter, tokenizer, input_ids, attention_mask, max_new_tokens=32, **kwargs
+):
+    """Generate tokens using HuggingFaceGenerationAdapter."""
+    with torch.no_grad():
+        output_ids = adapter.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,  # greedy decoding for deterministic output
+            **kwargs,
+        )
+    return output_ids
+
+
+def compute_token_match_rate(neuron_ids, ref_text, tokenizer, n_compare=None):
+    """Compute token-level match rate between Neuron output and reference text."""
+    ref_ids = tokenizer(ref_text, return_tensors="pt").input_ids[0]
+
+    if n_compare is None:
+        n_compare = min(len(neuron_ids), len(ref_ids))
+
+    matches = 0
+    details = []
+    for i in range(n_compare):
+        n_id = neuron_ids[i].item() if i < len(neuron_ids) else -1
+        r_id = ref_ids[i].item() if i < len(ref_ids) else -1
+        match = n_id == r_id
+        if match:
+            matches += 1
+        n_tok = tokenizer.decode([n_id]) if n_id >= 0 else "<N/A>"
+        r_tok = tokenizer.decode([r_id]) if r_id >= 0 else "<N/A>"
+        details.append(
+            {
+                "pos": i,
+                "match": match,
+                "neuron_id": n_id,
+                "neuron_token": n_tok,
+                "ref_id": r_id,
+                "ref_token": r_tok,
+            }
+        )
+
+    rate = matches / n_compare if n_compare > 0 else 0
+    return rate, matches, n_compare, details
+
+
+def test_1_text_generation(adapter, tokenizer, max_new_tokens):
+    """Test 1: Text-only generation - capital of France."""
+    print("\n" + "=" * 60)
+    print("TEST 1: Text-only generation (capital of France)")
+    print("=" * 60)
+
+    with open(REF_TEXT_ONLY_JSON) as f:
+        ref = json.load(f)
+
+    prompt = ref["prompt"]
+    ref_response = ref["response"]
+    print(f"Prompt: {prompt!r}")
+    print(f"Reference response: {ref_response[:100]!r}...")
+
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    attention_mask = inputs.attention_mask
+    prompt_len = input_ids.shape[-1]
+
+    print(f"Input IDs: {input_ids[0].tolist()}")
+    print(f"Generating {max_new_tokens} tokens (greedy)...")
+
+    start = time.time()
+    output_ids = generate_text(
+        adapter, tokenizer, input_ids, attention_mask, max_new_tokens
+    )
+    gen_time = time.time() - start
+
+    generated_ids = output_ids[0, prompt_len:]
+    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
+    full_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+
+    print(f"\nGenerated ({gen_time:.2f}s, {len(generated_ids)} tokens):")
+    print(f"  {generated_text!r}")
+
+    # Compare against reference
+    rate, matches, n_compare, details = compute_token_match_rate(
+        generated_ids, ref_response, tokenizer, n_compare=min(20, len(generated_ids))
+    )
+
+    print(f"\nToken match rate (first {n_compare}): {matches}/{n_compare} ({rate:.1%})")
+    for d in details[:20]:
+        match_str = "MATCH" if d["match"] else "MISS "
+        print(
+            f"  pos {d['pos']:2d}: {match_str} | neuron={d['neuron_token']!r:10s} ref={d['ref_token']!r:10s}"
+        )
+
+    result = {
+        "test": "text_generation_france",
+        "pass": True,
+        "prompt": prompt,
+        "generated_text": generated_text[:200],
+        "ref_response": ref_response[:200],
+        "n_generated": len(generated_ids),
+        "gen_time_s": round(gen_time, 2),
+        "token_match_rate": f"{matches}/{n_compare} ({rate:.1%})",
+        "details": [],
+    }
+
+    # Check coherence
+    if "Paris" not in full_text and "paris" not in full_text.lower():
+        result["pass"] = False
+        result["details"].append("Response doesn't mention Paris")
+
+    if rate < 0.5:
+        result["details"].append(f"Token match rate below 50%: {rate:.1%}")
+        # Don't fail on match rate alone — bf16 can diverge after a few tokens
+
+    status = "PASS" if result["pass"] else "FAIL"
+    print(f"\n  Result: {status}")
+    return result
+
+
+def test_2_text_generation_life(adapter, tokenizer, max_new_tokens):
+    """Test 2: Text-only generation - meaning of life."""
+    print("\n" + "=" * 60)
+    print("TEST 2: Text-only generation (meaning of life)")
+    print("=" * 60)
+
+    with open(REF_LOGITS_JSON) as f:
+        ref = json.load(f)
+
+    prompt = ref["prompt"]
+    ref_response = ref["full_response"]
+    print(f"Prompt: {prompt!r}")
+    print(f"Reference response: {ref_response[:100]!r}...")
+
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    attention_mask = inputs.attention_mask
+    prompt_len = input_ids.shape[-1]
+
+    print(f"Generating {max_new_tokens} tokens (greedy)...")
+
+    start = time.time()
+    output_ids = generate_text(
+        adapter, tokenizer, input_ids, attention_mask, max_new_tokens
+    )
+    gen_time = time.time() - start
+
+    generated_ids = output_ids[0, prompt_len:]
+    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
+
+    print(f"\nGenerated ({gen_time:.2f}s, {len(generated_ids)} tokens):")
+    print(f"  {generated_text!r}")
+
+    # Token match
+    rate, matches, n_compare, details = compute_token_match_rate(
+        generated_ids, ref_response, tokenizer, n_compare=min(20, len(generated_ids))
+    )
+
+    print(f"\nToken match rate (first {n_compare}): {matches}/{n_compare} ({rate:.1%})")
+    for d in details[:10]:
+        match_str = "MATCH" if d["match"] else "MISS "
+        print(
+            f"  pos {d['pos']:2d}: {match_str} | neuron={d['neuron_token']!r:10s} ref={d['ref_token']!r:10s}"
+        )
+
+    result = {
+        "test": "text_generation_life",
+        "pass": True,
+        "prompt": prompt,
+        "generated_text": generated_text[:200],
+        "ref_response": ref_response[:200],
+        "n_generated": len(generated_ids),
+        "gen_time_s": round(gen_time, 2),
+        "token_match_rate": f"{matches}/{n_compare} ({rate:.1%})",
+        "details": [],
+    }
+
+    # Check the output is coherent English (not garbage)
+    if len(generated_text.strip()) < 10:
+        result["pass"] = False
+        result["details"].append("Generated text too short or empty")
+
+    status = "PASS" if result["pass"] else "FAIL"
+    print(f"\n  Result: {status}")
+    return result
+
+
+def test_3_multimodal_generation(adapter, tokenizer, max_new_tokens):
+    """Test 3: Multimodal generation with real image."""
+    print("\n" + "=" * 60)
+    print("TEST 3: Multimodal generation")
+    print("=" * 60)
+
+    import os
+
+    if not os.path.exists(TEST_IMAGE):
+        print(f"WARNING: Test image not found at {TEST_IMAGE}, using random pixels")
+        pixel_values = torch.randn(1, 3, 448, 448)
+    else:
+        try:
+            from PIL import Image
+            from torchvision import transforms
+
+            img = Image.open(TEST_IMAGE).convert("RGB")
+            transform = transforms.Compose(
+                [
+                    transforms.Resize((448, 448)),
+                    transforms.ToTensor(),
+                    transforms.Normalize(
+                        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
+                    ),
+                ]
+            )
+            pixel_values = transform(img).unsqueeze(0)
+            print(f"Loaded test image: {TEST_IMAGE}")
+        except Exception as e:
+            print(f"WARNING: Failed to load image ({e}), using random pixels")
+            pixel_values = torch.randn(1, 3, 448, 448)
+
+    # Build multimodal input
+    IMG_CONTEXT_ID = 151667
+    IMG_START_ID = 151665
+    IMG_END_ID = 151666
+
+    text_prompt = "Describe this image briefly."
+    text_ids = tokenizer(text_prompt, return_tensors="pt").input_ids[0]
+
+    img_tokens = torch.full((256,), IMG_CONTEXT_ID, dtype=torch.long)
+    full_ids = torch.cat(
+        [
+            text_ids,
+            torch.tensor([IMG_START_ID]),
+            img_tokens,
+            torch.tensor([IMG_END_ID]),
+        ]
+    ).unsqueeze(0)
+
+    prompt_len = full_ids.shape[-1]
+    attention_mask = torch.ones_like(full_ids)
+
+    print(f"Input IDs shape: {full_ids.shape}")
+    print(f"Generating {max_new_tokens} tokens with image...")
+
+    start = time.time()
+    output_ids = generate_text(
+        adapter,
+        tokenizer,
+        full_ids,
+        attention_mask,
+        max_new_tokens,
+        pixel_values=pixel_values,
+    )
+    gen_time = time.time() - start
+
+    generated_ids = output_ids[0, prompt_len:]
+    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
+
+    print(f"\nGenerated ({gen_time:.2f}s, {len(generated_ids)} tokens):")
+    print(f"  {generated_text!r}")
+
+    result = {
+        "test": "multimodal_generation",
+        "pass": True,
+        "prompt": text_prompt,
+        "generated_text": generated_text[:300],
+        "n_generated": len(generated_ids),
+        "gen_time_s": round(gen_time, 2),
+        "details": [],
+    }
+
+    # Coherence checks
+    if len(generated_text.strip()) < 5:
+        result["pass"] = False
+        result["details"].append("Generated text too short or empty")
+
+    # Check not degenerate (repeating same token)
+    if len(generated_ids) > 5:
+        unique_tokens = len(set(generated_ids.tolist()))
+        unique_ratio = unique_tokens / len(generated_ids)
+        result["unique_token_ratio"] = round(unique_ratio, 3)
+        if unique_ratio < 0.1:
+            result["pass"] = False
+            result["details"].append(
+                f"Degenerate output: only {unique_tokens} unique tokens in {len(generated_ids)}"
+            )
+
+    status = "PASS" if result["pass"] else "FAIL"
+    print(f"\n  Result: {status}")
+    return result
+
+
+def test_4_state_reset(adapter, tokenizer, max_new_tokens):
+    """Test 4: State reset - consecutive generations produce correct output."""
+    print("\n" + "=" * 60)
+    print("TEST 4: State reset (consecutive generations)")
+    print("=" * 60)
+
+    prompt = "The capital of France is"
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    attention_mask = inputs.attention_mask
+    prompt_len = input_ids.shape[-1]
+
+    # Run twice
+    print("Run 1...")
+    output_ids_1 = generate_text(adapter, tokenizer, input_ids, attention_mask, 10)
+    text_1 = tokenizer.decode(output_ids_1[0, prompt_len:], skip_special_tokens=True)
+    print(f"  Output 1: {text_1!r}")
+
+    print("Run 2...")
+    output_ids_2 = generate_text(adapter, tokenizer, input_ids, attention_mask, 10)
+    text_2 = tokenizer.decode(output_ids_2[0, prompt_len:], skip_special_tokens=True)
+    print(f"  Output 2: {text_2!r}")
+
+    # Check they match (deterministic greedy decoding)
+    match = torch.equal(output_ids_1, output_ids_2)
+
+    result = {
+        "test": "state_reset",
+        "pass": match,
+        "output_1": text_1,
+        "output_2": text_2,
+        "deterministic": match,
+        "details": [],
+    }
+
+    if not match:
+        result["details"].append(
+            f"Non-deterministic: run1={text_1!r} vs run2={text_2!r}"
+        )
+
+    status = "PASS" if result["pass"] else "FAIL"
+    print(f"\n  Result: {status} (deterministic={match})")
+    return result
+
+
+def main():
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--skip-compile", action="store_true")
+    parser.add_argument("--max-new-tokens", type=int, default=32)
+    args = parser.parse_args()
+
+    print("=" * 60)
+    print("Task 010: TKG Validation - InternVL3-8B-Instruct")
+    print("=" * 60)
+
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    model = load_model(skip_compile=args.skip_compile)
+
+    # Wrap with HuggingFaceGenerationAdapter
+    adapter = HuggingFaceGenerationAdapter(model)
+
+    all_results = []
+
+    # Test 1: Text generation - France
+    r1 = test_1_text_generation(adapter, tokenizer, args.max_new_tokens)
+    all_results.append(r1)
+
+    # Test 2: Text generation - meaning of life
+    r2 = test_2_text_generation_life(adapter, tokenizer, args.max_new_tokens)
+    all_results.append(r2)
+
+    # Test 3: Multimodal generation
+    r3 = test_3_multimodal_generation(adapter, tokenizer, args.max_new_tokens)
+    all_results.append(r3)
+
+    # Test 4: State reset
+    r4 = test_4_state_reset(adapter, tokenizer, args.max_new_tokens)
+    all_results.append(r4)
+
+    # Summary
+    print("\n" + "=" * 60)
+    print("VALIDATION SUMMARY")
+    print("=" * 60)
+    all_pass = True
+    for r in all_results:
+        status = "PASS" if r["pass"] else "FAIL"
+        print(f"  {r['test']}: {status}")
+        if r.get("token_match_rate"):
+            print(f"    token match: {r['token_match_rate']}")
+        if r.get("gen_time_s"):
+            print(
+                f"    gen time: {r['gen_time_s']}s ({r.get('n_generated', '?')} tokens)"
+            )
+        if r["details"]:
+            for d in r["details"]:
+                print(f"    - {d}")
+        if not r["pass"]:
+            all_pass = False
+
+    print()
+    if all_pass:
+        print("OVERALL: ALL TESTS PASSED")
+    else:
+        print("OVERALL: SOME TESTS FAILED")
+
+    # Save results
+    results_path = "/mnt/models/validation_tkg_results.json"
+    with open(results_path, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to {results_path}")
+
+    return 0 if all_pass else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/contrib/models/InternVL3-8B-Instruct/vllm/apply_internvl_patches.py b/contrib/models/InternVL3-8B-Instruct/vllm/apply_internvl_patches.py
new file mode 100644
index 00000000..d2e02723
--- /dev/null
+++ b/contrib/models/InternVL3-8B-Instruct/vllm/apply_internvl_patches.py
@@ -0,0 +1,456 @@
+#!/usr/bin/env python3
+"""
+Apply 6 vLLM patches for InternVL3-8B-Instruct on Neuron.
+
+Patches vllm-neuron 0.5.0 to support InternVL3 via NxDI contrib model.
+
+Requires:
+  - NxDI contrib files at /home/ubuntu/internvl3_contrib/src/
+  - vLLM-neuron installed at /vllm/ (editable install)
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+    python apply_internvl_patches.py [--dry-run] [--revert]
+"""
+
+import argparse
+import os
+import re
+import shutil
+import sys
+
+VLLM_NEURON_DIR = "/vllm/vllm_neuron/worker"
+CONSTANTS_FILE = os.path.join(VLLM_NEURON_DIR, "constants.py")
+LOADER_FILE = os.path.join(VLLM_NEURON_DIR, "neuronx_distributed_model_loader.py")
+RUNNER_FILE = os.path.join(VLLM_NEURON_DIR, "neuronx_distributed_model_runner.py")
+
+CONTRIB_DIR = "/home/ubuntu/internvl3_contrib/src"
+
+
+def backup_file(filepath):
+    """Create .orig backup if it doesn't exist yet."""
+    backup = filepath + ".orig"
+    if not os.path.exists(backup):
+        shutil.copy2(filepath, backup)
+        print(f"  Backed up: {filepath} -> {backup}")
+    else:
+        print(f"  Backup exists: {backup}")
+
+
+def revert_file(filepath):
+    """Revert from .orig backup."""
+    backup = filepath + ".orig"
+    if os.path.exists(backup):
+        shutil.copy2(backup, filepath)
+        print(f"  Reverted: {filepath}")
+    else:
+        print(f"  No backup found for: {filepath}")
+
+
+def read_file(filepath):
+    with open(filepath, "r") as f:
+        return f.read()
+
+
+def write_file(filepath, content):
+    with open(filepath, "w") as f:
+        f.write(content)
+
+
+def apply_patch_1_constants(dry_run=False):
+    """Patch 1: Add InternVLChatModel to NEURON_MULTI_MODAL_MODELS."""
+    print("\n--- Patch 1: Register InternVLChatModel in constants.py ---")
+
+    content = read_file(CONSTANTS_FILE)
+
+    if "InternVLChatModel" in content:
+        print("  Already applied.")
+        return True
+
+    # Add after last entry in NEURON_MULTI_MODAL_MODELS list
+    old = '    "Qwen3VLForConditionalGeneration",\n]'
+    new = '    "Qwen3VLForConditionalGeneration",\n    "InternVLChatModel",\n]'
+
+    if old not in content:
+        print("  ERROR: Cannot find insertion point in constants.py")
+        return False
+
+    if not dry_run:
+        backup_file(CONSTANTS_FILE)
+        content = content.replace(old, new)
+        write_file(CONSTANTS_FILE, content)
+
+    print("  Applied: Added 'InternVLChatModel' to NEURON_MULTI_MODAL_MODELS")
+    return True
+
+
+def apply_patch_2_llm_config_fallback(dry_run=False):
+    """Patch 2: Handle llm_config fallback in _get_model_configs()."""
+    print("\n--- Patch 2: llm_config fallback in _get_model_configs ---")
+
+    content = read_file(LOADER_FILE)
+
+    if "llm_config" in content and 'getattr(config, "llm_config"' in content:
+        print("  Already applied.")
+        return True
+
+    # Find the line: config = getattr(config, "text_config", None)
+    # Replace with fallback that also checks llm_config
+    old = '    if architecture in NEURON_MULTI_MODAL_MODELS:\n        config = getattr(config, "text_config", None)'
+    new = '    if architecture in NEURON_MULTI_MODAL_MODELS:\n        config = getattr(config, "text_config", None)\n        if config is None:\n            config = getattr(config, "llm_config", None)'
+
+    if old not in content:
+        # Try alternative: maybe the indentation is different
+        print("  ERROR: Cannot find insertion point for llm_config fallback")
+        return False
+
+    # Wait, the second getattr uses the already-None config.
+    # We need: try text_config first, if None try llm_config from the ORIGINAL config.
+    # Let me fix: store original config first.
+    old = '    if architecture in NEURON_MULTI_MODAL_MODELS:\n        config = getattr(config, "text_config", None)'
+    new = (
+        "    if architecture in NEURON_MULTI_MODAL_MODELS:\n"
+        "        # InternVL uses llm_config instead of text_config\n"
+        '        config = getattr(config, "text_config", None) or getattr(config, "llm_config", None)'
+    )
+
+    if old not in content:
+        print("  ERROR: Cannot find insertion point for llm_config fallback")
+        return False
+
+    if not dry_run:
+        backup_file(LOADER_FILE)
+        content = content.replace(old, new)
+        write_file(LOADER_FILE, content)
+
+    print("  Applied: Added llm_config fallback in _get_model_configs()")
+    return True
+
+
+def apply_patch_3_internvl_class(dry_run=False):
+    """Patch 3: Add NeuronInternVL3ForCausalLM class with dynamic import from contrib."""
+    print("\n--- Patch 3: Add NeuronInternVL3ForCausalLM class ---")
+
+    content = read_file(LOADER_FILE)
+
+    if "NeuronInternVL3ForCausalLM" in content:
+        print("  Already applied.")
+        return True
+
+    # Insert the class right before the get_neuron_model function
+    class_code = '''
+class NeuronInternVL3ForCausalLM(NeuronMultiModalCausalLM):
+    """InternVL3-8B-Instruct via NxDI contrib model."""
+
+    def _save_pretrained_model(self, model_name: str):
+        # InternVL3 uses trust_remote_code; save locally if needed
+        from transformers import AutoModel
+        hf_model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
+        saved_path = os.path.join("local-models", model_name)
+        hf_model.save_pretrained(saved_path)
+        return saved_path
+
+    def load_weights(self, model_name_or_path: str, architecture: str, **kwargs):
+        """Override to dynamically import contrib model class."""
+        import sys
+        contrib_path = "/home/ubuntu/internvl3_contrib/src"
+        if contrib_path not in sys.path:
+            sys.path.insert(0, contrib_path)
+
+        from modeling_internvl3 import NeuronInternVL3ForCausalLM as ContribModel
+        from neuronx_distributed_inference.utils.constants import MODEL_TYPES
+
+        # Register the contrib model in MODEL_TYPES so _get_neuron_model_cls finds it
+        if "internvl3" not in MODEL_TYPES:
+            MODEL_TYPES["internvl3"] = {}
+        MODEL_TYPES["internvl3"]["causal-lm"] = ContribModel
+
+        return super().load_weights(model_name_or_path, architecture, **kwargs)
+
+    def execute_model(self, model_input, **kwargs):
+        """Forward multimodal inputs for InternVL3."""
+        pixel_values = None
+        if (
+            model_input.multi_modal_kwargs is not None
+            and model_input.multi_modal_kwargs.get("pixel_values") is not None
+        ):
+            pixel_values = model_input.multi_modal_kwargs["pixel_values"]
+
+        # Build vision_mask from image token IDs
+        # InternVL3 uses IMG_CONTEXT token (id=151667) for image placeholders
+        IMG_CONTEXT_ID = 151667
+        vision_mask = (model_input.input_tokens == IMG_CONTEXT_ID).unsqueeze(-1)
+
+        hidden_states = self.forward(
+            input_ids=model_input.input_tokens,
+            positions=model_input.position_ids,
+            input_block_ids=model_input.input_block_ids,
+            sampling_params=model_input.sampling_params,
+            pixel_values=pixel_values,
+            vision_mask=vision_mask,
+            **kwargs,
+        )
+        return hidden_states
+
+    def forward(
+        self,
+        input_ids,
+        positions,
+        input_block_ids,
+        sampling_params,
+        pixel_values=None,
+        vision_mask=None,
+        **kwargs,
+    ):
+        """Forward pass with multimodal support for InternVL3."""
+        # Cast vision tensors to the configured dtype
+        if pixel_values is not None:
+            dtype = self.model.config.vision_config.neuron_config.torch_dtype
+            if isinstance(pixel_values, torch.Tensor):
+                pixel_values = pixel_values.to(dtype)
+            elif isinstance(pixel_values, list):
+                pixel_values = [p.to(dtype) for p in pixel_values]
+
+        with self._reordered(
+            input_block_ids,
+            input_ids=input_ids,
+            positions=positions,
+            sampling_params=sampling_params,
+            pixel_values=pixel_values,
+            vision_mask=vision_mask,
+        ) as (sorted_ids, inputs, restore):
+            output = self.model(
+                inputs["input_ids"].to(torch.int32),
+                attention_mask=None,
+                position_ids=inputs["positions"].to(torch.int32),
+                seq_ids=sorted_ids.flatten().to(torch.int32),
+                pixel_values=inputs.get("pixel_values"),
+                vision_mask=inputs.get("vision_mask"),
+                sampling_params=inputs["sampling_params"],
+                **kwargs,
+            )
+
+            if self.model.config.neuron_config.on_device_sampling_config:
+                output = output.hidden_states
+            else:
+                output = output.logits[:, -1, :]
+
+            return restore(output)
+
+
+'''
+
+    # Insert before: def _get_model_configs
+    anchor = "\ndef _get_model_configs(config: PretrainedConfig) -> str:"
+    if anchor not in content:
+        print("  ERROR: Cannot find insertion anchor for NeuronInternVL3ForCausalLM")
+        return False
+
+    if not dry_run:
+        # backup already done in patch 2, but just in case
+        backup_file(LOADER_FILE)
+        content = content.replace(anchor, class_code + anchor)
+        write_file(LOADER_FILE, content)
+
+    print("  Applied: Added NeuronInternVL3ForCausalLM class")
+    return True
+
+
+def apply_patch_4_get_neuron_model(dry_run=False):
+    """Patch 4: Add InternVLChatModel branch in get_neuron_model()."""
+    print("\n--- Patch 4: Add InternVLChatModel in get_neuron_model() ---")
+
+    content = read_file(LOADER_FILE)
+
+    if 'architecture == "InternVLChatModel"' in content:
+        print("  Already applied.")
+        return True
+
+    # Add after Qwen3VL branch
+    old = (
+        '    elif architecture == "Qwen3VLForConditionalGeneration":\n'
+        "        model = NeuronQwen3VLForCausalLM(model_config.hf_config)\n"
+        "    else:\n"
+        "        model = NeuronCausalLM(model_config.hf_config)"
+    )
+    new = (
+        '    elif architecture == "Qwen3VLForConditionalGeneration":\n'
+        "        model = NeuronQwen3VLForCausalLM(model_config.hf_config)\n"
+        '    elif architecture == "InternVLChatModel":\n'
+        "        model = NeuronInternVL3ForCausalLM(model_config.hf_config)\n"
+        "    else:\n"
+        "        model = NeuronCausalLM(model_config.hf_config)"
+    )
+
+    if old not in content:
+        print("  ERROR: Cannot find insertion point for InternVLChatModel branch")
+        return False
+
+    if not dry_run:
+        backup_file(LOADER_FILE)
+        content = content.replace(old, new)
+        write_file(LOADER_FILE, content)
+
+    print("  Applied: Added InternVLChatModel branch in get_neuron_model()")
+    return True
+
+
+def apply_patch_5_multimodal_processor(dry_run=False):
+    """Patch 5: Add internvl_chat to multimodal data processor."""
+    print("\n--- Patch 5: Add internvl_chat in multimodal processor ---")
+
+    content = read_file(RUNNER_FILE)
+
+    if "internvl_chat" in content:
+        print("  Already applied.")
+        return True
+
+    # Add after llama4 branch
+    old = (
+        '        elif self.model.model.config.model_type == "llama4":\n'
+        "            pass  # llama4 doesn't require special processing\n"
+        "        else:"
+    )
+    new = (
+        '        elif self.model.model.config.model_type == "llama4":\n'
+        "            pass  # llama4 doesn't require special processing\n"
+        '        elif self.model.model.config.model_type == "internvl_chat":\n'
+        "            pass  # InternVL3 handles vision processing internally via NxDI contrib\n"
+        "        else:"
+    )
+
+    if old not in content:
+        print("  ERROR: Cannot find insertion point for internvl_chat")
+        return False
+
+    if not dry_run:
+        backup_file(RUNNER_FILE)
+        content = content.replace(old, new)
+        write_file(RUNNER_FILE, content)
+
+    print("  Applied: Added internvl_chat to multimodal processor")
+    return True
+
+
+def apply_patch_6_model_cls(dry_run=False):
+    """Patch 6: Special-case InternVLChatModel in _get_neuron_model_cls()."""
+    print("\n--- Patch 6: Handle InternVLChatModel in _get_neuron_model_cls ---")
+
+    content = read_file(LOADER_FILE)
+
+    # Check if already applied: look for our special case comment in the function
+    func_start = content.find("def _get_neuron_model_cls(")
+    if func_start != -1 and "Special case: InternVLChatModel" in content[func_start:]:
+        print("  Already applied.")
+        return True
+
+    # The issue: InternVLChatModel doesn't follow *ForConditionalGeneration pattern.
+    # The parser splits on "For" and lowercases. "InternVLChatModel".split("For", 1)
+    # gives ["InternVLChatModel"] (no "For"), hitting the KeyError path.
+    # We need to add a special case BEFORE the split logic.
+
+    old = '    try:\n        if "For" in architecture:'
+    new = (
+        "    # Special case: InternVLChatModel does not follow *For* naming convention\n"
+        '    if architecture == "InternVLChatModel":\n'
+        "        import sys\n"
+        '        contrib_path = "/home/ubuntu/internvl3_contrib/src"\n'
+        "        if contrib_path not in sys.path:\n"
+        "            sys.path.insert(0, contrib_path)\n"
+        "        from modeling_internvl3 import NeuronInternVL3ForCausalLM\n"
+        "        return NeuronInternVL3ForCausalLM\n"
+        "\n"
+        '    try:\n        if "For" in architecture:'
+    )
+
+    if old not in content:
+        print("  ERROR: Cannot find insertion point in _get_neuron_model_cls")
+        return False
+
+    if not dry_run:
+        backup_file(LOADER_FILE)
+        content = content.replace(old, new)
+        write_file(LOADER_FILE, content)
+
+    print("  Applied: Added InternVLChatModel special case in _get_neuron_model_cls()")
+    return True
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Apply InternVL3 vLLM patches")
+    parser.add_argument(
+        "--dry-run", action="store_true", help="Show patches without applying"
+    )
+    parser.add_argument(
+        "--revert", action="store_true", help="Revert all patches from backups"
+    )
+    args = parser.parse_args()
+
+    if args.revert:
+        print("Reverting all patches...")
+        for f in [CONSTANTS_FILE, LOADER_FILE, RUNNER_FILE]:
+            revert_file(f)
+        print("\nDone. Reverted to original files.")
+        return
+
+    # Verify contrib files exist
+    if not os.path.exists(CONTRIB_DIR):
+        print(f"ERROR: Contrib directory not found: {CONTRIB_DIR}")
+        sys.exit(1)
+    for f in [
+        "modeling_internvl3.py",
+        "modeling_internvl3_text.py",
+        "modeling_internvl3_vision.py",
+    ]:
+        if not os.path.exists(os.path.join(CONTRIB_DIR, f)):
+            print(f"ERROR: Missing contrib file: {f}")
+            sys.exit(1)
+
+    print("=" * 60)
+    print("Applying InternVL3-8B vLLM patches")
+    print("=" * 60)
+    print(f"vLLM-neuron dir: {VLLM_NEURON_DIR}")
+    print(f"Contrib dir: {CONTRIB_DIR}")
+    if args.dry_run:
+        print("*** DRY RUN -- no files will be modified ***")
+
+    results = []
+    results.append(("Patch 1: constants.py", apply_patch_1_constants(args.dry_run)))
+    results.append(
+        (
+            "Patch 2: llm_config fallback",
+            apply_patch_2_llm_config_fallback(args.dry_run),
+        )
+    )
+    results.append(
+        ("Patch 3: NeuronInternVL3 class", apply_patch_3_internvl_class(args.dry_run))
+    )
+    results.append(
+        ("Patch 4: get_neuron_model", apply_patch_4_get_neuron_model(args.dry_run))
+    )
+    results.append(
+        (
+            "Patch 5: multimodal processor",
+            apply_patch_5_multimodal_processor(args.dry_run),
+        )
+    )
+    results.append(
+        ("Patch 6: _get_neuron_model_cls", apply_patch_6_model_cls(args.dry_run))
+    )
+
+    print("\n" + "=" * 60)
+    print("Summary:")
+    for name, success in results:
+        status = "OK" if success else "FAILED"
+        print(f"  [{status}] {name}")
+
+    failed = sum(1 for _, s in results if not s)
+    if failed:
+        print(f"\n{failed} patch(es) failed!")
+        sys.exit(1)
+    else:
+        print(f"\nAll {len(results)} patches applied successfully.")
+
+
+if __name__ == "__main__":
+    main()