Overview

Qwen2.5-VL 72B is the flagship 72-billion-parameter model in the Qwen2.5-VL family of multimodal large language models. It handles a wide range of vision-language tasks, including fine-grained visual understanding, multilingual OCR, and complex document and video analysis. Compared with its predecessors, Qwen2.5-VL adds dynamic-resolution processing and temporal video alignment, allowing it to summarize long-form videos and localize events with second-level granularity. It can also act as a visual agent for interactive tasks, such as operating a computer or mobile device from visual input and instructions. Finally, the model supports precise object grounding with bounding boxes and can emit structured outputs such as JSON, which makes it well suited to data extraction from tables, forms, and other complex layouts.

This tutorial demonstrates how to deploy and run Qwen2.5-VL-72B-Instruct on Argonne’s Polaris supercomputer, leveraging its GPU resources for efficient inference. The model supports various image formats and can handle complex visual understanding tasks with high accuracy.

Official Repository | Model on Hugging Face

Prerequisites

Before starting this tutorial, ensure you have:

  • ALCF Account: Active account with access to Polaris system
  • Project Allocation: Computing time allocation on a project (you’ll need the project name)
  • MFA Setup: CRYPTOCard or MobilePASS+ token configured for authentication
  • Basic Knowledge: Familiarity with SSH, Linux command line, and Python virtual environments
  • Python Experience: Understanding of deep learning concepts and PyTorch/Transformers

Setup Environment

Step 1: Set Environment Variables

# Replace <YOUR_PROJECT> with your actual ALCF project name
PROJECT_NAME=<YOUR_PROJECT>
export UV_CACHE_DIR="/eagle/$PROJECT_NAME/cache/uv"
export PIP_CACHE_DIR="/eagle/$PROJECT_NAME/cache/pip"
export HF_HOME="/eagle/$PROJECT_NAME/cache/huggingface"
export VLLM_CACHE_ROOT="/eagle/$PROJECT_NAME/cache/vllm"

# Create cache directories if they don't exist
mkdir -p $UV_CACHE_DIR $PIP_CACHE_DIR $HF_HOME $VLLM_CACHE_ROOT

Step 2: Load Modules and Setup Python Environment

# Load required modules
module use /soft/modulefiles
module load conda
conda activate base

# Create the virtual environment if it doesn't exist
if [ ! -d "$UV_CACHE_DIR/qwen_venv" ]; then
    python -m venv $UV_CACHE_DIR/qwen_venv
fi

# Activate the virtual environment
source $UV_CACHE_DIR/qwen_venv/bin/activate

# Install required packages
pip install uv
uv pip install torch torchvision transformers accelerate qwen-vl-utils
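
After installation, a quick sanity check confirms that PyTorch and Transformers are importable and that the Qwen2.5-VL model class is available. This is a minimal sketch, assuming the qwen_venv virtual environment is still active; on a login node without GPUs, the CUDA checks may report zero devices.

# Sanity check for the freshly installed environment
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())

# Qwen2.5-VL support requires a recent transformers release;
# this import fails if the installed version is too old.
from transformers import Qwen2_5_VLForConditionalGeneration  # noqa: F401
print("Qwen2.5-VL model class is available")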

Prepare Demo Image

First, obtain a demo image. Due to network restrictions on Polaris, download it on your local machine and then transfer it to your working directory:

# On your local machine:
curl https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg -o demo.jpeg

# Transfer to Polaris:
scp demo.jpeg <username>@polaris.alcf.anl.gov:/path/to/your/working/directory/
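
Once the file is on Polaris, you can verify that it transferred intact with a short Python check. This is a minimal sketch; it assumes Pillow is present, which it should be as a dependency of torchvision.

# Verify the demo image opens and report its basic properties
from PIL import Image

with Image.open("demo.jpeg") as img:
    print("Format:", img.format)      # expected: JPEG
    print("Size (W x H):", img.size)  # pixel dimensions
    print("Mode:", img.mode)          # e.g., RGB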

Resolve Library Conflicts

The default LD_LIBRARY_PATH on Polaris can cause cuDNN compatibility issues, so clear it before running the model:

unset LD_LIBRARY_PATH
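
With the variable cleared, you can confirm that PyTorch resolves its bundled cuDNN rather than a system copy. This is a minimal check; the exact version number will vary with the installed PyTorch wheel.

# Confirm cuDNN is usable from the pip-installed PyTorch
import torch

print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())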

Implementation Code

Create a Python script with the following code, adapted from the official repository:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Example Output

When you run the script successfully, you should see output similar to:

Fetching 1 files: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 3253.92it/s]
['The image depicts a heartwarming scene on a sandy beach during what appears to be either sunrise or sunset, given the soft, warm lighting. A woman is sitting cross-legged on the sand, wearing a plaid shirt and jeans, with her hair flowing freely. She is smiling and engaging in a playful interaction with a light-colored dog, possibly a Labrador Retriever. The dog is sitting upright, wearing a harness, and is extending its paw towards the woman\'s hand as if they are playing a game of "paw shake." The ocean waves can be seen in the background, adding to the serene and joyful atmosphere of the moment']
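
Beyond simple description, the overview notes that the model supports object grounding and structured JSON output. As a sketch of how to exercise that, you can reuse the same script and change only the text portion of the message; the prompt wording below is an illustrative assumption, not an official grounding prompt, and the exact JSON schema in the response may vary.

# Variation: ask for grounded objects as JSON (illustrative prompt wording)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./demo.jpeg"},
            {
                "type": "text",
                "text": "Locate the person and the dog in this image and "
                        "return a JSON list where each entry has a 'label' "
                        "and a 'bbox_2d' in pixel coordinates.",
            },
        ],
    }
]
# Then repeat the preparation, generation, and decoding steps above
# with these messages.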

Additional Resources