Model Details

Domain:

Task:

Visual question answering

Model Access:

Open weights (non-commercial)

Introduction

VILA is a visual language model (VLM) pretrained on interleaved image-text data at scale, enabling multi-image VLM capabilities. VILA can be deployed on edge devices, including Jetson Orin and laptops, using AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs alone are not enough, and interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial for boosting both VLM and text-only performance. VILA exhibits appealing capabilities, including multi-image reasoning, in-context learning, visual chain-of-thought, and better world knowledge.

Training

Training Code Accessibility

cc-by-nc-4.0: https://huggingface.co/Efficient-Large-Model/VILA1.5-40b
Apache 2.0: https://github.com/NVLabs/VILA

Parameters

40,000,000,000 (40B)
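
To put the parameter count and the 4-bit quantization mentioned in the Introduction side by side, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes every parameter is stored at the stated bit width, which is a simplification: AWQ also stores per-group scales and zero points, and activations and the KV cache are not counted, so real footprints will be somewhat larger. The helper name weight_memory_gb is hypothetical and not part of VILA or TinyChat.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count as listed on this page.
NUM_PARAMS = 40_000_000_000

# Assumption: all parameters are stored at a uniform bit width.
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{int4_gb:.0f} GB")
```

Under these assumptions, 4-bit storage cuts the weight footprint to a quarter of the FP16 figure, which is the basic reason quantization matters for edge deployment.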

Related Models

Cosmos-Predict2.5 2B

NVIDIA-Nemotron-Nano-12B-v2

NVIDIA-Nemotron-Nano-9B-v2

Canary 1B v2
