Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles to adoption, let alone adaptation. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-experts model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive with the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoption and adaptation of Aria in real-world applications.
Hugging Face | Paper | Blog | WebDemo | Discord
Aria is a multimodal native MoE model with 3.9B activated parameters per visual token and 3.5B per text token, delivering best-in-class performance across multimodal, language, and coding tasks.
[Jan 20, 2025] Aria is supported in PaddleMIX by the Paddle Team.
[Dec 15, 2024] We release Aria-Chat! It is optimized for open-ended and multi-round dialogs, with enhanced reliability and multi-lingual support.
[Dec 1, 2024] We release the base models for Aria (Aria-Base-8K and Aria-Base-64K)! They are fully compatible with this inference & fine-tuning codebase.
[Oct 10, 2024] We release Aria!
pip install -e .
# or install with dev dependencies if you want to contribute to the project
pip install -e .[dev]
# optional: grouped_gemm provides faster grouped-GEMM kernels for the MoE layers
pip install grouped_gemm
# flash-attn provides fast attention kernels
pip install flash-attn --no-build-isolation
Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU with bfloat16 precision.
Here is a code snippet to show you how to use Aria with Hugging Face Transformers.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
model_id_or_path = "rhymes-ai/Aria"
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
image = Image.open(requests.get(image_path, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
output_ids = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(output_ids, skip_special_tokens=True)
print(result)
We offer additional inference methods, such as utilizing vLLM for enhanced performance. For comprehensive details, please refer to docs/inference.md.
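For reference, here is a minimal, hedged sketch of the vLLM path. It assumes a vLLM build with Aria support and vLLM's standard multimodal prompt interface, and reuses the Hugging Face processor only to build the chat prompt; treat docs/inference.md as the authoritative recipe.

```python
import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_id = "rhymes-ai/Aria"

# Processor builds the chat prompt (including the image placeholder tokens).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png",
    stream=True,
).raw)

messages = [{"role": "user", "content": [
    {"text": None, "type": "image"},
    {"text": "what is the image?", "type": "text"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.9, max_tokens=500, stop=["<|im_end|>"]),
)
print(outputs[0].outputs[0].text)
```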
Check out these inference examples, which demonstrate how to use Aria in various applications such as chart understanding, PDF reading, and video understanding, available with both Hugging Face Transformers and vLLM backends.
ā ļø Important Note on Fine-tuning: Due to changes in the weight mapping after Aria's integration into transformers, the training code requires specific versions to work properly:
- Use transformers version 4.45.0
- Use model revision "4844f0b5ff678e768236889df5accbe4967ec845"
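The two pins above can be expressed with standard pip and Hugging Face `from_pretrained` arguments. The sketch below is illustrative only; the training configs in this repo may expose these knobs differently.

```python
# pip install transformers==4.45.0
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

REVISION = "4844f0b5ff678e768236889df5accbe4967ec845"  # pre-integration weight mapping
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    revision=REVISION,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "rhymes-ai/Aria", revision=REVISION, trust_remote_code=True
)
```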
Note: For optimal fine-tuning performance, install the optional grouped_gemm dependency:
pip install grouped_gemm
We offer both LoRA fine-tuning and full parameter tuning, supporting various dataset types.
For a quick try, visit the examples folder and choose one of the fine-tuning examples. If you would like to fine-tune from the base models (recommended when you have a large dataset), please change the following model paths in the configs (full or LoRA):
model_name_or_path: rhymes-ai/Aria
tokenizer_path: rhymes-ai/Aria
to the ones corresponding to one of the base models:
model_name_or_path: rhymes-ai/Aria-Base-64K # rhymes-ai/Aria-Base-8K
tokenizer_path: rhymes-ai/Aria-Base-64K # rhymes-ai/Aria-Base-8K
Please refer to custom_dataset.md for how to prepare your dataset.
After preparing your dataset, follow these steps to fine-tune Aria using LoRA:
Open recipes/config_lora.yaml, locate the dataset_mixer section, and update it with your dataset paths:
dataset_mixer:
  "path/to/dataset1": 1
  "path/to/dataset2": 0.5
  "path/to/dataset3": 2
Note on dataset mixing: Aria supports combining multiple datasets with different sampling rates. In the example above:
- dataset1 will be used entirely (weight 1)
- dataset2 will use 50% of its data (weight 0.5)
- dataset3 will be used twice (weight 2)
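For intuition only, here is an illustrative sketch of how such weights could be realized; mix_datasets is a hypothetical helper, not the repo's actual sampling code.

```python
import math
import random

def mix_datasets(datasets: dict, weights: dict, seed: int = 0) -> list:
    """Hypothetical mixer: weight w = floor(w) full repetitions plus a (w - floor(w)) random fraction."""
    rng = random.Random(seed)
    mixed = []
    for name, examples in datasets.items():
        w = float(weights.get(name, 1.0))
        mixed.extend(examples * int(math.floor(w)))                       # whole repetitions
        mixed.extend(rng.sample(examples, int(len(examples) * (w % 1))))  # fractional remainder
    rng.shuffle(mixed)
    return mixed

# Mirrors the dataset_mixer example above: 10 + 5 + 20 = 35 examples in total.
data = {name: list(range(10)) for name in ["dataset1", "dataset2", "dataset3"]}
print(len(mix_datasets(data, {"dataset1": 1, "dataset2": 0.5, "dataset3": 2})))
```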
Start fine-tuning:
python aria/train.py --config recipes/config_lora.yaml
To launch on multiple GPUs, use the accelerate library:
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes [number_of_gpus] aria/train.py --config recipes/config_lora.yaml
Choose a pre-configured accelerate setting from recipes/accelerate_configs/ and adjust the --num_processes argument to match your available GPUs.
To run inference with the fine-tuned model, see inference with LoRA support.
Everything is the same as the LoRA fine-tuning process, except for the configuration file recipes/config_full.yaml.
Full parameter tuning consumes more GPU memory, so multiple GPUs are required. The following command has been tested on 8 A100 (80GB) GPUs.
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_full.yaml
If you encounter out-of-memory errors, try reducing the per_device_train_batch_size in the config file. Adjust the gradient_accumulation_steps accordingly to maintain the effective training batch size.
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
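As a quick back-of-the-envelope check (assuming the 8 x A100 setup from the command above):

```python
# Effective batch = per-device batch size x gradient accumulation steps x number of GPUs.
num_gpus = 8
print(8 * 2 * num_gpus)  # 128 with the example values above

# If memory forces per_device_train_batch_size down to 4, doubling
# gradient_accumulation_steps to 4 keeps the same effective batch size.
print(4 * 4 * num_gpus)  # 128
```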
Memory consumption varies across datasets. Generally, more memory is required for multi-image and video datasets. Adjust the deepspeed_config parameters to optimize memory consumption, such as using zero_stage 3 and offloading parameters and optimizer to the CPU.
deepspeed_config:
  gradient_accumulation_steps: auto
  gradient_clipping: auto
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3
First, you need to extract the FP32 consolidated weights from ZeRO 1, 2, or 3 DeepSpeed checkpoints:
cd /path/to/your/output/dir
python zero_to_fp32.py . pytorch_model.bin
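As an illustrative follow-up (not the official recipe; inference.md below is authoritative), the consolidated weights can then be loaded back into the Aria architecture. The checkpoint path reuses the placeholder output directory from the commands above.

```python
import torch
from transformers import AutoModelForCausalLM

# Instantiate the Aria architecture, then load the consolidated FP32 state dict.
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria", torch_dtype=torch.bfloat16, trust_remote_code=True
)
state_dict = torch.load("/path/to/your/output/dir/pytorch_model.bin", map_location="cpu")
# strict=False tolerates any naming drift between checkpoint and model revisions.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
```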
See inference.md for instructions on how to perform inference with the fine-tuned model.
If you find our work helpful, please consider citing.
@article{aria,
title={Aria: An Open Multimodal Native Mixture-of-Experts Model},
author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
year={2024},
journal={arXiv preprint arXiv:2410.05993},
}
Notes:
- Training compute: 6 FLOP/parameter/token x 3.5 x 10^9 activated parameters x 6.8 x 10^12 tokens = 1.428 x 10^23 FLOP.
- Data: Aria is pre-trained on 6.4T language tokens and 400B multimodal tokens.
- Architecture: the Aria MoE activates 3.5B parameters per text token, with 24.9B parameters in total.