Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4% zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
FLOPs: 3.46e+22
Notes: 6 FLOP/token/parameter * 5e9 parameters * (2.304e+12 / 2) tokens [see dataset size notes] = 3.456e+22 FLOP
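A minimal Python sketch of the estimate above, assuming the standard 6 * N * D training-FLOP approximation; the effective token count is taken from the dataset size notes below.

```python
# Rough training-compute estimate for EVA-02-CLIP-E/14+ using the common
# 6 * N * D approximation (6 FLOP per parameter per training token).
n_params = 5e9                      # total parameters (Table 1(a))
effective_tokens = 2.304e12 / 2     # image tokens, halved for 50% patch masking
train_flops = 6 * n_params * effective_tokens
print(f"{train_flops:.3e}")         # -> 3.456e+22
```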
Training Code Accessibility: https://huggingface.co/QuanSun/EVA-CLIP (MIT license). The code at https://github.com/baaivision/EVA/tree/master/EVA-CLIP appears to be inference code only.
Hardware: NVIDIA A100
Hardware Quantity: 144
Size Notes: From Table 1(a): 9B samples seen, image size 224^2, batch size 144k samples. 9e9 * (224/14)^2 = 2.304e+12 image tokens; 50% of patches are randomly masked, which is accounted for when estimating compute.
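A minimal sketch of the token count above, assuming 224^2 images split into 14x14 patches (256 patches per image) and halving for the 50% random patch masking.

```python
# Image-token count behind the dataset size notes.
samples_seen = 9e9
patches_per_image = (224 // 14) ** 2           # 16 * 16 = 256 patches
image_tokens = samples_seen * patches_per_image
effective_tokens = image_tokens * 0.5          # 50% of patches randomly masked
print(f"{image_tokens:.3e} {effective_tokens:.3e}")  # -> 2.304e+12 1.152e+12
```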
Parameters: 5000000000
Notes: 5B total (Table 1(a)); image parameters: 4.4B, text parameters: 695M