The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at this https URL.
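A minimal sketch of the uniform weight-averaging step described in the abstract, written in PyTorch (this is not the paper's released code; `checkpoint_paths` and `make_model` are hypothetical placeholders, and the checkpoints are assumed to be raw state_dicts from fine-tuning one shared architecture):

```python
# Uniform "model soup" sketch: element-wise average of fine-tuned checkpoints.
import torch

def uniform_soup(checkpoint_paths, make_model):
    """Average the weights of several fine-tuned models into a single model."""
    soup_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")  # assumed to be a plain state_dict
        if soup_state is None:
            # Initialize the running sum with float copies of the first checkpoint.
            soup_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                soup_state[k] += v.float()
    # Divide the summed weights by the number of ingredient models.
    n = len(checkpoint_paths)
    soup_state = {k: v / n for k, v in soup_state.items()}
    model = make_model()  # placeholder constructor for the shared architecture
    model.load_state_dict(soup_state)
    return model  # one model, so no extra inference or memory cost vs. a single network
```

The paper also describes a greedy variant that adds a checkpoint to the soup only if it improves held-out validation accuracy; the sketch above is the simpler uniform average.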
Notes: This is a fine-tuned version of ViT-G, which required 3.4e21 FLOP to train per PCD/Akronomicon. Fine-tuning compute is likely minor in comparison: "Models are fine-tuned at a batch size of 512 for either 10,000 or 20,000 steps (approximately 4 or 8 epochs)... all models are fine-tuned at 518 × 518 resolution" At 20k steps, we have (518^2) * 512 * 20k ≈ 2.75e12 pixels seen in fine-tuning, compared to (224^2) * 32768 * 5M ≈ 8.22e15 in pre-training.
Notes: The pre-training figures (224 × 224 resolution, batch size 32768, ~5M steps) are from the original ViT-G paper.
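A quick sanity check of the pixel-count arithmetic in the first note (all values copied from above):

```python
# Back-of-the-envelope check of the fine-tuning vs. pre-training pixel counts.
fine_tune_pixels = 518**2 * 512 * 20_000         # 518x518 crops, batch 512, 20k steps
pre_train_pixels = 224**2 * 32_768 * 5_000_000   # 224x224 crops, batch 32768, ~5M steps

print(f"fine-tuning pixels:  {fine_tune_pixels:.2e}")   # ~2.75e12
print(f"pre-training pixels: {pre_train_pixels:.2e}")   # ~8.22e15
print(f"ratio: {pre_train_pixels / fine_tune_pixels:.0f}x")  # ~3000x more pixels in pre-training
```

By this measure, fine-tuning sees roughly 3000x fewer pixels than pre-training, consistent with the note's claim that fine-tuning compute is minor in comparison.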