Recently, pre-trained transformer-based architectures have proven to be very effective at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging behind other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available. The Mega model was evaluated and showed success on different tasks, including synthetic news generation and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed the significant success of AraGPT2-mega in generating news articles that are difficult to distinguish from articles written by humans. We therefore develop and release an automatic discriminator model with 98% accuracy in detecting model-generated text. The models are also publicly available, in the hope of encouraging new research directions and applications for Arabic NLP.
Notes: source: https://github.com/lightonai/akronomicon/blob/10adaca9c74afa7d11f196947e410d248f25abe9/akrodb/American%20University%20of%20Beirut/AraGPT2-Mega.json
Akronomicon reports compute in units of petaflop/s-days; 20 petaflop/s-days ~= 2e21 FLOP. Our own validation of this estimate is below.
Hardware-based estimate for the Mega model: 9 days on a TPUv3-128, bfloat16 precision (from author communication).
A TPUv3-128 has 128 cores (this can be inferred from footnote 9 on p. 4 of the paper: 128 * 16 GB = 2 TB). TPUv3 has 2 cores per chip, so 64 chips.
TPUv3 peak throughput: 1.23E+14 FLOP/s per chip.
Utilization: default value of 30% for the language domain (https://epoch.ai/blog/estimating-training-compute).
64 chips * 30% * 1.23E+14 FLOP/s * 9 days * 24 h/day * 3600 s/h ~= 2e21 FLOP
Parameter-based estimate: number of examples (seq len = 1024): 9.7M; batch size: 256; number of steps: 800k.
6 FLOP/token/parameter * 1.46e9 parameters * 800,000 steps * 256 sequences per step * 1024 tokens per sequence = 1.8371052e+21 FLOP
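A minimal Python sketch reproducing both estimates above (all numbers are copied from the notes; the 30% utilization and the per-chip peak throughput are the stated assumptions, not measured values):

```python
# Sketch of the two training-compute estimates for AraGPT2-Mega (values from the notes above).

SECONDS_PER_DAY = 24 * 3600

# Hardware-based estimate: 9 days on a TPUv3-128 in bfloat16.
chips = 128 // 2                 # TPUv3-128 has 128 cores, 2 cores per chip -> 64 chips
peak_flops_per_chip = 1.23e14    # TPUv3 bfloat16 peak FLOP/s per chip
utilization = 0.30               # assumed default utilization for the language domain
training_days = 9

hardware_estimate = chips * utilization * peak_flops_per_chip * training_days * SECONDS_PER_DAY
print(f"Hardware-based estimate: {hardware_estimate:.2e} FLOP")   # ~1.8e21 FLOP

# Parameter-based estimate: 6 FLOP per token per parameter (6ND heuristic).
parameters = 1.46e9
steps = 800_000
batch_size = 256                 # sequences per step
seq_len = 1024                   # tokens per sequence

tokens_seen = steps * batch_size * seq_len
parameter_estimate = 6 * parameters * tokens_seen
print(f"6ND estimate: {parameter_estimate:.2e} FLOP")             # ~1.84e21 FLOP
```

Both routes land at roughly 2e21 FLOP, consistent with the Akronomicon figure of 20 petaflop/s-days.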
Size Notes: "The total dataset size is 77GB with 8.8B words [word count was done after preprocessing, where a white space is inserted before and after punctuations, brackets, numbers... which increased the total word count]"
Number of examples (seq len = 1024): 9.7M; batch size: 256; number of steps: 800k.
9.7e6 examples * 1024 tokens per sequence = 9,932,800,000 tokens (~9.9B).
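A quick check of the token count (example count and sequence length as stated above):

```python
# Dataset size check: 9.7M training examples of 1024 tokens each (values from the notes above).
examples = 9.7e6
seq_len = 1024

dataset_tokens = examples * seq_len
print(f"{dataset_tokens:,.0f} tokens (~{dataset_tokens / 1e9:.1f}B)")  # 9,932,800,000 tokens (~9.9B)
```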
Notes: source: https://lair.lighton.ai/akronomicon/ (archived: https://github.com/lightonai/akronomicon/tree/main/akrodb)