Palmyra Large was pre-trained primarily on English text; a trace amount of non-English data from CommonCrawl is still present in the training corpus. Like GPT-3, Palmyra Large is a decoder-only model, and it was pre-trained with a self-supervised causal language modeling (CLM) objective.
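For reference, the CLM objective is plain next-token prediction: cross-entropy between each position's logits and the following token. A minimal PyTorch-style sketch (names like `causal_lm_loss` are illustrative, not Writer's code):

```python
# Minimal sketch of a causal language modeling loss (next-token prediction).
# Assumes a decoder-only model that has already produced logits for a batch.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)."""
    # Predict token t+1 from positions <= t: drop the last logit,
    # drop the first label, then compare position-wise.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = token_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```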
Notes: "Palmyra-Large is a 20B parameters causal decoder-only model built by Writer and trained on +800B tokens of Palmyra-Index-Data enhanced with curated corpora." I'm not sure if the 800B is how many tokens the model was trained on, or the size of the dataset. But the dataset linked on HuggingFace has 1T tokens, so 800B as tokens trained is more likely. 20B*800B*6 = 9.6e22
Size Notes: 1 trillion tokens, or 750B words: https://huggingface.co/datasets/Writer/palmyra-data-index
Notes: Palmyra Large has 20B parameters. According to HELM, there is also a 43B-parameter version called Palmyra X.