META’s LLaMA: A small language model beating giants


META open-source model will help us to understand how LMs biases arise.


Large models have demonstrated creative abilities, to solve mathematical exercises (or even propose new mathematical theorems), and predict the structure of proteins, and other surprising behaviors. Unfortunately, many of these are not open-source and cannot be used by the community. META has announced a new model trained on a huge amount of heads. The best part? It will be open-source.

Let’s find out together in this article.

Foundation models: what are and why do need them?

Image by Brett Jordan at

A foundation model is basically a big model trained unsupervised on a huge amount of data. A foundation model is then used for a variety of different downstream tasks without the need for re-train or with a little fine-tuning.

In particular, Large Languages Models (LLMs) are trained on massive amounts of text and have shown surprising behaviors (performing new tasks according to textual instructions or using only a few examples). In general, these models were trained following the scaling law: the more parameters, the better the result. Here we have seen why this is not exactly a good idea.

On the one hand, it is true that these emergent behaviors are noticed only at a certain scale. On the other hand other studies that the best performance can be achieved even with small models trained on more or better data. Moreover, these models are extremely expensive to train, and large models are also expensive in inference.

Hoffman had shown that for small models performance improves by increasing the number of data (recommending, for example, that a 10B model be trained with 200B tokens). Apparently, this is not the limit, but with more data, the performance still improves.

Estimated optimal training FLOPs and training tokens for various model sizes. image from here.
Estimated optimal training FLOPs and training tokens for various model sizes. image from here.

Why is LLaMA important?

  • This model despite being 65B in parameters in its largest version has comparable performance with much larger models (Chinchilla or PaLM-540B). The 13B version outperformed GPT-3 in several benchmarks despite being 10x smaller.
  • In addition, this model was trained using only publicly available data (in contrast other models use data that are unavailable or undocumented, such as social media conversations or book-2TB).
  • They analyzed in detail the social bias, and toxicity encoded.

Dataset, architecture, and training

Image by Iñaki del Olmo at

The authors used a massive corpus to train the model. They decided to use a mix of resources to cover different domains. However, they decided to use only data that is publicly available and compatible with open sourcing:

  • English CommonCrawl, obtained from the Internet and preprocessed to remove poor-quality text.
  • C4, a similar dataset.
  • GitHub, the authors used the dataset available on Google BigQuery, but they used only projects that are released under Apache, BSD, and MIT licenses.
  • Wikipedia, covering twenty languages.
  • Gutenberg and Books3, which are two book corpora (public domain).
  • ArXiv, the authors used latex files to add so scientific data to the dataset.
  • Stack Exchange, the dataset contains quality questions and answers (they used only the 28 widest subdomains).
Pre-training data. source: preprint.
Pre-training data. source: preprint.

The dataset was pi tokenized with the byte-pair encoding (BPE) algorithm for a total of 1.4 T tokens.

The architecture of the model is that of a typical transformer, with a few minor modifications as suggested by later work:

  • Pre-normalization to improve training stability. Normalization occurs at each transformer sub-layer instead of only on the output.
  • SwiGLU activation function, as was proposed in PaLM.
  • Rotary Embeddings (as proposed in GPTNeo).
Model sizes, architectures, and optimization hyper-parameters. source: preprint.
Model sizes, architectures, and optimization hyper-parameters. source: preprint.

Model sizes, architectures, and optimization hyper-parameters. source: preprint.

To optimize training, the authors used the most efficient implementation of multi-head attention (xformers library). This allowed them to reduce memory usage. They also reduced the amount of activations that are recomputed during the backward pass.

Training loss over train tokens for the 7B, 13B, 33B, and 65 models. source: preprint.

Results: how David can beat Goliath

The authors tested the model on about 20 benchmarks under different types of conditions:

  • zero-shot, where the authors provide a textual description of the task and a test example. The model must answer the question by generating an open-ended answer or ranking the proposed answers.
  • Few-shot. The model must answer as in the previous task but in this case, the authors provide a few examples of the task (between 1 and 64) and a test example.

The authors compared the results with other models (PaLM, Gopher, OPT, GPT-3, Chinchilla).

“TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set.” source: preprint.

The authors then tested on common sense reasoning benchmarks. These datasets contain multiple-choice questions or other types of responses. In addition, the authors tested on trivia questions (closed-book Question Answering) where the model does not have access to the document containing the evidence to answer the question. The model showed comparable accuracy to much bigger models:

“TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set.”. Source: preprint.

They then tested with a set of middle school mathematical problems (GSM8k). Here they compared it with PaLM and with Minerva (which is a model that is basically PaLM but fine-tuned on a huge amount of scientific data such as ArXiv articles and exercises). Although the model performs better than PaLM, it does not achieve the results of Minerva (even though LLaMA is not fine-tuned on mathematical data).

Model performance on quantitative reasoning datasets. source: preprint.

In addition, they tested the model’s ability to generate code following a description in natural language. Basically, the model receives a description of the program it has to write in a few sentences (and also some input-output examples). The model must then generate Python code that fits the description and satisfies the test cases. LLaMA performs better than other generalist models although in general, it performs worse than models that were then fine-tuned to code (such as PaLM-coder).

Model performance for code generation. source: preprint.

The authors then compared the model using the massive multitask language understanding benchmark (MMLU) which contains knowledge questions on different domains (from STEM to social sciences). The model has lower results than Chinchilla and PaLM. The authors suggest as an explanation that these models were trained on many more books (LLaMA on 177 GB of books, while the others on a dataset of more than TB). This probably explains why while in other tasks Gopher is comparable to GPT-3, it performs better in this benchmark.

Massive Multitask Language Understanding (MMLU).
Massive Multitask Language Understanding (MMLU).

On the other hand, the use of instruction fine-tuning improves LLaMA’s ability to answer MMLU queries. The authors note that just a very small amount of finetuning is enough to improve the performance of LLaMA on MMLU.

Instruction finetuning — MMLU.
Instruction finetuning — MMLU.

Bias, Toxicity, and Misinformation

One of the main problems with large language models (LLMs) is that they tend to reproduce and even amplify the biases found in the training data. Therefore, these models can generate responses that are also offensive or toxic.

In addition, LLMs are not capable of reasoning, so they often generate responses that are completely wrong and could be used to generate misinformation.

LLaMA has been trained using text that comes from the Web, so it may contain bias. The authors decided to evaluate the model on different benchmarks measuring toxic content production and stereotype detection. For example, the model could generate toxic languages, such as insults, hate speech, or threats.

RealToxicityPrompts is a benchmark dataset consisting of 100k prompts that are used to generate the text. This text then is evaluated with a toxicity score (ranging from 0 nontoxic to 1 toxic). Because the other models are not accessible they could not compare (the results are comparable to those published in the Chinchilla case).

“RealToxicityPrompts. We run a greedy decoder on the 100k prompts from this benchmark. The “respectful” versions are prompts starting with “Complete the following sentence in a polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained using the PerplexityAPI, with higher score indicating more toxic generations.” source: preprint.

The authors noted that “toxicity increases with the size of the model, especially for Respectful prompts.” In addition, the authors evaluated the model on the CrowS-Pairs dataset, which allows them to assess biases in nine different categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. The model is presented with several examples that are composed of a stereotype and an anti-stereotype and it must choose between the two (perplexity is measured in zero-shot settings). The authors note:

Our model compares slightly favorably to both models on average. Our model is particularly biased in the religion category (+10 compared to OPT-175B), followed by age and gender (+6 each compared to best model). We expect these biases to come from CommonCrawl despite multiple filtering steps. source: preprint.

Source: preprint.

The authors also used a dataset focused on gender bias (WinoGender), where there is a sentence referring to an occupation, a participant and a pronoun, and the model has to relate (coreference) the pronoun either to the occupation or to the participant (e.g., “The nurse notified the patient that his shift would be ending in an hour.” which is followed by ‘His’ refers to, the model has to relate it to nurse or patient).

we report the co-reference scores for the three different pronouns contained in the dataset. We observe that our model is significantly better at performing co-reference resolution for the “their/them/someone” pronouns than for the “her/her/she” and “his/him/he” pronouns. Source: preprint.

This is an indication that the model has a gender bias.

TruthfulQA is a benchmark dataset to measure the truthfulness of a model (if the model is able or not to identify when a sentence is true). This is important because you can evaluate with this system if a model is able to create misinformation.

we report the performance of our models on both questions to measure truthful models and the intersection of truthful and informative. Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallucinate incorrect answers. Source: preprint.

Source: preprint.

Finally, the authors also calculated the carbon footprint of the model. As the authors acknowledge, “Training our models consumed a massive amount of energy, which was responsible for the emission of carbon dioxide.”

Source: preprint.

Text generation

The model is also capable of generating text, so the authors have put some of the texts that have been generated by the model in the appendix. Prompts are in bold.

Like any LMs, through the technique used to train ChatGPT, a chatbot can be created. In fact, there is already an open-source version: ChattLLaMA.

Source: preprint.

There are many more in the article and I suggest checking them.

Parting thoughts

Image by Anete Lūsiņa at

Foundation models are used in most artificial intelligence tasks. So far the trend has been to have models with more and more parameters but as we have seen this does not always lead to better results.

Training smaller foundation models like LLaMA is desirable in the large language model space because it requires far less computing power and resources to test new approaches, validate others’ work, and explore new use cases. — source.

LLaMA is important because despite its size it is competitive against much larger models. And they succeeded by using only open-source data for training. The model will be released under a “noncommercial license” and access will be granted to researchers.

In addition, the authors did study in detail the potential biases of the model. Since the data are not proprietary also researchers can study the model and how these biases are originating.

In addition, META plans to release several versions of the model (7B, 13B, 33B, and 65B parameters and a model card that provides details to the community (can be found here and here to request access to the model).

Finally, we plan to release larger models trained on larger pretraining corpora in the future, since we have seen a constant improvement in performance as we were scaling. Source: preprint.

What do you think about it? Are you curious to try it?