The image was generated by one of the foundation models (ruDALL-E XL, Malevich) for the prompt “AI Model”.

Foundation Models

Grigory Sapunov
13 min read · Nov 22, 2021


In August 2021, Stanford announced the establishment of the Center for Research on Foundation Models (CRFM) as part of the Stanford Institute for Human-Centered AI (HAI). The center then published a large 200+ page report called “On the Opportunities and Risks of Foundation Models”, in which the authors coined the term “foundation models”, investigated their capabilities, applications, underlying technology, and related societal impact, and set the stage for further research.

So, what are foundation models, and why are they important?

What is a foundation model?

First of all, foundation models are not something new. They’ve been with us for several years. These are models like BERT, RoBERTa, T5, BART, GPT-3, CLIP, DALL·E, Codex, and so on. It looks like rebranding, similar to how deep learning became the new brand for good old neural networks. Yet it’s important because foundation models exhibit interesting behavior and represent a paradigm shift.

The authors define a foundation model as any model that is trained on broad data at scale and can be adapted (fine-tuned or using in-context learning) to a wide range of downstream tasks and applications.

A foundation model can centralize the information from data of various modalities. This one model can then be adapted to a wide range of downstream tasks. Image from the original paper.

From the technological point of view, these models are large neural networks trained using self-supervised learning.

Neither of these elements is new; what’s really innovative is the scale at which these models are created. Current foundation models have hundreds of billions or even trillions of parameters, and they are trained on hundreds of gigabytes of data.

The number of parameters and dataset sizes for some large language models. The table is from the paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜”.

These models can process or generate text, images, code, and potentially any other type of data. A foundation model can also be multi-modal, meaning it works with several modalities (content types) at once; e.g., the CLIP and DALL·E models from OpenAI work with both images and text.

What is a foundation model from the technological point of view?

Usually, these networks use the transformer architecture, but not necessarily, as it’s just the current state of the art, and other architectures, existing or future ones, could be used as well.

The transformer architecture has a couple of interesting properties that make it suitable to be a backbone of foundation models:

  1. Transformer models are easily parallelizable for both training and inference. In contrast to the previous wave of state-of-the-art NLP models, recurrent neural networks (typically in the form of LSTMs or GRUs), which rely on sequential computation (to calculate something for the current token of text, you first need to calculate the same function for all the previous tokens), the computations inside a transformer do not depend on the computations for preceding tokens. In each layer of the transformer, computations happen in parallel for all the input tokens.

    * You still have a dependency on previous layers of the neural network, but that’s a separate story, as almost every modern network architecture has this property. There is an interesting direction of research focused on decoupling computations along several axes, and successes there could change the status quo.

    ** There are still some places where transformers become sequential, especially in the case of autoregressive generation, or in modifications of the original transformer architecture that incorporate recurrence in some way.

    *** This parallelization has its own limits, because you need a lot of memory for storing all the intermediate results, and in the original transformer architecture the computation itself scales quadratically with the number of input elements (a minimal sketch of this computation follows the list). In practice, you’re limited by a maximum input size (usually called the ‘attention span’), typically in the range of 512 (for BERT) to 2048 (for GPT-3) tokens. Recurrent neural networks don’t have such a limitation, yet they have other problems (exploding/vanishing gradients) limiting their applicability to really long sequences. There is a separate research direction targeting efficient transformers and transformers for long sequences, and further successes here will make transformers even more appealing.
  2. The transformer architecture has fewer inductive (implicit) biases than, say, convolutional (CNN) or recurrent (RNN) neural networks. Inductive biases are design decisions that assume some property of the input data, say, the locality of features in CNNs. On the one hand, fewer inductive biases make training harder (it requires more training examples to learn something useful); on the other hand, they make the model more universal. This means you just give it more data, and the model will learn the proper biases itself, and maybe these learned biases will be better and more suitable for the problem at hand than any hand-designed ones. This story resembles the transition from classical machine learning to deep learning: the learned features turned out to be better than the hand-designed ones, even in well-established fields like computer vision (remember “the ImageNet moment”). It’s much easier to approach new problems this way if you have the data. You don’t have to spend time designing the proper architecture; you can just jump in and iterate. But keep this trade-off in mind: inductive biases vs. data.
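
To make the parallelism and the quadratic cost from the list above concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation of the transformer. The toy dimensions and random weight matrices are purely illustrative, not taken from any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token representations.
    All positions are processed with a few matrix multiplications,
    i.e. in parallel: there is no loop over tokens.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # each (seq_len, d_head)

    # The attention matrix is (seq_len, seq_len): this is where the
    # quadratic memory and compute cost in the sequence length comes from.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                              # (seq_len, d_head)

# Toy sizes: 8 tokens, model width 16, head width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)       # (8, 8)
```

The efficient-transformer research mentioned above mostly targets exactly that seq_len × seq_len attention matrix.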

Regarding non-transformer architectures, one example of an early foundation model is ELMo, a kind of BERT predecessor built with LSTM neural networks. There is also the ResNet-based (a kind of CNN) SimCLR model for images. Some early image-processing CNNs trained on ImageNet, like VGG, could be treated as early foundation models in the same way, yet they were usually trained with a supervised objective, which creates an important practical distinction from the current wave of self-supervised models.

Classical supervised training assumes you have some data labeled with the training signal (the supervision), be it a class label, pixel coordinates for a region on a picture, or the spatial structure of a protein. Typically, these are manually annotated data or the data obtained in an experiment, making the process expensive and poorly scalable.

In self-supervised training, the training signal is derived automatically from unannotated data. This could be a word masked inside a sentence (where the model tries to recover this word during training, as BERT does), just the next character or word in a sentence (GPT-like training), the correspondence between an image and its caption (the model learns a similarity metric to match corresponding images and captions, as CLIP does), or the correspondence between parts or different transformations of the same image (as SimCLR does), and so on.
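
As a concrete illustration of how the training signal can be derived from the raw data itself, here is a minimal sketch of BERT-style token masking. The whitespace “tokenizer” and the 15% masking rate are simplifications of the original BERT recipe, kept only to show the idea.

```python
import random

MASK_TOKEN = "[MASK]"

def make_masked_lm_example(text, mask_prob=0.15, seed=None):
    """Turn raw text into (inputs, targets) for masked language modeling.

    The targets are produced automatically from the text itself;
    no human annotation is involved, which is the essence of
    self-supervised training.
    """
    rng = random.Random(seed)
    tokens = text.split()               # toy whitespace "tokenizer"
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)   # hide the token from the model...
            targets.append(token)       # ...and ask it to recover it
        else:
            inputs.append(token)
            targets.append(None)        # no loss on unmasked positions
    return inputs, targets

inputs, targets = make_masked_lm_example(
    "foundation models are trained on broad data at scale")
print(inputs)
print(targets)
```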

Self-supervised training has at least two benefits:

  1. It scales much better than supervised training, as it’s much easier to obtain unlabeled data than labeled data.
  2. It potentially learns more expressive representations, as the raw training signal is richer than the more limited label space of supervised data.

The combination of a compute-efficient architecture with a scalable training objective plus powerful hardware gives us the ability to scale models to an unprecedented level.

Why is it called the “foundation model”?

Existing terms like large models, pretrained models, or self-supervised models focus only on the technical dimension but fail to highlight the paradigm shift in accessibility beyond the world of machine learning experts.

The authors also considered terms such as general-purpose model and multi-purpose model, but these failed to capture their unfinished character and the need for adaptation.

So, the authors settled on the term “foundation models” (not “foundational”) to identify this emerging paradigm. They say:

In particular, the word “foundation” specifies the role these models play: a foundation model is itself incomplete but serves as the common basis from which many task-specific models are built via adaptation. We also chose the term “foundation” to connote the significance of architectural stability, safety, and security: poorly-constructed foundations are a recipe for disaster and well-executed foundations are a reliable bedrock for future applications. At present, we emphasize that we do not fully understand the nature or quality of the foundation that foundation models provide; we cannot characterize whether the foundation is trustworthy or not.

Be careful: it’s not about the foundation of intelligence or anything like that. It’s not the main or the only path to AGI (it might not be a path there at all). And it does not mean that all the other approaches of the machine learning, deep learning, and wider AI communities become irrelevant.

Why is it important?

The authors summarize the significance of foundation models with two words: emergence and homogenization.

Emergence means that the model demonstrates behavior that was not explicitly constructed but implicitly induced. One such example is the emergence of in-context learning in the GPT-3 model. The model can be adapted to a downstream task simply by providing it with a natural language description of the task, called the prompt. The model was neither trained for such behavior, nor was it anticipated to arise.
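
Here is a minimal sketch of what such in-context learning looks like in practice: the task (sentiment classification) is described entirely inside the prompt, and the model is expected to continue the pattern. GPT-3 itself is only available through a paid API, so the snippet uses the freely available GPT-2 via the Hugging Face text-generation pipeline as a stand-in; the reviews are made up, and the effect is much weaker in a model this small.

```python
from transformers import pipeline

# GPT-2 is only a small, freely available stand-in here; in-context
# learning really shines in much larger models such as GPT-3.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n\n"
    "Review: The plot was dull and the acting was worse.\n"
    "Sentiment: Negative\n\n"
    "Review: A delightful film, I smiled the whole way through.\n"
    "Sentiment: Positive\n\n"
    "Review: Two hours of my life I will never get back.\n"
    "Sentiment:"
)

# No fine-tuning happens: the few solved examples inside the prompt are
# the only "training" the model gets for this task.
completion = generator(prompt, max_new_tokens=3, do_sample=False)
print(completion[0]["generated_text"][len(prompt):])
```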

Looking at the history of machine learning, the emergence of high-level features from raw input inside a trained deep learning network (say, an image-processing model working with pixel-level input learns a hierarchy of increasingly complex features: edges, body parts, and so on) was very similar. Even the classical machine learning approach to solving tasks can be viewed from this perspective: how a task is performed emerges from the training examples, meaning it is inferred automatically rather than constructed explicitly as an algorithm for solving the task.

It’s essentially the same thing as systems theory and the complexity science study.

Homogenization means the consolidation of methodologies for building machine learning systems across a wide range of applications.

Classical machine learning homogenized learning algorithms: the same learning algorithm (say, logistic regression, SVM, or a decision tree) could power a wide range of applications based only on data, i.e., the training examples. Then deep learning homogenized neural network architectures (multi-layer perceptrons, convolutional neural networks, recurrent neural networks, self-attention neural networks aka transformers) in the same way: rather than having a bespoke feature-engineering pipeline for each application, the same architecture could be used for many applications. Now the trend continues, and foundation models homogenize the model itself: one model can be used for many different tasks through adaptation.

This has led to an unprecedented level of homogenization. A huge number of state-of-the-art models in NLP are now adapted from one of the foundation models like BERT, T5, BART, etc. In most practical cases, solutions are no longer built from scratch; instead, they rely on such models. Say, for training a custom text classifier, one of the easiest ways is to start from a BERT-like model and fine-tune it with a rather small amount of data (not comparable to the full training of such a model), as in the sketch below; or don’t fine-tune a model at all, but adapt a GPT-3-like model using in-context learning by designing a proper prompt.
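
As a hedged sketch of what “start from a BERT-like model and fine-tune it” typically looks like with the Hugging Face Transformers and Datasets libraries: the dataset (IMDB reviews), the checkpoint, the subset sizes, and the hyperparameters are placeholder choices for illustration, not a recipe from the report.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A small public sentiment dataset stands in for "your" labeled data.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# A couple of epochs over a few thousand examples is often enough,
# in contrast to the weeks of compute needed to pre-train the
# foundation model itself.
args = TrainingArguments(output_dir="text-classifier",
                         num_train_epochs=2,
                         per_device_train_batch_size=16)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=0).select(range(5000)),
    eval_dataset=tokenized["test"].select(range(1000)),
)
trainer.train()
```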

Image from the original paper, describing how the story of AI has been one of increasing emergence and homogenization.

Homogenization provides strong leverage for many tasks because any improvement in a foundation model can lead to immediate benefits (say, in the case of GPT-3, where no fine-tuning is involved and you modify model behavior by providing it with prompts at its input) or at least quick ones (when you need to fine-tune the updated model, which requires some training, yet orders of magnitude less than training the full model from scratch) across all the solutions and derived models.

Such homogenization also creates a single point of failure, meaning all the systems built upon a single foundation model inherit the same problematic biases and might be susceptible to a common attack vector if such a flaw is discovered.

Homogenization happens across different levels:

  1. One model can be used for solving different tasks inside a single domain. Say, BERT can be used for many different text-classification tasks in NLP: spam detection, sentiment and emotion analysis, content moderation, classifying email threads by topic, and so on. This makes it easier to develop new products at scale.
  2. Similar modeling approaches are applied to different domains (say, NLP, computer vision, speech, molecular design, and so on), so different research communities now share a unified set of tools for developing foundation models across a wide range of modalities. This speeds up scientific research and encourages cross-pollination, incorporating good findings from one discipline into a common toolset that helps all the others.
  3. The models themselves are homogenized in the form of multi-modal models. Moving forward, one possibility is that models that fuse all the relevant information about a domain (text, image, and speech data, etc.) will perform better overall and be able to solve more complex tasks involving several modalities.

The paradigm shift

Foundation models demonstrate a paradigm shift that will shape future developments in the AI field.

Just a few years ago, for most NLP tasks you had to gather a decent training dataset and then train a model for solving the task from scratch. In such a situation, you typically could not train a large network, as it’s a very data-hungry process. For many tasks it was almost impossible to train a large transformer model because you didn’t have enough data. And even if you managed to collect a large dataset, there was a computing-resource problem, because a large enough transformer requires a huge computational budget.

Training a BERT-Large-size model (330M parameters, a rather small model by today's standards) required a cluster of 64 TPUs running for 4 days. At today's TPU v2 prices, that costs something like $4–5k (back in 2018 it cost roughly 1.5x more). That doesn’t seem like a big deal for an established company, yet it prevents many enthusiasts from doing this kind of research (not to mention modern billion-parameter models). It’s hardly possible to get a good enough model on the first try, so you have to run different experiments, try different hyper-parameters, and so on, easily making this number 10x+ higher.

Finally, you have to have strong engineering and ML/DL expertise to make it happen, which limits you if you are not Google. There’s still a lack of ML engineers, and not every company (especially small ones) can afford to assemble such a strong engineering team.

The situation has changed completely. Large models pre-trained on huge quantities of text have been made publicly available. Models like BERT are easily accessible, and modern libraries like Hugging Face Transformers have lowered the barriers even further.
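
To illustrate how low the barrier has become, the snippet below pulls a pre-trained BERT and queries its masked-word head in a handful of lines; this is just a toy probe of the pre-trained model, not a downstream application.

```python
from transformers import pipeline

# The pre-trained weights are downloaded on first use; no training involved.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills in the blank using only what it learned during pre-training.
for prediction in fill_mask("Foundation models are trained on [MASK] data."):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')
```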

Now, for solving a task, you still have to collect a training dataset, but when fine-tuning a pretrained model like BERT, the dataset can be orders of magnitude smaller than what training a model from scratch would require; say, tens of thousands of examples can be sufficient. Computational budgets can be smaller as well, not only because you have less data, but also because you start from a model with good representations, and during fine-tuning you only adapt these representations to your task, which speeds things up. You still need some ML expertise, but the process is pretty standard, with a lot of tutorials and very friendly libraries. As a result, you can have a really good model having spent only tens to hundreds of dollars on training it, and you get rid of significant expenses on collecting a dataset and assembling a strong team. That significantly changes the financial situation.

With cloud APIs it’s even cheaper and simpler, as a big part of engineering complexity is outsourced to the cloud.

With the latest models like GPT-3 that support in-context learning, you don't even have to collect a dataset or train anything: you adapt the model by feeding it text prompts that “convince” the model to perform the task you need. Most of the time is now spent designing the prompts, and for this task you don’t need ML/DL expertise. Together with cloud-based APIs, you can create a working solution without knowing anything about developing and deploying DL models. You still need to integrate this solution into the systems where you want to use it (and that can be a pain on its own), and you still need to manage all these configured models (which can quickly become a mess if you approach it the wrong way), but (a minute of advertisement) these are exactly the problems Intento solves.

The next wave of AI democratization is happening right now.

The Ecosystem

Foundation models do not stand alone, and we must consider the full ecosystem that these models inhabit, not only the training and adaptation steps.

We can think of it in terms of a sequence of stages, from data creation to deployment, where people occupy both ends of the spectrum. So, many questions related to social impact naturally arise.

Foundation models are part of a broader ecosystem that stretches from data creation to deployment. Image from the original paper.

All data has an owner and is created with a purpose that might not include training a foundation model. Curation introduces its own biases and is an important step on its own.

The ecosystem view emphasizes that many of the impacts come from decisions made in pipeline stages other than foundation model training/adaptation. Each stage can potentially be performed by a different organization or entity, which makes the situation more complex.

The Future

We are still in the early days of foundation models, and there are many open questions and potential risks (it’s hard to describe them briefly; they deserve a separate post, and that’s exactly a reason to read the original paper). Yet the opportunities are very promising: the economics of AI solutions has already changed over the last few years, and it will continue to evolve in the future.

What can we say for sure?

This particular wave started in the NLP field, and it will spread to all the other fields and cover all the other content types: images, sound, videos, genomic and protein sequences, sensor data, and so on.

Multi-modality is an important factor and we’ll see the emergence of multi-modal models with better performance than the single-modal solutions.

There will probably be a gap between a small number of well-resourced companies that can train such models and all the others who cannot afford to train such solutions on their own and instead rely on published or productized solutions from the former.

There will hardly be a single company owning all the models. For GPT-3 alone, we already have several options, from the original OpenAI model to AI21 Labs’ Jurassic-1 models, EleutherAI’s GPT-Neo and GPT-J models, and non-English models like the Russian ruGPT-3 by Sber, the Korean HyperCLOVA by Naver, the Chinese CPM-1/CPM-2 models by Tsinghua University, PanGu-α by Huawei, and Wu Dao 2.0 by the Beijing Academy of Artificial Intelligence, and that’s just the beginning.

The line between AI developers and users will become even more blurry, as more people without ML skills will be able to successfully adapt foundation models for their own cases, and we’ll see the Cambrian explosion of new AI-based products solving real needs.

Resources

  1. The original paper: “On the Opportunities and Risks of Foundation Models”.
  2. The Center for Research on Foundation Models (CRFM)
  3. Recordings from the Workshop on Foundation Models from August 23–24, 2021
  4. Commentaries Responding to “On the Opportunities and Risks of Foundation Models”
  5. The authors’ reflections on the community response: “Reflections on Foundation Models” from October 2021
