What is Meta-Learning? Benefits, Applications and Challenges
Jan Macdonald (PhD)
Data-driven algorithms, such as machine learning and particularly deep learning models, have achieved unprecedented successes in diverse application areas, ranging from computer vision to audio and signal processing to natural language processing. Most commonly, machines “learn” to solve a specific task in a supervised manner by observing a large amount of labeled example data. Think of an image classification model that learns to distinguish different animals by being presented with many example images of each animal type. This differs significantly from the way we humans tend to learn: having recognized different animals repeatedly throughout our lives, we are able to learn the concept of a new type of animal after seeing only very few examples. Incorporating such “adaptive” learning strategies into the field of machine learning is at the core of meta-learning. This was already explored in the 1980s and 1990s, e.g., by Schmidhuber (Schmidhuber, 1987) and Bengio et al. (Bengio et al., 1991). Recently, with the rapid improvements in deep learning, interest in neural-network-based meta-learning approaches has increased and a wide range of variants has been proposed and developed. We will take a more detailed look at a selection of them below.
What is meta-learning?
In a nutshell, meta-learning refers to the ability of “learning to learn”. A more comprehensive definition would describe meta-learning as any system that includes a learning subsystem that is dynamically adapted by exploiting experience from previous learning episodes or from different tasks (Lemke et al., 2015). Hence, in contrast to traditional supervised learning, meta-learning does not only consider a single task with a fixed and large dataset but rather a collection of related tasks (each with its own and often much smaller task-specific dataset). It aims at extracting general information about the learning process on individual tasks in order to improve adaptability to novel tasks.
What are the benefits of meta-learning?
A key advantage of meta-learned systems is their flexibility and potential for fast adaptation from little data. This can help to overcome many drawbacks of traditional machine learning algorithms, such as the need for large datasets, high training costs, long training times, and the substantial effort spent on repeated training trials and extensive hyperparameter tuning. In contrast, meta-learning significantly lowers the requirement for large amounts of task-specific training data, which is especially important in applications where high-quality labeled data is scarce, expensive, or time-consuming to obtain. This decreased demand for data also comes with reduced training time and cost during task adaptation. Finally, a well-trained meta-learned system is a generalized model that can be efficiently used to solve multiple related tasks instead of just one. It can also achieve higher prediction accuracy on individual tasks by exploiting insights from the other tasks.
What are the applications of meta-learning?
Due to their flexibility, meta-learning systems can be used in many different applications, ranging from computer vision to language and speech processing to reinforcement learning. The most prominent examples in computer vision are few-shot image classification and few-shot object detection, i.e., classification or detection from only very few example images per class or per object. Similarly, meta-learning can be used for language processing via few-shot learning for word prediction or machine translation. Meta-learning approaches are also being used in reinforcement learning, which inherently requires exploiting previous experience and navigating different and changing environments, e.g., in robotics or autonomous driving.
At dida we are currently exploring and comparing different meta-learning approaches for a remote sensing and earth observation application as part of the “PretrainAppEO” project (joint with TU Munich, funded by the German Federal Ministry for Economic Affairs and Energy (BMWI)). In this case, the goal is a crop type classification from time-resolved satellite imagery of agriculturally used areas. Meta-learning is a promising candidate to leverage earth observation insights from geographical regions with more densely available data when adapting to tasks in regions for which only little data is available.
What are current meta-learning approaches?
Current meta-learning approaches can be broadly grouped into three categories.
Model-based meta-learning: Cyclic or recurrent neural network models (e.g., LSTMs) with internal or external memory can adapt their state by reading in a short sequence of task specific training data. The overall model parameters are learned over many different tasks. Examples in this category include memory-augmented neural networks (Santoro et al., 2016) and neural attentive meta-learners (Mishra et al., 2018).
Metric-based meta-learning: Effective and task-adapted distance metrics, e.g., through learned neural network embeddings, are combined with non-parametric techniques, such as nearest neighbor classification, during inference. Examples include prototypical networks (Snell et al., 2017), matching networks (Vinyals et al., 2016), and relation networks (Sung et al., 2018).
Optimization-based meta-learning: Insights about the optimization processes of training a model on different tasks are inferred and aggregated. This allows for a joint optimization of hyperparameters and parameter initializations across tasks and thus for faster adaptation during individual task specific training. The most prominent example is the model-agnostic meta-learning (MAML) approach (Finn et al., 2017).
In the following, we will focus on the third category, as it is most widely applicable and independent of the chosen neural network architecture.
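To make the metric-based category above a little more concrete, here is a minimal sketch of nearest-prototype classification in the spirit of prototypical networks. It is an illustration under simplifying assumptions, not the method of any of the cited papers: the `embed` function is a hypothetical placeholder (the identity), whereas a real system would use a meta-learned neural network embedding, and the two-dimensional toy data is made up.

```python
import math


def embed(x):
    """Placeholder embedding (identity). Assumption for illustration only:
    in practice this would be a neural network learned across many tasks."""
    return list(x)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_prototypes(support):
    """Compute one prototype per class as the mean embedded support example.

    support: dict mapping a class label to a list of example vectors.
    """
    return {
        label: mean_vector([embed(x) for x in examples])
        for label, examples in support.items()
    }


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest (Euclidean)."""
    q = embed(query)
    return min(prototypes, key=lambda label: math.dist(q, prototypes[label]))


# A made-up 2-way 2-shot support set in a two-dimensional feature space.
support = {
    "cat": [[0.9, 0.1], [1.1, -0.1]],
    "dog": [[-1.0, 0.2], [-0.8, 0.0]],
}
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.3], prototypes)
print(prediction)
```

No parameters are updated at inference time here: adapting to a new task only requires recomputing the prototypes from the new support set, which is what makes this family of methods non-parametric during inference.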
A closer look at optimization-based meta-learning
While ordinary supervised learning is about learning an approximation of a data distribution for a single fixed task, meta-learning in contrast aims at learning to learn, i.e., adapting a learning algorithm over multiple learning episodes with different related tasks to improve the learning results on future tasks. This can be formulated as a bilevel optimization: instead of a single dataset, assume that we have access to a distribution of tasks $$p(\mathcal{T})$$ with each task $$(\mathcal{D}_\text{train}, \mathcal{D}_\text{test}, \mathcal{L}) \sim p(\mathcal{T})$$ consisting of a training dataset $$\mathcal{D}_\text{train}$$ (called support set), a testing dataset $$\mathcal{D}_\text{test}$$ (called query set), and a task-specific loss function $$\mathcal{L}$$ (this could also be the same loss for all tasks). A parametrized machine learning model $$M[\theta]$$, e.g., a neural network, with learnable parameters $$\theta$$ can be trained on one such task via
$$\theta^\ast = \mathcal{A}(\mathcal{D}_\text{train}, \mathcal{L}, \omega),$$
where $$\mathcal{A}$$ refers to the chosen training algorithm, e.g., mini-batch SGD, and $$\omega$$ are hyperparameters of the learning algorithm, e.g., learning rate, architecture hyperparameters such as the number of layers, and initial model parameters $$\theta_0$$. The model can be evaluated for this task on the test set according to the loss $$\mathcal{L}(M[\theta^\ast], \mathcal{D}_\text{test})$$. During the meta-learning process the hyperparameters $$\omega$$ (or parts thereof) are optimized over all tasks,
$$\omega^\ast = \arg\min_\omega \, \mathbb{E}_{(\mathcal{D}_\text{train}, \mathcal{D}_\text{test}, \mathcal{L}) \sim p(\mathcal{T})} \, \mathcal{L}_\text{meta}\big(M[\theta^\ast], \mathcal{D}_\text{test}\big) \quad \text{with} \quad \theta^\ast = \mathcal{A}(\mathcal{D}_\text{train}, \mathcal{L}, \omega),$$
where $$\mathcal{L}_\text{meta}$$ is some suitable loss function for the meta-learning objective. A common meta-learning setup is $$n$$-way $$k$$-shot classification, where each $$\mathcal{D}_\text{train}$$ contains a total of $$n\cdot k$$ samples from $$n$$ different classes ($$k$$ samples per class), the same classification loss is used for all tasks, e.g., softmax cross-entropy, and the meta-objective considers the classification evaluation on individual tasks, i.e.,
$$\mathcal{L}_\text{meta}\big(M[\theta^\ast], \mathcal{D}_\text{test}\big) = \mathcal{L}\big(M[\theta^\ast], \mathcal{D}_\text{test}\big).$$
In practice, the expectation over all tasks in the meta-learning objective is replaced by a sample of tasks used for estimating $$\omega^\ast$$ (meta-train set) and a separate set of tasks can be held back for evaluating the meta-trained model afterwards (meta-test set), similar to standard supervised training dataset splits, see Figure 1 for an example.
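The $$n$$-way $$k$$-shot setup and the meta-train/meta-test split described above can be sketched as follows. This is only an illustration under stated assumptions: the function `sample_episode`, the parameter `n_query` (number of query samples per class), the string placeholders standing in for real samples, and the choice to split at the level of classes are all hypothetical and not taken from a specific library or from the original experiments.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one n-way k-shot task as a (support set, query set) pair.

    data_by_class: dict mapping a class name to a list of samples.
    Labels are re-indexed per episode to 0, ..., n_way - 1.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw disjoint support and query samples for this class.
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query


# Toy dataset: 6 classes with 10 placeholder samples each.
data_by_class = {
    f"class_{c}": [f"sample_{c}_{i}" for i in range(10)] for c in range(6)
}

# Assumed split: tasks for meta-training and meta-testing use disjoint classes.
all_classes = sorted(data_by_class)
meta_train = {c: data_by_class[c] for c in all_classes[:4]}
meta_test = {c: data_by_class[c] for c in all_classes[4:]}

rng = random.Random(0)
support, query = sample_episode(meta_train, n_way=3, k_shot=2, n_query=3, rng=rng)
print(len(support), len(query))
```

Each call yields a new task, so repeatedly calling `sample_episode` on `meta_train` plays the role of the sample of tasks that replaces the expectation in the meta-objective.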
A popular meta-learning algorithm that fits into this framework is model-agnostic meta-learning (MAML). The main idea of MAML is to explicitly use a fixed number of SGD steps for the learning algorithm $$\mathcal{A}$$, i.e., starting from some initialization $$\theta_0$$ it proceeds by setting
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(M[\theta_t], \mathcal{D}_\text{train}) \quad \text{for } t = 0, \dots, T-1, \qquad \theta^\ast = \theta_T,$$
where $$T$$ is the chosen number of task adaptation steps (this is called the inner loop of MAML). Further, MAML restricts the meta-trainable parameters $$\omega$$ to include only the model parameter initialization $$\theta_0 = \omega$$. All other hyperparameters, e.g., the learning rate $$\alpha$$, are externally set and not meta-learned. The optimization of the meta-parameters is also iterative and gradient-based (this is called the outer loop of MAML). For the simplest case of $$T=1$$ adaptation steps the MAML algorithm can be summarized with the following update rule, per task from the meta-train set,
$$\theta_0 \leftarrow \theta_0 - \beta \nabla_{\theta_0} \mathcal{L}\big(M[\theta_0 - \alpha \nabla_\theta \mathcal{L}(M[\theta_0], \mathcal{D}_\text{train})], \mathcal{D}_\text{test}\big),$$
where $$\beta$$ is the learning-rate for the meta-training (called meta-learning-rate).
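As a hedged illustration of the inner and outer loop for $$T=1$$, the following sketch runs MAML on a deliberately tiny toy problem. Everything about the setup is an assumption made for this example: tasks are one-dimensional regressions $$y = a \cdot x$$ with a task-specific slope $$a$$, the model is $$M[\theta](x) = \theta x$$ with a single scalar parameter, and the loss is the mean squared error. For this model the second-order term can be written down by hand, so no automatic differentiation is needed; with a neural network one would rely on an autodiff framework instead.

```python
import random


def loss(theta, data):
    """Mean squared error of the model M[theta](x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Derivative of the loss with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)


def hessian(data):
    """Second derivative of the loss with respect to theta (constant in theta)."""
    return sum(2 * x * x for x, _ in data) / len(data)


def sample_task(rng, n_points=5):
    """Toy task: y = a * x with a random slope a; returns (D_train, D_test)."""
    a = rng.uniform(1.0, 3.0)

    def draw():
        return [(x, a * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n_points))]

    return draw(), draw()


def maml_meta_gradient(theta0, d_train, d_test, alpha):
    """Gradient of the query loss after one adaptation step w.r.t. theta0."""
    theta1 = theta0 - alpha * grad(theta0, d_train)      # inner loop, T = 1
    dtheta1_dtheta0 = 1.0 - alpha * hessian(d_train)     # second-order term
    return grad(theta1, d_test) * dtheta1_dtheta0        # chain rule


rng = random.Random(0)
alpha, beta = 0.1, 0.05   # adaptation and meta-learning rates (set by hand)
theta0 = 0.0              # meta-learned initialization
for _ in range(2000):     # outer loop, one sampled task per update
    d_train, d_test = sample_task(rng)
    theta0 -= beta * maml_meta_gradient(theta0, d_train, d_test, alpha)

print(theta0)
```

In this toy setting the learned initialization should settle near the middle of the slope range, i.e., at a starting point from which a single gradient step adapts well to any of the sampled tasks.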
Several variations of MAML have been proposed, see also Figure 2:
FOMAML (First-Order MAML) (Finn et al., 2017): Standard MAML can quickly become computationally expensive due to the necessity of computing second-order derivatives. FOMAML simply ignores all second-order derivatives arising from the chain rule in the MAML update rule.
ANIL (Almost No Inner Loop) (Raghu et al., 2020): Splitting the model parameters $$\theta=[\theta_\text{backbone}, \theta_\text{head}]$$ into a model backbone and a classification head (final layer), it was observed that $$\theta_\text{head}$$ is usually updated much faster during the task adaptation compared to $$\theta_\text{backbone}$$. ANIL explicitly only updates $$\theta_\text{head}$$ during the $$T$$ adaptation steps, thus reducing computational costs.
Reptile (Nichol et al., 2018): Also aiming at scalability by removing the need for second-order derivatives, Reptile simplifies MAML by replacing the gradient-based meta-update with a step along the direction $$(\theta_T-\theta_0)$$ from the start to the end of the $$T$$ task adaptation steps, i.e., $$\theta_0 \leftarrow \theta_0 + \beta(\theta_T-\theta_0)$$.
Meta-SGD (Li et al., 2017), Alpha-MAML (Behl et al., 2019), ALFA (Baik et al., 2020): Several other variations combine MAML with ideas to meta-learn additional hyperparameters, most notably the adaptation learning rate $$\alpha$$ and meta-learning-rate $$\beta$$, thus removing some manual parameter tuning that is required for vanilla MAML.
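To contrast two of the variations listed above, here is a small sketch of a single FOMAML and a single Reptile meta-update. It reuses an assumed toy setting that is not part of the original papers: a scalar model $$M[\theta](x) = \theta x$$ with mean squared error loss and hand-picked data from a task with slope 2. The function names are hypothetical, and for simplicity the Reptile step adapts on the support set only.

```python
def grad(theta, data):
    """Derivative of the mean squared error of M[theta](x) = theta * x."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)


def adapt(theta0, d_train, alpha, steps):
    """Inner loop: `steps` gradient descent steps on the support set."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * grad(theta, d_train)
    return theta


def fomaml_step(theta0, d_train, d_test, alpha, beta, steps):
    """First-order MAML: use the query gradient at theta_T directly and
    ignore how theta_T depends on theta0 (no second-order derivatives)."""
    theta_T = adapt(theta0, d_train, alpha, steps)
    return theta0 - beta * grad(theta_T, d_test)


def reptile_step(theta0, d_train, alpha, beta, steps):
    """Reptile: move the initialization towards the adapted parameters."""
    theta_T = adapt(theta0, d_train, alpha, steps)
    return theta0 + beta * (theta_T - theta0)


# One toy task with slope 2 (made-up data for illustration).
d_train = [(1.0, 2.0), (-1.0, -2.0)]
d_test = [(0.5, 1.0)]

theta_fomaml = fomaml_step(0.0, d_train, d_test, alpha=0.1, beta=0.5, steps=2)
theta_reptile = reptile_step(0.0, d_train, alpha=0.1, beta=0.5, steps=2)
print(theta_fomaml, theta_reptile)
```

Both updates only need first-order gradients, so neither has to backpropagate through the adaptation steps, which is where the savings in memory and compute relative to full MAML come from.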
What are the current challenges and future directions of meta-learning?
While traditional supervised learning typically requires a large labeled dataset, meta-learning can deal with much fewer labeled samples per task. However, this is based on the assumption that a sufficient number of meta-training tasks with enough but not too much task variability is available. This is not always the case and high quality meta-train data can be hard to come by in many applications. If the task variability in the training data is too low, then a meta-learned system will eventually be able to solve individual tasks without further task-specific adaptation. In this case the meta-learning “saturates” and the system will not achieve its potential to flexibly adapt to unseen tasks or generalize to slightly off-distribution tasks (meta-overfitting). On the other hand, if the task variability is too high, then any insights gained from one task might not transfer reasonably well to another task. In this situation, a single meta-learned system might not be able to adaptively achieve high accuracy on all tasks, and in fact aiming to achieve this goal could even hinder its performance on all tasks (meta-underfitting). A similar phenomenon has also been observed and analyzed in multi-task learning but is not yet well-studied in the context of meta-learning.
Another challenge in meta-learning is the computational cost during the meta-learning (pretraining) phase (in contrast to the final task adaptation and inference, which are designed to be computationally much cheaper). This becomes apparent when considering the bilevel optimization formulation above and MAML in particular: it is computationally more expensive than standard supervised training via SGD, both in terms of time (each outer loop update requires multiple inner loop update steps) and in terms of memory (outer loop updates require second-order derivatives, and automatic differentiation by backpropagation through all inner loop steps requires storing the intermediate results of each step). We have already discussed some MAML variations that aim at circumventing some of these challenges.
Finally, an interesting observation is that meta-learning methods (which aim to learn meta-information about learning algorithms) are themselves data-driven and thus learned algorithms, which have their own set of hyperparameters and adjustable controls that could be attached as meta-information about a meta-learning system. This raises the question: Why should one stop at the “simple” meta-learning level? While it is considered mostly impractical due to a lack of suitable data and compute resources, it is in principle imaginable to extend the meta-learning idea to further meta-levels—think of meta-meta-learning (learning about learning to learn).
Conclusion
There has been a growing interest in neural-network-based meta-learning, particularly in application areas where little task-specific labeled data is available. The main idea of meta-learning is to leverage meta-information collected from the learning processes across multiple related tasks in order to improve the individual learning on all of these tasks (learning to learn). This was originally motivated by the way humans are able to learn quickly by relying on lifelong previous experience and, in contrast to traditional supervised machine learning systems, require only very little example data in order to learn and understand a new concept. Various approaches to meta-learning have been proposed and we have discussed a selection of them in more detail. This is still a very active research field and our short overview does not claim to be an exhaustive summary by any means. There are many open questions and opportunities for applications of meta-learning and we encourage everyone to explore the field further.
Frequently Asked Questions
How is transfer learning related to meta-learning?
Transfer learning involves applying knowledge from one task (often rather general, with a large labeled dataset) to another task (often more specific, with less labeled data). This is typically done by using the pre-trained model parameters from the general task as initialization for fine-tuning the model parameters on the specific task. Meta-learning, in contrast, focuses on learning strategies for quickly adapting a model to multiple equally specific tasks.
How is multi-task learning related to meta-learning?
Both multi-task and meta-learning have the goal of improving the performance of a learned model by leveraging shared insights from multiple related tasks. Their main difference is that multi-task learning typically considers learning all tasks simultaneously, while meta-learning allows for sequential learning episodes with task-specific adaptations.
How is ensemble learning related to meta-learning?
Ensemble learning methods seek to improve the predictive performance for a single task by combining multiple learned systems (called base learners) into a more powerful model. On the other hand, meta-learning aims at improving predictive performance by learning how to adapt a model for different related tasks. In some cases, meta-learning can be used in combination with ensemble strategies, e.g., meta-learning how to best combine multiple base learners.
How is AutoML related to meta-learning?
Automated machine learning (AutoML) refers to the general process of automating machine learning applications. Meta-learning can be considered as just one possible technique used within AutoML. However, AutoML also uses many other approaches such as (Bayesian) hyperparameter optimization, automated feature extraction and selection, or neural architecture search (NAS).
References
Baik, S., Choi, M., Choi, J., Kim, H., & Lee, K. M. (2020). Meta-Learning with Adaptive Hyperparameters. Advances in Neural Information Processing Systems, 33.
Behl, H. S., Baydin, A. G., & Torr, P. H. S. (2019). Alpha MAML: Adaptive Model-Agnostic Meta-Learning [arXiv:1905.07435].
Bengio, Y., Bengio, S., & Cloutier, J. (1991). Learning a synaptic learning rule. IJCNN-91-Seattle International Joint Conference on Neural Networks, ii, 969.
Finn, C., Abbeel, P., & Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 1126-1135.
Lemke, C., Budka, M., & Gabrys, B. (2015). Metalearning: a survey of trends and technologies. Artificial intelligence review, 44(1), 117-130.
Li, Z., Zhou, F., Chen, F., & Li, H. (2017). Meta-SGD: Learning to Learn Quickly for Few-Shot Learning [arXiv:1707.09835].
Mishra, N., Rohaninejad, M., Chen, X., & Abbeel, P. (2018). A Simple Neural Attentive Meta-learner. International Conference on Learning Representations.
Nichol, A., Achiam, J., & Schulman, J. (2018). On First-Order Meta-Learning Algorithms [arXiv:1803.02999].
Raghu, A., Raghu, M., Bengio, S., & Vinyals, O. (2020). Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML [arXiv:1909.09157].
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., & Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, 1842-1850.
Schmidhuber, J. (1987). Evolutionary Principles in Self-Referential Learning [Diploma Thesis]. Technische Universität München, Germany.
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical Networks for Few-shot Learning. Advances in Neural Information Processing Systems, 30.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., & Hospedales, T. M. (2018). Learning to Compare: Relation Network for Few-Shot Learning. IEEE Conference on Computer Vision and Pattern Recognition, 1199-1208.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. Advances in Neural Information Processing Systems, 29.