One of the most intriguing aspects of scale is emergent abilities: larger models succeed at tasks that smaller ones cannot. This phenomenon has been particularly striking in LLMs, which perform well on an increasingly broad range of tasks and benchmarks as they grow in size.
2. Unsupervised learning is still effective.
In recent years, there has been tremendous progress in this field, particularly in LLMs, which are mostly trained on large sets of raw data gathered from the internet. LLMs continued to improve in 2022, but other unsupervised learning techniques also gained traction.
This year, for example, there were tremendous advances in text-to-image models. Models such as OpenAI's DALL-E 2, Google's Imagen, and Stability AI's Stable Diffusion demonstrate the power of unsupervised learning. Unlike previous text-to-image models, which required well-annotated pairs of images and descriptions, these models make use of large datasets of loosely captioned images already available on the internet. Because of the sheer size of their training datasets (which is only possible because no manual labeling is required) and the variability of their captioning schemes, these models can find all kinds of intricate patterns between textual and visual information. As a result, they are far more flexible at generating images for different purposes.
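One common way such loosely paired data is exploited is a CLIP-style contrastive objective: embed each image and its caption, then push matching pairs together and mismatched pairs apart within a batch. The sketch below is a minimal NumPy illustration of that idea under assumed embedding shapes and a made-up temperature; it is not the training code of any of the models named above.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of
    image/caption embedding pairs; row i of each matrix is
    assumed to come from the same web-scraped pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    def cross_entropy(logits):
        # The correct pairing lies on the diagonal: label i for row i.
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))       # nearly matched captions
loss_matched = clip_style_loss(img, txt)
loss_random = clip_style_loss(img, rng.normal(size=(4, 8)))  # shuffled captions
print(loss_matched < loss_random)                # matched pairs score lower loss
```

Because the captions are scraped rather than curated, the objective never needs per-image labels; the pairing itself is the supervision signal.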
Multimodality is critical to the intelligence found in humans and animals. For example, if you see a tree and hear the wind rustling in its branches, your mind will quickly associate the two. Similarly, when you hear the word "tree," you may immediately recall an image of a tree, the smell of pine after a rain, or other previous experiences.
Clearly, multimodality has played an important role in increasing the flexibility of deep learning systems. DeepMind's Gato, a deep learning model trained on a variety of data types including images, text, and proprioception data, was perhaps the best example of this. Gato performed well in a variety of tasks, including image captioning, interactive dialogues, controlling a robotic arm, and playing games. In contrast, traditional deep learning models are designed to perform a single task.
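One way a single model can consume such different data types is to serialize everything into one flat token sequence for a decoder-only Transformer. The sketch below illustrates that general idea under assumed vocabulary sizes and bin counts; the names and numbers are illustrative, not Gato's actual implementation.

```python
TEXT_VOCAB = 32000   # assumed text vocabulary size
NUM_BINS = 1024      # assumed number of bins for continuous values

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Discretize continuous values (e.g. joint angles) into bins and
    offset the IDs past the text vocabulary so they never collide."""
    tokens = []
    for v in values:
        v = min(max(v, low), high)  # clip to the expected range
        bin_id = int((v - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(TEXT_VOCAB + bin_id)
    return tokens

def build_episode(text_tokens, proprio, action):
    """Interleave modalities into one flat sequence, the way a
    single decoder-only model could consume them."""
    return text_tokens + tokenize_continuous(proprio) + tokenize_continuous(action)

seq = build_episode([17, 942, 5],         # e.g. a tokenized instruction
                    [0.12, -0.40, 0.88],  # e.g. robot arm joint readings
                    [0.05, 0.0])          # e.g. an action command
print(len(seq))  # 8 tokens: 3 text + 3 proprioception + 2 action
```

Once every modality lives in the same token space, one next-token objective covers captioning, dialogue, and control alike, which is what makes this framing attractive.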
Some researchers have gone so far as to suggest that a system like Gato is all that is required to achieve artificial general intelligence (AGI). While many scientists disagree, one thing is certain: multimodality has resulted in significant advances in deep learning.
These are some of the mysteries of intelligence that scientists in various fields are still investigating. Pure scale- and data-based deep learning approaches have aided in making incremental progress on some of these issues, but have yet to provide a definitive solution.
Larger LLMs, for example, can maintain coherence and consistency over longer stretches of text. They still struggle, however, with tasks that require careful step-by-step reasoning and planning.
Similarly, text-to-image generators produce stunning graphics but make basic errors when prompted with complex descriptions or scenes that require compositional reasoning.
These challenges are being discussed and investigated by a variety of scientists, including some of the deep learning pioneers. Among them is Yann LeCun, the Turing Award-winning inventor of convolutional neural networks (CNNs), who recently published a lengthy essay on the limitations of LLMs that learn solely from text. LeCun is working on a deep learning architecture that learns world models and can address some of the problems the field currently faces.
Deep learning has progressed significantly. However, as we make progress, we become more aware of the difficulties in creating truly intelligent systems. Next year will undoubtedly be as exciting as this one.