Editorial

Looking Under the Hood at Generative AI History

10 minute read
By Frank Palermo
Uncover the inner workings of generative AI in Part 2 of our series. We demystify how large language models operate and explore trends in AI chat platforms.

The Gist

  • Historical context. Generative AI history highlights the evolution from large language models to foundational models.
  • Technical nuances. Generative AI, LLMs, and foundational models have distinct functions and scopes, rooted in neural networks.
  • AI landscape. Rapid advancements in LLMs like GPT-4 and competition from global tech giants are shaping the future of AI.

In this second part of our series on generative AI history, we will look under the hood to understand the foundations of generative AI and how large language models (LLMs) work their magic. We will also explore the current landscape of AI chat platforms and understand where they are headed.


Generative AI History: Understanding the Mechanics of Language Models

The terms generative AI, large language models (LLMs) and foundation models are now used everywhere, often interchangeably, despite meaning different things.

Generative AI is a sub-segment of the broader AI discipline, specifically deep learning, and is distinguished by the ability to produce new content: chats, synthetic data or even deepfakes. It typically starts with a user prompt and can iterate to explore various content responses. Generative AI therefore refers to AI systems whose primary function is to “generate” content. These span a variety of AI systems such as image generators (Midjourney, DALL-E or Stable Diffusion), large language models (GPT-4, PaLM) or code generation tools (GitHub Copilot).

Language models are a class of general-purpose, probabilistic models explicitly tailored to identify and learn statistical patterns in natural language. They are typically based on neural networks. Large language models (LLMs) are language models consisting of neural networks that have been trained on large amounts of text. LLMs differ from earlier natural language processing (NLP) models in that they consider larger sequences of text, which gives them better context and ultimately better, more relevant responses. While there is no precise definition, a “large” language model typically refers to a neural network with tens of millions to billions of parameters.

Foundational models are technically a superset of LLMs, which focus on language prediction. The term was popularized at Stanford University's Institute for Human-Centered Artificial Intelligence (HAI) to describe more general-purpose models that can be adapted to a broad range of uses. A foundation model is trained on broad data using self-supervision and can be adapted to a wide range of downstream tasks (hence the concept of a “foundation”).

LLMs consist of several important building blocks. Tokenization is the process of converting text into tokens that the model can understand. Embedding then converts these tokens into vector representations that the model can process. Attention mechanisms allow the model to weigh the importance of each part of the input in a given context. Pre-training is the process of training the LLM on a large dataset, usually in an unsupervised or self-supervised way. Transfer learning is the technique used to fine-tune the model for the highest performance on specific tasks.
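
To make the first three of those building blocks concrete, here is a minimal sketch in Python using NumPy. The toy word-level vocabulary, the tiny embedding size and the random weights are purely illustrative; real LLMs use subword tokenizers and learned weights at vastly larger scale.

```python
# A minimal, illustrative sketch of tokenization, embedding and self-attention.
import numpy as np

rng = np.random.default_rng(0)

# 1. Tokenization: map words to integer IDs (toy whitespace scheme; real LLMs use subwords).
text = "the model predicts the next token"
vocab = {w: i for i, w in enumerate(dict.fromkeys(text.split()))}
token_ids = np.array([vocab[w] for w in text.split()])

# 2. Embedding: look up a vector for each token ID.
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))
x = embedding_table[token_ids]                      # shape: (seq_len, d_model)

# 3. Self-attention: each position weighs every other position by relevance.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)                 # similarity between positions
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
attended = weights @ v                              # context-aware representations

print(attended.shape)  # (6, 8): one context-mixed vector per token
```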

It's important to note that LLMs are not “fact machines” that answer questions directly. The basic premise of a language model is its ability to predict the next word or sub-word (called tokens) based on the text it has observed so far in its training data. Typically, it is the token with the highest probability that is used as the next part of the response.
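
As a simple illustration of that prediction step, the snippet below takes a hypothetical probability distribution over four candidate tokens and picks the next token two ways: greedily (the highest-probability token, as described above) and by sampling, which real systems often use to produce more varied text. The vocabulary and probabilities are made up for illustration.

```python
# Illustrative only: choosing the next token from a model's probability distribution.
import numpy as np

vocab = ["cat", "dog", "car", "tree"]
next_token_probs = np.array([0.55, 0.25, 0.15, 0.05])  # hypothetical model output

greedy_choice = vocab[int(np.argmax(next_token_probs))]              # highest probability: "cat"
sampled_choice = np.random.default_rng(1).choice(vocab, p=next_token_probs)  # sampled alternative

print(greedy_choice, sampled_choice)
```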

It's also important to recognize that LLMs are narrowly focused and fundamentally predictive models for text, language and, more recently, image data. At the core, an LLM is a language predictor that gives you a sentence or paragraph based on specific inputs. On the surface, this seems like a profound advancement in AI because it mimics the natural way humans interact with one another.

Applications like ChatGPT are essentially large language models developed to generate text. ChatGPT is optimized for dialogue with its user and learns from human demonstrations, a method called reinforcement learning from human feedback (RLHF).

So, while this has largely been heralded as a new era in AI, the fact is that it is a steady evolution resulting from the digitization of assets and the accumulation of data, which is increasing by orders of magnitude every year. This has greatly enhanced our ability to make predictions and build predictive models. The general trajectory of machine learning, data generation and our ability to make better predictions is continuing as it has over the past couple of decades.

Related Article: How to Pick the Right Flavor of Generative AI

Transformers Get All the Attention They Need

The transformer architecture was first introduced in 2017 in a Google paper called “Attention Is All You Need” and significantly changed the way we think about models.

A transformer is a deep learning model based on self-attention, a machine learning technique that mimics cognitive attention: it devotes more focus to the important parts of the data and captures relationships between different elements of sequential data, for instance the words in a sentence.

The attention mechanism in natural language processing is one of the most valuable developments in deep learning in the last decade. It replaced the previous recurrent neural network (RNN) encoder/decoder translation systems. Previously, NLP models leveraged supervised learning from large amounts of manually labeled data, which limited their use on data sets that were not well annotated. That process was also very time-consuming and extremely expensive.

Attention-based systems in machine learning can be thought of in three parts: a process that reads the raw data and converts it into distributed representations (e.g., a word position paired with a feature vector), a memory that stores the list of feature vectors in sequence, and a process that correlates the content of that memory against each element currently being processed.

Most current foundational models use a transformer architecture. Transformers are computationally efficient: calculations occur in parallel, which is a huge advantage over typical recurrent networks, which must operate sequentially. That parallelism also makes training easier to scale than with RNN-based models, whose sequential processing becomes a bottleneck as models grow.

The transformer architecture is now the de facto standard for deep learning applications like natural language processing, computer vision, audio, speech and many more. Transformer networks deliver strong accuracy with less complexity and lower computational cost, and it's now easier than ever to build tools and models across many use cases.

Related Article: What Are Large Language Models (LLMs)? Definition, Types & Uses

LLMs' Rapid Evolution Fuels the Future of AI

Over the past decade, LLMs' rapid evolution and breakthroughs have transformed natural language processing, unlocking new possibilities for businesses to enhance efficiency, productivity, and customer experience.

ELMo (Embeddings from Language Models), created by the Allen Institute for Artificial Intelligence, was one of the first foundational models, based on LSTM (Long Short-Term Memory) technology. This allows embeddings that are context-sensitive, producing different representations for words that share the same spelling but have different meanings. Previous language models such as GloVe and Word2Vec only produced an embedding based on the spelling of the word, not the context in which the word was being used. ELMo was also one of the first truly “large” language models, trained on a text corpus of over 5.5 billion words.

The first LLMs using the transformer architecture arrived on the scene in 2018: the generative pre-trained transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT), a family of language models introduced by researchers at Google.

BERT is a bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. GPT uses supervised learning and reinforcement learning from human feedback. It interprets user prompts and then generates sequences of responses based on the data it was trained on.

While both GPT and BERT are based on the transformer architecture, they use different encoding/decoding designs. Encoders look in both directions around a token while encoding the data, whereas decoders look only at the tokens on one side of the current token (typically those that came before it) when vectorizing and predicting it.
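
A small sketch of that distinction, assuming a four-token sequence: an encoder's attention mask lets every position attend to every other position, while a decoder's causal mask lets each position attend only to itself and earlier positions. The matrices below are illustrative, not taken from either model.

```python
# Illustrative attention masks: 1 means "this position may attend to that position."
import numpy as np

seq_len = 4
encoder_mask = np.ones((seq_len, seq_len), dtype=int)           # bidirectional: every token sees every token
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # causal: each token sees only earlier tokens

print(encoder_mask)
print(decoder_mask)
```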

BERT has 340 million parameters and is an encoder-only model. Encoder-only models are designed to produce a single prediction per input sequence, making them ideal for classification tasks but not great for text summarization. GPT is a decoder-only model and is autoregressive, meaning each output token is fed back in as input for the next step.
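
Here's a toy sketch of that autoregressive loop. The `fake_model` function is a made-up stand-in that just returns a deterministic next token ID; the point is only to show each output being appended to the input before the next prediction.

```python
# A toy autoregressive (decoder-style) generation loop, not a real GPT API.
def fake_model(tokens):
    # Pretend "model": derives the next token ID from the last one.
    return (tokens[-1] + 1) % 50

prompt = [7, 12, 3]          # token IDs for the user prompt
generated = list(prompt)
for _ in range(5):           # generate five more tokens
    next_token = fake_model(generated)   # predict from everything produced so far
    generated.append(next_token)         # feed the output back in as input

print(generated)
```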

While both are highly versatile models, BERT excels at sentiment analysis, named entity recognition and answering questions, whereas GPT excels at content creation, summarizing text and machine translation. Both have been trained on large volumes of text, and GPT may have been trained on one of the largest datasets ever amassed.


GPT has undergone rapid evolution over the past five years. GPT-1, first introduced in June 2018, was the first iteration of OpenAI’s large language model. It applied the transformer architecture and demonstrated the power of unsupervised learning in language understanding tasks, using books as training data to predict the next word in a sentence. The model was prone to generating repetitive text and struggled to track long-term dependencies in text.

GPT-3 was released in June 2020 and trained on over 500 billion tokens, which are common sequences of characters found in text. This allows the model to predict plausible follow-up text responses. The training data came from massive amounts of books, articles and documents, as well as content scraped from the open internet. It also made use of Common Crawl, a dataset maintained by a nonprofit organization that includes billions of web pages and is one of the largest text datasets available.

GPT-4 launched in March 2023 and enabled a wider variety of multimodal inputs such as images, documents, screenshots and even handwritten snippets. GPT-4 also provides considerable performance improvements and can understand more complex prompts. It has scored very high on academic tests such as the SAT, GRE and bar exams and has also posted top scores on reasoning benchmarks such as MMLU, the AI2 Reasoning Challenge (ARC), WinoGrande, HumanEval and DROP. GPT-4 also provides better information synthesis, stronger coherence and creativity, better problem-solving capabilities, more advanced linguistic expression to match the input sentiment, and stronger AI governance to avoid offensive or harmful content.

GPT-5 is already underway, and while OpenAI has not committed to a formal date, it can be expected to enhance many current capabilities in areas such as language comprehension, understanding of nuances like sarcasm and irony, and the ability to provide more human-like responses. GPT-5 is likely to have many more parameters than GPT-4, but that may matter less than understanding how best to govern and control how these models are used. Considerations around responsible AI will have to be part of the core platforms moving forward.

Related Article: ChatGPT and Google Bard Have Company: Baidu's ERNIE Bot

Generative AI History Catches Up: The Killer AI Apps Finally Take Us by Storm

AI has always suffered from a lack of ease of use and accessibility. While it powers many of the applications we use every day, getting access to it from your desktop or mobile phone was not easy.

That all changed when ChatGPT launched. It ushered in a new breed of AI applications that provide conversational interfaces to assist with a wide variety of tasks such as essay writing, code creation, summarizing content and many others.

ChatGPT was initially a web application based on GPT-3.5. It gained over 100 million users in two months (by January 2023), making it the fastest-growing app of all time. For comparison, it took nine months for TikTok, the previous fastest-growing app, to reach 100 million users.

ChatGPT makes use of both natural language processing (NLP) and a deep learning neural network. NLP is what gives the chatbot a more natural and conversational experience. The neural network is a complex, weighted algorithm modeled after the human brain; it learns patterns and relationships in data to predict what text should come next in a sentence and formulate a response that is more human-like.

Bard was launched by Google on Feb. 6, 2023, in response to ChatGPT’s incredible success. While the chat service itself was new, it is powered by Google's Language Model for Dialogue Applications (LaMDA), which was unveiled two years prior. Built on LaMDA, Bard has the potential to provide up-to-date information, unlike ChatGPT, which is based on data collected only up to 2021.

AI chatbots are not just getting attention in the US; launching these platforms has become a global race.

Baidu, the Chinese search engine giant, released its AI chatbot, Ernie Bot (Enhanced Representation through Knowledge Integration), in March 2023. Initial reports indicated that Ernie was not as sophisticated as ChatGPT and that it performs better in Chinese than in English. Topics such as politics also proved tricky for Ernie, given that China is subject to strict censorship.

However, Baidu recently stated that the latest release (Ernie 3.5) outperformed OpenAI’s ChatGPT and GPT-4 in several key areas such as Chinese language, reasoning and code generation.

Another Chinese tech giant, Huawei, is also working on its own AI chatbot, PanGu Chat, named after its LLM, which boasts an outstanding 1 trillion parameters; the chatbot is expected to launch in July 2023. The PanGu model is the industry’s largest and supports on-demand extraction. It excels in intelligent inspection and smart logistics, with improved small-sample learning capabilities and industry-leading performance. PanGu Chat is expected to initially be available only for business, and the company is reportedly planning to sell it to government and enterprise customers.

From this generative AI history perspective, it’s clear the global AI race is now fully raging.

Up next in this series: we will look at how the major players are emerging in the global AI race.


About the Author
Frank Palermo

Frank Palermo is currently Chief Operating Officer (COO) for NewRocket, a prominent ServiceNow partner and a leader in providing enterprise Agentic AI solutions. NewRocket is backed by Gryphon Investors, a leading middle-market private investment firm.

Main image: itsallgood on Adobe Stock Photos