Meta has unveiled its latest breakthrough in large language models – the Llama 3 series of open source generative AI models. The company has initially released two powerful models in this new family – Llama 3 8B with 8 billion parameters and Llama 3 70B with a massive 70 billion parameters.
Meta Claims Llama 3 is Best-in-Class on Key AI Benchmarks
According to Meta, these models represent a “major leap” in performance compared to their predecessors, Llama 2 7B and Llama 2 70B. The tech giant boldly claims that, for their respective scales, Llama 3 8B and Llama 3 70B are among the most capable generative AI models available today.
This audacious assertion is backed up by the new models’ standout scores on several popular AI benchmarks used to evaluate capabilities such as knowledge, reasoning, skill acquisition and code generation: MMLU, ARC, DROP, GPQA, HumanEval, GSM-8K, MATH, AGIEval and BIG-Bench Hard.
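At their core, multiple-choice benchmarks like MMLU and ARC score a model by simple accuracy against an answer key. The function and sample data below are a minimal illustrative sketch, not Meta's actual evaluation harness (real harnesses also handle prompting, few-shot examples and answer extraction):

```python
# Minimal sketch of multiple-choice benchmark scoring (illustrative only).

def accuracy(predictions, gold_answers):
    """Fraction of questions where the model picked the gold answer."""
    assert len(predictions) == len(gold_answers)
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Hypothetical model outputs vs. answer key for four questions
preds = ["B", "C", "A", "D"]
gold  = ["B", "C", "B", "D"]
print(f"accuracy = {accuracy(preds, gold):.2f}")  # prints "accuracy = 0.75"
```

Reported benchmark numbers are essentially this fraction computed over thousands of such questions per task.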
Beating Mistral, Gemma, Gemini and Claude on Multiple Tests
Across these nine benchmarks, the 8-billion-parameter Llama 3 8B outperforms prominent open source peers such as Mistral’s Mistral 7B and Google’s Gemma 7B. Meanwhile, the larger 70B version proves competitive with vaunted commercial models such as Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 series on certain tests.
Specifically, Llama 3 70B bests Gemini 1.5 Pro on MMLU, HumanEval and GSM-8K. While it doesn’t quite match Anthropic’s highest-performing Claude 3 Opus, it outscores the mid-tier Claude 3 Sonnet on five benchmarks – MMLU, GPQA, HumanEval, GSM-8K and MATH.
Meta even developed its own in-house evaluation covering use cases like coding, writing, reasoning and summarization. Unsurprisingly, Llama 3 70B came out on top against Mistral Medium, OpenAI’s GPT-3.5 and Claude Sonnet – though the company acknowledges these results should be viewed with some skepticism given the test’s provenance.
Beyond benchmark metrics, Meta promises Llama 3 users will enjoy qualitative benefits like enhanced “steerability”, lower refusal rates for innocuous queries, and higher accuracy on knowledge areas like trivia, history and STEM fields. The models should also provide better general coding recommendations.
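“Steerability” is typically exercised through a system message that constrains the assistant's behavior. The sketch below hand-builds a prompt in Meta's published Llama 3 instruct format to make the structure visible; in practice you would let a tokenizer's chat template do this, and the example system/user strings are purely illustrative:

```python
# Sketch of steering a Llama 3 instruct model via a system message.
# The special tokens below follow Meta's published Llama 3 prompt format;
# normally a tokenizer chat template renders this for you.

def build_llama3_prompt(system: str, user: str) -> str:
    """Render one system + user turn in the Llama 3 instruct format."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    system="Answer in exactly one sentence.",  # the "steering" instruction
    user="Why is the sky blue?",
)
print(prompt)
```

The model generates its reply after the trailing assistant header, so everything the system message demands is in context before generation begins.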
Trained on 15 Trillion Token Dataset with Code and Multilingual Data
These improvements are enabled by an astronomical training dataset of 15 trillion tokens (on the order of 11 trillion words, at roughly three-quarters of a word per token) – seven times larger than Llama 2’s dataset. Drawn from “public” web sources, it contains four times more code samples and data from around 30 non-English languages to bolster multilingual performance. Controversially, Meta also utilized synthetic AI-generated data to create longer training examples, a technique with potential downsides.
While vague on precise data sources due to legal concerns over copyrighted material, Meta confirms no user data from its platforms like Facebook and Instagram was included. The company previously landed in hot water for improperly using copyrighted ebooks to train AI models against legal advice.
To mitigate issues like toxicity, bias and hallucinations that plagued Llama 2, Meta developed enhanced data filtering pipelines and updated its Llama Guard and CybersecEval safety tools. A new Code Shield tool aims to detect insecure code produced by generative models. However, given the limitations of such filters, real-world performance remains to be seen through third-party testing.
Roadmap: Multilingual, Multimodal Models Over 400B Parameters
The Llama 3 models currently support English output only, despite being trained on multilingual data. However, Meta has big plans for larger 400-billion-plus-parameter versions that are both multilingual and multimodal – able to understand images, video and other modalities alongside text – which would put the Llama 3 family on par with cutting-edge open models like Hugging Face’s Idefics2.
Meta AI With Llama 3 Now Live Across Facebook, Instagram, WhatsApp
Llama 3 is already powering Meta’s flagship “Meta AI” assistant, recently integrated into the search bars and messaging apps of Facebook, Instagram, WhatsApp and Messenger across over a dozen countries including the US, Canada, Australia and parts of Africa. The AI can respond to queries, provide web search results, generate images and even animate or convert images to GIFs – with claimed improvements in areas like text rendering in visuals.
However, the widespread deployment across so many apps also raises concerns around content moderation, as large language models are known to “hallucinate” and produce nonsensical or inappropriate outputs. Meta acknowledges it is continuously updating the models to improve.
Coming Soon: Llama 3 on Cloud Platforms With Hardware Optimizations
Looking ahead, Meta plans to roll out managed Llama 3 hosting across major cloud platforms like AWS, Google Cloud and Microsoft Azure, alongside hardware optimizations from AMD, Intel, Nvidia and others. The company is already working on even more advanced Llama 4 and 5 models as it pursues an audacious goal – to become “the leading AI in the world.”
Underpinning Meta’s AI ambitions is the narrative that it is reasserting its prowess in generative AI through open source releases like Llama 3, keeping pace with rumored future blockbusters like OpenAI’s GPT-5. However, significant debates persist around intellectual property constraints on training data, the validity and comprehensiveness of existing AI benchmarks, and the safe, ethical deployment of these rapidly scaling, multimodal systems.