Taalas Promises 10X Faster AI With Hard-Wired Llama Chip

Artificial intelligence may already outperform humans in narrow domains, but its path to ubiquity is being slowed by two constraints: high latency and major infrastructure costs.

Taalas claims it has a solution.

The company recently unveiled its first product, a hard-wired implementation of Llama 3.1 8B, delivered as both a chatbot demo and an inference API service. The startup says its silicon version dramatically reduces latency, cost and power consumption compared to conventional AI hardware.

The 2 Barriers Holding AI Back

AI’s promise is clear. In many focused applications, models already exceed human performance, and experts predict the technology will lead to a major reduction in workforces by 2030. When used well, AI serves as an amplifier of human productivity and creativity. But two issues currently limit adoption:

1. Latency

Interactions with large language models often lag behind human cognition.

Coding assistants may pause for minutes
Developers lose their state of flow
Agentic applications demand millisecond responses — not human-paced replies
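A rough back-of-the-envelope sketch shows why per-user generation speed matters for agentic workloads, where model calls run one after another. The step count, tokens per step and throughput figures below are illustrative assumptions, not numbers reported by Taalas.

```python
def chain_latency_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Time for an agent that makes `steps` sequential model calls,
    each generating `tokens_per_step` tokens at a fixed decode rate.
    Ignores prompt processing, network and tool-call time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return steps * tokens_per_step / tokens_per_second


# Hypothetical workload: 20 sequential calls of 500 generated tokens each.
baseline = chain_latency_seconds(20, 500, tokens_per_second=100.0)
faster = chain_latency_seconds(20, 500, tokens_per_second=1000.0)

print(f"baseline: {baseline:.0f} s, 10X faster decode: {faster:.0f} s")
```

Because the calls are sequential, total wait time scales inversely with decode speed, so any speedup in tokens per second shows up directly in end-to-end latency.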

2. Cost

AI has a voracious appetite for resources like energy, water and land. Modern AI deployments require:

Room-sized supercomputers
Hundreds of kilowatts of power
Liquid cooling systems
Advanced packaging and stacked memory
Massive I/O bandwidth and miles of cables

Scaling these systems means building sprawling data center campuses and absorbing extreme operational expenses.

From Model to Silicon in 2 Months

"Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes."

- Ljubisa Bajic, Co-Founder & CEO, Taalas

Founded roughly 2.5 years ago, Taalas has developed a platform that it says can turn any AI model into custom silicon within about two months of receiving the model. The resulting hardware implementations, which the company calls Hardcore Models, are each designed for a single neural network.

According to Taalas, compared to traditional software-based inference, these systems are:

~10X faster
~20X cheaper to build
~10X lower power

Taalas’ approach rests on three core principles:

1. Total Specialization

Rather than building general-purpose AI accelerators, the company creates model-specific silicon. AI inference, it argues, is the most critical computational workload in existence, and the one that benefits most from deep specialization.

2. Merging Storage and Computation

Modern AI hardware separates memory (dense, cheap DRAM) and compute (on-chip logic). This divide introduces severe inefficiencies:

Constraint	Impact
Off-chip DRAM access	Thousands of times slower than on-chip memory
Memory-compute separation	Requires advanced packaging
High-bandwidth memory (HBM)	Increases complexity and cost
Massive I/O bandwidth	Drives power consumption
Liquid cooling	Adds infrastructure burden

Taalas says its design eliminates this boundary by unifying storage and computation on a single chip at DRAM-level density.
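The cost of the memory-compute divide can be illustrated with a common roofline-style estimate: when generating text for a single user, every model weight has to be read from memory once per token, so memory bandwidth caps decode speed. The sketch below is a generic approximation, not Taalas' analysis. The 16-bit weight size and the 2 TB/s bandwidth figure are assumptions chosen for illustration, and the estimate ignores batching, caches and the KV cache.

```python
def decode_tokens_per_second_bound(
    params: float, bytes_per_param: float, bandwidth_bytes_per_s: float
) -> float:
    """Upper bound on single-user decode speed when all weights are
    streamed from memory once per generated token (batch size 1)."""
    bytes_per_token = params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


PARAMS_8B = 8e9               # parameter count of an 8B model
BYTES_PER_PARAM = 2.0         # assumption: 16-bit weights
HYPOTHETICAL_BANDWIDTH = 2e12 # assumption: 2 TB/s of memory bandwidth

bound = decode_tokens_per_second_bound(PARAMS_8B, BYTES_PER_PARAM, HYPOTHETICAL_BANDWIDTH)
print(f"bandwidth-bound decode speed: about {bound:.0f} tokens/s per user")
```

Under these assumptions the ceiling is set by how fast weights can be moved, not by how fast the chip can multiply, which is the inefficiency that keeping weights next to the compute logic is meant to remove.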

3. Radical Simplification

By removing the memory-compute divide and tailoring silicon to each model, Taalas redesigned its hardware stack from first principles. Its architecture does not rely on HBM, advanced packaging, 3D stacking, liquid cooling or high-speed I/O fabrics.

This engineering simplicity enables an order-of-magnitude reduction in total system cost, the company claims.

The Hard-Wired Llama: Performance Comparison

Taalas HC1 hard-wired with the Llama 3.1 8B model (Image: Taalas)

Taalas’ first public product is a hard-wired version of Llama 3.1 8B. The company selected the model for its small footprint, open-source availability and minimal logistical overhead. Although optimized for speed, the chip retains some flexibility through:

Configurable context window size
Support for fine-tuning via low-rank adapters (LoRAs)
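For readers unfamiliar with low-rank adapters, the sketch below shows the generic LoRA idea: the base weight matrix stays frozen and a small trainable low-rank update is added to its output. This is the standard formulation, not a description of how Taalas implements adapters in hardware, which the company has not detailed here. The tiny matrices and the `alpha` value are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen base weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the small trainable adapter matrices of rank r."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative values only: a 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
x = [2.0, 3.0]

print(lora_forward(W, A, B, x, alpha=1.0))
```

The adapter adds only `r * (d_in + d_out)` parameters per layer, which is why it can be swapped or updated while the much larger base weights remain fixed.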

Taalas reports performance in tokens per second per user. By that measure, the company says, its silicon Llama achieves nearly 10X the speed of the current state of the art, while costing 20X less to build and consuming 10X less power.

When development began on Taalas’ first-generation silicon platform, low-precision number formats had not yet been standardized. As a result, the chip relies on a non-standard low-precision scheme, and some quality degradation exists relative to GPU benchmarks. The company’s second-generation silicon will reportedly adopt standardized 4-bit floating-point formats to address these limitations while maintaining efficiency.
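Why low-precision storage can cost quality is easiest to see with a toy example. The sketch below uses simple symmetric 4-bit integer quantization, which is a stand-in chosen for clarity and is neither Taalas' first-generation scheme nor the 4-bit floating-point formats planned for the second generation. The weight values are invented.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-7, 7].
    Returns the integer codes and the scale needed to decode them."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


# Invented weights, for illustration only.
weights = [0.7, -0.33, 0.12, 0.02]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:", codes)
print("max rounding error:", max(errors))
```

Each weight is snapped to one of only 15 levels, so small values can round to zero. Across billions of parameters, such rounding errors are one source of the quality gap between heavily quantized models and higher-precision baselines.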

What’s Next for Taalas

Taalas has additional systems planned:

Spring: Mid-sized reasoning LLM (HC1 platform)
Winter: Frontier-scale LLM using second-generation HC2 silicon promising higher density and faster execution
