On March 13, Amazon Web Services and Cerebras Systems announced a major collaboration aimed at setting a new standard for AI inference speed and performance in the cloud. The partnership will bring Cerebras's specialized AI hardware into AWS data centers for the first time, delivering what both companies describe as the fastest inference solution available for generative AI workloads.
How the Technology Works
The collaboration is built around a technique called inference disaggregation, which splits the AI inference process into two distinct stages and assigns each to the hardware best suited for it.
The first stage, known as prefill, involves processing the user's input prompt. AWS Trainium-powered servers handle this compute-intensive step, leveraging their strength in parallel processing of large input sequences. The second stage, decode, generates the AI model's output token by token. Cerebras CS-3 systems take over here, using their wafer-scale architecture to deliver rapid sequential generation.
By separating these stages rather than running both on the same hardware, the system can optimize each independently — eliminating the bottleneck that occurs when a single chip architecture must compromise between two fundamentally different computational patterns.
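To make the two phases concrete, here is a minimal, purely illustrative Python sketch of the prefill/decode split using a toy model. It is not AWS or Cerebras code: the embedding table, cache layout, and "attention" stand-in are invented for illustration, and in the announced system each phase would run on different hardware rather than in one process.

```python
import numpy as np

# Toy illustration of disaggregated inference: prefill builds a cache from the
# whole prompt in one parallel pass; decode then extends it one token at a time.
# All model details here are invented for illustration only.

VOCAB, DIM = 100, 16
rng = np.random.default_rng(0)
EMBED = rng.normal(size=(VOCAB, DIM))   # toy embedding table
W_OUT = rng.normal(size=(DIM, VOCAB))   # toy output projection


def prefill(prompt_tokens):
    """Compute-bound phase: process the full prompt in one parallel pass.
    In the announced setup, this stage would run on AWS Trainium."""
    return EMBED[prompt_tokens]          # (prompt_len, DIM) cache, built at once


def decode_step(cache):
    """Latency-bound phase: generate the next token from the cache.
    In the announced setup, this stage would run on Cerebras CS-3 systems."""
    context = cache.mean(axis=0)         # toy stand-in for attention over the cache
    logits = context @ W_OUT
    next_token = int(np.argmax(logits))
    cache = np.vstack([cache, EMBED[next_token]])  # extend the cache for the next step
    return next_token, cache


prompt = [1, 5, 42, 7]
cache = prefill(prompt)                  # stage 1: one batched pass over the prompt
generated = []
for _ in range(8):                       # stage 2: strictly sequential generation
    token, cache = decode_step(cache)
    generated.append(token)
print(generated)
```

Because the two loops have such different shapes (one wide parallel pass versus many tiny sequential steps), a system that assigns each to purpose-built hardware avoids tuning a single chip for both at once.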
Available Through Amazon Bedrock
AWS will be the first cloud provider to offer Cerebras's disaggregated inference solution, and it will be accessible exclusively through Amazon Bedrock. This means developers and enterprises can access the faster inference through the same API they already use for foundation models, without needing to manage specialized hardware directly.
The service is expected to launch within the next couple of months. AWS also plans to make open-source large language models and its own Amazon Nova models available on Cerebras hardware later this year.
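Since the offering is exposed through Amazon Bedrock, invoking it should look like any other Bedrock call. The sketch below uses boto3's standard bedrock-runtime Converse API; the model identifier and region are placeholders, because the announcement does not specify which model IDs will be served on Cerebras hardware.

```python
import boto3

# Standard Bedrock runtime invocation; nothing here is specific to Cerebras.
# The model ID below is a placeholder, not a real identifier from the announcement.
MODEL_ID = "example.placeholder-model-id"

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId=MODEL_ID,
    messages=[
        {"role": "user", "content": [{"text": "Summarize inference disaggregation in one sentence."}]}
    ],
    inferenceConfig={"maxTokens": 256},
)

# The Converse API returns the assistant message as a list of content blocks.
print(response["output"]["message"]["content"][0]["text"])
```

The point of routing the feature through this existing interface is that switching to the faster backend would be a model-selection change, not an infrastructure change.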
Why Inference Speed Matters Now
As AI applications move from experimental chatbots to production-grade agentic systems, inference latency has become a critical bottleneck. AI agents that need to reason through multi-step workflows, call external tools, and respond in real time demand inference speeds that current GPU-based solutions struggle to deliver consistently at scale.
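A rough back-of-the-envelope calculation shows why per-token speed compounds in agentic workloads. The figures below are illustrative assumptions, not benchmarks from either company:

```python
# Illustrative arithmetic only: the latency figures are assumptions chosen to
# show how generation speed compounds across sequential agent steps.
steps = 10               # sequential LLM calls in one agent workflow
tokens_per_step = 300    # output tokens generated per call

for label, tokens_per_second in [("40 tok/s", 40), ("1000 tok/s", 1000)]:
    total_seconds = steps * tokens_per_step / tokens_per_second
    print(f"{label}: ~{total_seconds:.0f} s of pure generation time per workflow")
# At 40 tok/s the workflow spends ~75 s just generating text; at 1000 tok/s, ~3 s.
```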
Cerebras has built its reputation on inference speed, with its wafer-scale engine architecture delivering dramatically lower latency than traditional GPU clusters. The company recently signed a $10 billion inference deal with OpenAI, signaling growing industry demand for specialized inference hardware.
Competitive Implications
The partnership positions AWS more aggressively against competitors in the AI infrastructure market. By integrating Cerebras hardware alongside its own Trainium chips, AWS is offering customers a best-of-both-worlds approach rather than forcing them onto a single silicon platform.
For Cerebras, the deal provides massive distribution through the world's largest cloud provider, a significant step for a company that has historically sold its hardware to a smaller set of enterprise and research customers. The collaboration could accelerate adoption of disaggregated inference as an industry-standard approach to serving large language models at scale.