Zach Anderson
Apr 18, 2026 00:53
Elon Musk’s xAI releases Grok Speech to Text and Text to Speech APIs starting at $0.10/hour, claiming the lowest error rates across enterprise transcription benchmarks.
Elon Musk’s xAI dropped two standalone audio APIs on April 17, positioning Grok’s speech technology as a direct competitor to ElevenLabs, Deepgram, and AssemblyAI at aggressive price points.
The Grok Speech to Text API runs $0.10 per hour for batch processing and $0.20 per hour for real-time streaming. Text to Speech comes in at $4.20 per million characters. Both leverage the same infrastructure powering Tesla vehicles and Starlink customer support.
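To put those prices in concrete terms, here is a minimal cost sketch. It assumes billing scales linearly with usage, with no minimums, tiers, or rounding, which the announcement as reported here does not confirm. The function names are illustrative and are not part of any xAI SDK.

```python
# Prices quoted in the article (USD). Linear billing is an assumption.
STT_BATCH_PER_HOUR = 0.10
STT_STREAMING_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated Speech to Text cost for a given number of audio hours."""
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated Text to Speech cost for a given number of characters."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    # Hypothetical workload: 1,000 hours of audio and 2.5M characters of speech.
    print(f"Batch STT:     ${stt_cost(1000):.2f}")
    print(f"Streaming STT: ${stt_cost(1000, streaming=True):.2f}")
    print(f"TTS:           ${tts_cost(2_500_000):.2f}")
```

Under these assumptions, streaming transcription costs exactly twice what batch does for the same audio, so the choice between them is mainly a latency decision.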
Benchmark Claims Worth Scrutinizing
xAI’s published word error rates tell an interesting story. On phone call entity recognition (think names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That’s a significant gap if it holds up in production.
The company demonstrated this with a tricky test case: transcribing Welsh and Irish names like “Anghared Llewelyn Bowen” and “Oisin MacGiolla Phadraig” alongside mortgage details. Grok nailed it with zero errors. Competing models struggled with the pronunciations and formatted dates inconsistently.
Video and podcast transcription shows tighter competition: Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing slightly at 3.0% and 3.2% respectively.
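For readers who want to check figures like these against their own audio, word error rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below implements that textbook definition with a word-level edit distance. How xAI normalized text or scored entities is not described here, so its numbers may not be directly reproducible this way. The lowercase-and-split-on-whitespace normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not benchmark data.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.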
Technical Features for Developers
Beyond raw transcription, xAI built in features that enterprise customers actually need: word-level timestamps, speaker diarization across multiple audio channels, and support for 25+ languages with seamless switching.
The Inverse Text Normalization feature automatically converts spoken numbers, dates, and currencies into proper formats. “Four one four five five five one two three four” becomes a phone number. “Six ninety-nine” becomes $6.99. Small detail, but it eliminates post-processing headaches.
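To illustrate the kind of post-processing this replaces, here is a toy rule-based sketch that handles only the phone number case. It is not xAI’s implementation, which is not described here. The function name, the ten-digit restriction, and the US-style output format are all assumptions made for illustration.

```python
from typing import Optional

# Spoken digit words mapped to characters; "oh" is a common spoken zero.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def spoken_digits_to_phone(text: str) -> Optional[str]:
    """Format exactly ten spoken digits as a phone number, else return None."""
    words = text.lower().split()
    if len(words) != 10 or any(word not in DIGIT_WORDS for word in words):
        return None
    digits = "".join(DIGIT_WORDS[word] for word in words)
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


if __name__ == "__main__":
    print(spoken_digits_to_phone(
        "four one four five five five one two three four"
    ))  # (414) 555-1234
```

Even this narrow case needs explicit rules, and currencies such as “six ninety-nine” need context to tell $6.99 apart from 699, which is why handling it inside the transcription API saves work.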
Text to Speech includes inline tags for prosody control: whispers, laughs, sighs, emphasis, pacing adjustments. Developers can inject emotional nuance without wrestling with complex audio markup.
Strategic Context
This launch follows xAI’s acquisition of X Corp in March 2025 and comes as the company expands its infrastructure partnerships. Just two days before the API announcement, reports emerged that xAI plans to supply computing power to Cursor, the AI-powered coding startup.
The Colossus supercomputer, operational since December 2024, provides the backend muscle. xAI appears to be monetizing that capacity across multiple verticals: enterprise AI, developer tools, and now voice APIs.
For developers building voice agents or transcription tools, the pricing undercuts established players significantly. Whether Grok’s accuracy claims survive real-world deployment at scale remains the open question. The documentation and rate limits are available through xAI’s API console for those ready to test it.
Image source: Shutterstock

