Why Many "AI Speech Therapy" Tools Are Getting It Wrong — And How Spoken Is Different

January 2026

Spoken AI Speech Therapy

Over the last year, a wave of "AI-powered" speech therapy websites and apps has entered the market. On the surface, many of these look impressive: sleek interfaces, instant feedback, and bold claims about accuracy. But under the hood, most of these products are built on a fundamental misunderstanding of how speech therapy actually works — and how AI can be used to support it.

The Shortcut Many Tools Make

Many of the speech therapy tools on the market today rely on a simple, generic speech-to-text (STT) pipeline:

  1. Prompt the user to say a word
  2. Use a general-purpose STT model to convert the audio into text
  3. If the resulting text matches the expected word → Correct
  4. Otherwise → Incorrect

This approach is fast to build. It's cheap to scale. And it works reasonably well for adults dictating emails.

But it is not how speech therapy works, nor how it should work. In fact, this approach treats spelling accuracy as a proxy for sound production.
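To make the shortcut concrete, here is a minimal sketch of that pipeline. It is an illustration, not any particular product's code: `fake_stt` is a hypothetical stand-in for a general-purpose STT model, and its lookup table only imitates the way such a model tends to return the most likely word.

```python
def fake_stt(spoken: str) -> str:
    """Hypothetical stand-in for a general-purpose STT model.

    A real model works on audio; this lookup table only imitates the
    "return the most likely word" behavior for illustration.
    """
    likely_word = {
        "wabbit": "rabbit",  # mispronunciation mapped to a real word
        "marry": "merry",    # near-homophone confusion
    }
    return likely_word.get(spoken, spoken)


def naive_grade(expected_word: str, spoken: str) -> str:
    """Grade an attempt by comparing the transcript to the expected word."""
    transcript = fake_stt(spoken)
    if transcript.lower() == expected_word.lower():
        return "Correct"
    return "Incorrect"


# The grade depends only on what the transcript spells,
# not on which sounds the child actually produced.
print(naive_grade("rabbit", "wabbit"))  # Correct
print(naive_grade("marry", "marry"))    # Incorrect
```

Nothing in `naive_grade` ever looks at the target sound, which is why both kinds of mistake are possible.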

Example 1: "Rabbit" and the Problem of False Positives

Let's say a child is working on the r sound and is prompted to say the word rabbit.

A very common articulation error for r is gliding, where r is replaced with w. So the child says:

"wabbit"

What speech-to-text hears

Modern speech-to-text systems are built to recover the word a speaker most likely intended, not to report the exact sounds that were produced. In effect, they smooth over pronunciation errors. So despite the incorrect sound production, the system outputs:

"rabbit"

and the tool responds:

"Correct! Great job!"

But clinically, this is inaccurate. The child did not produce the r sound. They produced w. The tool has just reinforced the exact error the child is trying to fix.

This is a false positive — incorrect speech marked as correct.

Example 2: "Marry" vs. "Merry" and the Problem of False Negatives

Now let's look at a different — and equally damaging — failure case. Say a child is asked to say the word marry while working on r.

The child produces a correct r sound. From a speech therapy perspective, this is a success.

However, many speech-to-text systems struggle to reliably distinguish between:

  • marry
  • merry
  • Mary

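In many North American accents these three words are pronounced identically, and even where speakers do distinguish them, speech models often merge or normalize them, especially in children's speech. So the model outputs: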

"merry"

and the tool responds:

"Incorrect. Try again."

But clinically, this feedback is inaccurate. The child did produce the target r sound correctly. And now instead of receiving positive reinforcement, they are told they failed.

This is a false negative — correct speech marked as incorrect.

Over time, this kind of feedback isn't just confusing — it's discouraging. And it teaches children to distrust their own progress.

The Core Problem

Speech-to-text evaluates words.
Speech therapy evaluates sounds.

These are not the same thing.

  • A word can be spelled correctly even when the sound was inaccurate
  • A word can be spelled incorrectly even when the sound was right

When speech therapy tools rely on spelling as a proxy for sound production, both types of errors happen constantly.

This distinction is foundational in speech-language pathology — and it's where most "AI speech therapy" products fall apart.
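A small sketch shows how the two kinds of check can disagree. The phoneme lists here are hand-written, simplified labels chosen for illustration; getting reliable sound labels from real audio is the hard part and is not shown. The function names are hypothetical.

```python
def word_level_correct(transcript: str, expected_word: str) -> bool:
    """What a generic STT pipeline checks: does the text match?"""
    return transcript == expected_word


def sound_level_correct(produced: list[str], target_sound: str, position: int) -> bool:
    """What therapy cares about: was the target sound produced where expected?"""
    return position < len(produced) and produced[position] == target_sound


# (STT transcript, expected word, sounds actually produced, target, position)
attempts = [
    ("rabbit", "rabbit", ["w", "ae", "b", "ih", "t"], "r", 0),  # child said "wabbit"
    ("merry", "marry", ["m", "eh", "r", "iy"], "r", 2),         # child said r correctly
]

for transcript, expected, produced, target, position in attempts:
    print(
        expected,
        "| word check:", word_level_correct(transcript, expected),
        "| sound check:", sound_level_correct(produced, target, position),
    )
```

In the first attempt the word check passes while the sound check fails; in the second it is the other way around.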

How Spoken Is Built Differently

At Spoken, we didn't start with speech-to-text and try to retrofit it for therapy. Instead, we built our technology around the sounds themselves.

Our system uses machine learning and AI pipelines that are:

  • Tailored to specific target sounds
  • Trained to identify clinically meaningful speech error patterns, including gliding, vowelization, and distortions
  • Developed alongside licensed speech-language pathologists

So instead of asking "Did the model recognize the word?", we ask:

  • Was the target sound present?
  • How was it produced?
  • Which error pattern (if any) occurred?

This allows us to deliver feedback that is clinically accurate, consistent, and actually helpful for learning.
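As a rough illustration of what answering those three questions could look like in code, here is a simplified, rule-based sketch for the r sound. It is not a description of Spoken's models, which work from audio. It assumes a label for the sound the child produced is already available, and the `SoundAssessment` type, the `assess_r` function, and the sound sets are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative, incomplete sets of substitute sounds (assumptions).
GLIDES = {"w", "y"}
VOWELS = {"ah", "uh", "oh"}


@dataclass
class SoundAssessment:
    target_present: bool           # Was the target sound present?
    produced_as: str               # How was it produced?
    error_pattern: Optional[str]   # Which error pattern (if any) occurred?


def assess_r(produced_sound: str) -> SoundAssessment:
    """Classify one attempt at the r sound from an already-known sound label."""
    if produced_sound == "r":
        return SoundAssessment(True, "r", None)
    if produced_sound in GLIDES:
        pattern = "gliding"
    elif produced_sound in VOWELS:
        pattern = "vowelization"
    else:
        # Distortions cannot be told apart from a label alone.
        pattern = "unclassified"
    return SoundAssessment(False, produced_sound, pattern)


print(assess_r("w"))
```

The point of the structure is that feedback can name what happened ("that sounded like a w") instead of only saying right or wrong.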

When Spoken Isn't Sure, We Don't Guess

There's another important difference — and it's one that matters deeply for kids: if Spoken's AI isn't confident, we don't give corrective feedback.

In real speech therapy, a speech-language pathologist (SLP) doesn't guess. If they're not sure what they heard, whether because the word was rushed, the room was noisy, or the sound wasn't clear, they won't confidently say "right" or "wrong." They might ask the child to try again, rephrase the word, or move on.

Spoken follows that same principle.

Our system is designed to recognize when it isn't confident enough in its assessment of a user's sound production. When that happens, we intentionally avoid labeling the response as accurate or inaccurate.

Why? Because wrong feedback is worse than no feedback.

  • Praising an inaccurate sound reinforces the error
  • Rejecting a correct sound undermines confidence

Instead, when Spoken isn't sure, we may:

  • Encourage the child to try again
  • Offer neutral support
  • Move forward without reinforcing an uncertain judgment

This keeps practice positive, accurate, and supportive, just like in a real therapy session.
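The "don't guess" rule can be sketched as a simple confidence gate. This is a minimal sketch under stated assumptions: the function name, the feedback labels, and the 0.8 threshold are all hypothetical, and a real system would need a well-calibrated confidence score for the gate to mean anything.

```python
def choose_feedback(judged_correct: bool, confidence: float, threshold: float = 0.8) -> str:
    """Pick a feedback type, abstaining when the model is not confident.

    `confidence` is assumed to be a score between 0 and 1.
    The default threshold is an arbitrary illustrative value.
    """
    if confidence < threshold:
        # Not sure enough: give no right/wrong label at all.
        return "neutral"
    return "praise" if judged_correct else "corrective"


print(choose_feedback(judged_correct=True, confidence=0.95))   # praise
print(choose_feedback(judged_correct=False, confidence=0.55))  # neutral
```

Note that the low-confidence branch comes first, so an uncertain judgment never reaches the child as praise or correction.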

We Also Know When It's Time to Move On

Spoken is also designed to recognize when a child needs a break. If a user struggles with the same word multiple times in a row, Spoken doesn't keep pushing it over and over again.

In a real therapy session, an SLP would rarely ask a child to repeat the same word once frustration starts to build. They might switch to a different word, change the activity, or come back to it later. Spoken follows that same approach.

When our system detects repeated difficulty with a specific word or prompt, we will do one of the following:

  • Move on to a different word that targets the same sound
  • Adjust the activity or difficulty level
  • Take a short break and revisit it later

This helps prevent frustration, protects confidence, and keeps practice engaging — especially for younger children.
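One way to picture this logic is a counter of misses in a row for each word. The sketch below is hypothetical: the class name, the action labels, and the limit of three misses are assumptions made for illustration, since the right limit is a clinical judgment.

```python
from collections import defaultdict


class PracticeTracker:
    """Tracks consecutive misses per word and signals when to move on."""

    def __init__(self, max_misses_in_a_row: int = 3):
        self.max_misses_in_a_row = max_misses_in_a_row
        self.misses = defaultdict(int)  # word -> current streak of misses

    def record(self, word: str, success: bool) -> str:
        """Record one attempt and return the suggested next action."""
        if success:
            self.misses[word] = 0
            return "continue"
        self.misses[word] += 1
        if self.misses[word] >= self.max_misses_in_a_row:
            self.misses[word] = 0  # start fresh if the word is revisited later
            return "move_on"
        return "retry"


tracker = PracticeTracker()
for _ in range(3):
    action = tracker.record("rabbit", success=False)
print(action)  # move_on
```

What "move_on" leads to (a different word with the same sound, an easier activity, or a break) is left to the caller.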

Progress doesn't come from forcing repetition. It comes from the right challenge at the right moment. And since children learn through feedback, that feedback needs to be correct so progress doesn't slow or stop entirely.

Spoken is designed to support therapy — not undermine it.

Best-in-Class, By Design

Building clinically accurate AI is more difficult than adapting generic speech-to-text solutions. It requires:

  • Sound-specific models
  • Specialized training data
  • Deep collaboration with clinicians
  • Careful handling of uncertainty
  • And a willingness to slow down when a child needs it

We chose this path intentionally.

At Spoken, we're proud to be setting a higher standard for what AI in speech therapy can be — one built on trust, safety, and real clinical understanding.
