What listeners actually hear
The most powerful delivery lever is the one a transcript can't see
Guyer, J. J., Fabrigar, L. R., & Vaughan-Johnston, T. I. (2019). Speech Rate, Intonation, and Pitch. Personality and Social Psychology Bulletin. See also Niebuhr et al. (charismatic speech, TED corpus) and Goupil et al. (2021), Nature Communications.
Ask people what makes a speaker sound confident and they point to the absence of fillers and "ums." The data points elsewhere. Pitch and intonation variation (vocal variety) is among the strongest predictors of perceived confidence, charisma, and persuasion. Niebuhr's analysis of the TED corpus found that the top talks carry roughly 30% more vocal variety than the average talk. And here is the catch: none of it shows up in a transcript.
| Cue | Visible in transcript? | Relative weight as a confidence cue |
|---|---|---|
| Pitch / intonation variation | No (acoustic) | 88 |
| Falling terminals on conclusions | No (acoustic) | 74 |
| Adequate volume & range | No (acoustic) | 62 |
| Few hedges ("I think", "sort of") | Yes | 55 |
| Filler rate | Yes | 38 |
Confidence is a bundle, not a single tell
Goupil and colleagues showed that perceived confidence is a composite: falling intonation, adequate volume, short response latency, low hedging, a moderate-to-fast rate, and few fillers, all read together. No single marker is "confidence," which is exactly why scolding each one separately misfires. The right move is to score the bundle and coach the highest-leverage piece, usually vocal variety.
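One way to picture "score the bundle, coach the highest-leverage piece" is a weighted average. The sketch below is illustrative only, not how Speech Away computes anything: it borrows the relative weights from the cue table earlier in this article, and the cue names, the 0-to-1 per-cue scores, and the example speaker are all hypothetical. It also covers only the five cues in that table, so response latency and speaking rate are left out.

```python
# Relative weights copied from the cue table in this article.
# The dictionary keys are hypothetical names chosen for this sketch.
CUE_WEIGHTS = {
    "pitch_variation": 88,
    "falling_terminals": 74,
    "volume_and_range": 62,
    "few_hedges": 55,
    "low_filler_rate": 38,
}


def bundle_score(cue_scores: dict[str, float]) -> float:
    """Weighted average of per-cue scores (each in [0, 1]); result is in [0, 1]."""
    total_weight = sum(CUE_WEIGHTS.values())
    weighted = sum(CUE_WEIGHTS[cue] * cue_scores[cue] for cue in CUE_WEIGHTS)
    return weighted / total_weight


def highest_leverage_cue(cue_scores: dict[str, float]) -> str:
    """Cue whose improvement to a perfect 1.0 would raise the bundle score most."""
    return max(CUE_WEIGHTS, key=lambda cue: CUE_WEIGHTS[cue] * (1.0 - cue_scores[cue]))


# Hypothetical speaker: tidy wording, flat delivery (assumed scores).
speaker = {
    "pitch_variation": 0.4,
    "falling_terminals": 0.7,
    "volume_and_range": 0.8,
    "few_hedges": 0.9,
    "low_filler_rate": 0.85,
}

print(round(bundle_score(speaker), 3))
print(highest_leverage_cue(speaker))  # -> pitch_variation
```

Under these assumed numbers the speaker barely hedges or fills, yet the largest available gain still sits in pitch variation, because leverage here is weight times shortfall rather than shortfall alone.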
| Talk group | Pitch/intonation range (relative, average talk = 100) |
|---|---|
| Top TED talks | 130 |
| Average talk | 100 |
What it means for Speech Away
A transcript cannot hear prosody, so a text-only "confidence" score is really a hedging-and-fillers proxy. That is why we send the audio itself to a multimodal model (Gemini), which can assess pitch variation, falling terminals, pace, and tone, not just words. Working from the audio unlocks the Tier-1 lever that a transcript-only pipeline is blind to: the actual sound of confidence.
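To make the "hedging-and-fillers proxy" concrete, here is a minimal sketch of what a transcript-only scorer can actually measure. It is not Speech Away's pipeline: the word lists, the function name `text_only_proxy`, and the sample sentence are assumptions made for illustration, and the lists are far from exhaustive.

```python
import re

# Illustrative word lists only (assumption); a real lexicon would be larger.
FILLERS = ("um", "uh", "er")
HEDGES = ("i think", "sort of", "kind of")


def count_phrases(phrases, text):
    """Count whole-word, case-insensitive occurrences of any listed phrase."""
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def text_only_proxy(transcript: str) -> dict:
    """Everything a transcript can say about 'confidence': fillers and hedges."""
    words = len(re.findall(r"[A-Za-z']+", transcript))
    fillers = count_phrases(FILLERS, transcript)
    hedges = count_phrases(HEDGES, transcript)
    # Markers per 100 words; guard against an empty transcript.
    rate = 100.0 * (fillers + hedges) / words if words else 0.0
    return {"words": words, "fillers": fillers, "hedges": hedges, "per_100_words": rate}


sample = "Um, I think we should, uh, sort of ship on Friday."
print(text_only_proxy(sample))
```

The function takes only a string, so a monotone reading and an animated reading of the same words get identical results. If the weights in the cue table are simply summed (an assumption about how they combine), the two transcript-visible cues account for 93 of 317 points, roughly 29% of the total.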