Voice In, Text Out

Jan Carlo Mityorn

How I stopped typing after thirty-five years at the keyboard

April 21, 2026

The keyboard has been the dominant interface for technical people for so long it stopped feeling like a choice. It just felt like what computers were.

The Hierarchy

There's always been a divide. On one side: people who interface through commands, shortcuts, and code. On the other: people who only use the graphical user interface those other people built. The keyboard sat at the center of this. If you could type instructions directly to the machine, you could make it do more than anyone who couldn't.

That hierarchy still exists. What's changed is the interface itself.

Voice Recognition Was Always a Joke

For decades, voice recognition was the technology that was always almost good enough. You'd use it, make the same correction four times, give up, go back to the keyboard. The error rate wasn't just annoying — it was high enough to make the whole premise feel like a bad idea dressed up as a feature.

AI fixed this. Modern speech-to-text models are trained specifically for transcription, and their accuracy is in a different category entirely. But that's not actually the important part.

The important part is what happens after the transcript. Even excellent transcription fails on uncommon words: specialized terms, names outside the dictionary. My last name, Mityorn, gets mangled in creative ways. A D instead of the T. A J instead of the Y. Sometimes something completely different, because it's not a word the model has learned to expect.

The solution is simple: show the recognized text before sending, let the user correct individual words, remember the corrections. The more you use it, the better it gets at your specific vocabulary — and your vocabulary is finite. The people you work with, the commands you return to, the concepts that define your particular workflow. The system learns your world. This is machine learning in the most literal sense, the same thing traditional voice software did years ago, but starting from a much better baseline.
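The correct-and-remember loop above fits in a few lines of code. This is a minimal sketch, not Sentio's implementation: `CorrectionMemory`, `learn_from_edit`, and the `min_count` threshold are hypothetical names, and the sketch assumes each misrecognition is a single word that maps one-to-one onto the user's fix. Python's standard `difflib` compares the transcript the user was shown with the version they edited, and every confirmed swap is tallied so it can be applied to future transcripts.

```python
from __future__ import annotations

import difflib
import re
from collections import Counter, defaultdict

# Tokens are runs of word characters and apostrophes; punctuation is left untouched.
WORD = re.compile(r"[\w']+")


class CorrectionMemory:
    """Hypothetical per-user store of word corrections (an illustration, not Sentio's code)."""

    def __init__(self, min_count: int = 1) -> None:
        # How many times a correction must be seen before it is applied automatically.
        self.min_count = min_count
        # Maps a lowercased misrecognized word to a tally of what the user replaced it with.
        self.table: defaultdict[str, Counter[str]] = defaultdict(Counter)

    def record(self, heard: str, meant: str) -> None:
        """Remember that the recognizer produced `heard` where the user meant `meant`."""
        self.table[heard.lower()][meant] += 1

    def learn_from_edit(self, recognized: str, corrected: str) -> None:
        """Diff the transcript shown to the user against their edit; record one-to-one word swaps."""
        a, b = WORD.findall(recognized), WORD.findall(corrected)
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
            # Only same-length replacements, so each wrong word maps to exactly one right word.
            if tag == "replace" and i2 - i1 == j2 - j1:
                for heard, meant in zip(a[i1:i2], b[j1:j2]):
                    self.record(heard, meant)

    def apply(self, transcript: str) -> str:
        """Rewrite a fresh transcript using the most frequent remembered correction per word."""
        def fix(match: re.Match[str]) -> str:
            word = match.group(0)
            tally = self.table.get(word.lower())  # .get() does not create empty entries
            if tally:
                meant, count = tally.most_common(1)[0]
                if count >= self.min_count:
                    return meant
            return word

        return WORD.sub(fix, transcript)


memory = CorrectionMemory()
memory.learn_from_edit("This is Jan Carlo Midyorn.", "This is Jan Carlo Mityorn.")
print(memory.apply("Send the draft to Midyorn."))  # Send the draft to Mityorn.
```

Raising `min_count` makes the memory more conservative: a swap is only applied automatically after the user has made the same correction several times. A misrecognition that spans several words (a surname heard as two words, say) falls outside this word-level sketch and would need phrase-level matching.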

A Brief History of Hating Your Own Voice

Long before mobile phones, there was an era built around the ringing phone. You'd call someone, it would ring, and if they weren't home, nobody answered. The first solution wasn't to untether the phone; that came later. The first solution was the answering machine.

Mechanically, it was simple: a tape recorder wired to the phone line. A call comes in, the machine picks up, plays a greeting, and records whatever you say. The person comes home, sees the blinking light, and hits play. It was a genuine breakthrough: the first time a phone call could reliably become asynchronous. Some people embraced it so completely that they'd actively hope no one would answer, so they could just leave the message and move on.

But a lot of people couldn't do it at all. They'd hear the beep and hang up.

Answering-machine anxiety was widespread enough to be a recognized cultural phenomenon. The cause is simple: most people have never actually heard their own voice until they hear a recording of it. And when they do, it's jarring. Your voice sounds completely different to other people than it does inside your own head, where you also hear it conducted through the bones of your skull, which makes it sound deeper and fuller. You hear yourself for a moment and think: is that really how I sound? The answer is yes, that is exactly how you sound, and for some reason this is profoundly unsettling.

I had this anxiety for years, and it never fully went away. I eventually learned to function with answering machines, but I remember the first time I heard a recording of myself rapping, and the fairly immediate decision not to pursue rap professionally. Probably the right call. I still write the lyrics. But the initial recoil from my own recorded voice was more powerful than whatever ambition I had to share them.

When voice messaging came to chat apps, I was slow to adopt it. The efficiency argument is airtight: speaking is faster than typing, and on a phone the gap is absurd. And yet.

The pattern is consistent enough across my life with voice recording that I can describe it in advance: resist, eventually try it, discover it's useful, wonder why I waited so long, feel briefly wise for having figured it out — and then repeat the whole cycle with the next iteration of the same technology. Each time I think I've learned the lesson. Each time the same resistance shows up wearing new clothes.

Robert Stack

My wife falls asleep to Unsolved Mysteries. Specifically to the narration of Robert Stack, whose voice she finds so soothing she's been using it as a sleep aid for years. At some point it occurred to me to make her a little AI chat companion in his voice.

As for how I got it to actually sound like him, I'll stay deliberately vague. If anyone from the estate is reading this, I'd like to note that it's purely for personal use, will never be published, and I didn't technically train on his actual voice. Regardless, it sounds convincing enough that my wife accepts it without question, and she is, by any reasonable standard, a connoisseur.

The point is that for this application, voice output was already the natural choice. Text would have defeated the purpose entirely. And once the output was voice, the input obviously should be too. So that's when I started building speech-to-text into Sentio properly: wiring it up and testing it in real use. It worked. My wife could speak to her own personal Robert Stack (or Bob, as she calls him), who knows what the weather is like where we live and can hold a light conversation. A good-enough, context-aware chat companion to fall asleep to.

And now Sentio had a voice input layer. Which meant I did too.

The Quest

I made a deliberate decision to force myself out of old habits. Just try voice prompting as the primary mode. Commit to it long enough to actually evaluate it.

The first attempts were awkward. Expected. What was less expected was how quickly the awkwardness gave way to something genuinely better. Two things pushed this along.

First, the transcripts from open speech-to-text models are excellent. Not perfect — the uncommon word problem is real — but good enough that the output is workable without heavy editing.

Second, and more importantly: large language models are remarkably good at extracting intent from a rambling transcript. You can say something wrong mid-sentence, immediately correct yourself — no wait, I meant something different — and the whole thing gets transcribed including the correction, and the agent just works through it. It reads the full mess and understands what you were actually trying to say. This is not a small thing. It means the bar for voice input isn't "produce a clean message." It's just "say roughly what you mean." Which is a bar that's very easy to clear.

A few weeks later: I don't type anymore.

The only keypresses my keyboard gets now are the ones I've configured as triggers: start recording, pause, send. I'm genuinely expecting those specific keys to wear down while the rest stay pristine, something I never would have predicted in over thirty-five years of frantic typing.
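Three trigger keys like these behave as a tiny state machine. The sketch below shows one way to wire it, assuming hypothetical key names (`record`, `pause`, `send`) and treating captured audio as opaque string chunks; `PushToTalk` and `State` are illustrative names, not how Sentio actually does it.

```python
from __future__ import annotations

from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RECORDING = auto()
    PAUSED = auto()


class PushToTalk:
    """Hypothetical three-key voice-prompt controller: record, pause, send."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.chunks: list[str] = []  # stand-in for audio captured for the current prompt
        self.sent: list[list[str]] = []  # prompts handed off for transcription

    def on_key(self, key: str) -> None:
        if key == "record" and self.state in (State.IDLE, State.PAUSED):
            self.state = State.RECORDING  # start a new prompt or resume a paused one
        elif key == "pause" and self.state is State.RECORDING:
            self.state = State.PAUSED  # stop capturing but keep what was said so far
        elif key == "send" and self.state is not State.IDLE:
            if self.chunks:
                self.sent.append(self.chunks)  # hand the whole prompt to transcription
            self.chunks = []
            self.state = State.IDLE

    def on_audio(self, chunk: str) -> None:
        # Audio only accumulates while recording; input while paused or idle is dropped.
        if self.state is State.RECORDING:
            self.chunks.append(chunk)
```

Keeping pause separate from send is what lets a single prompt span several bursts of speech: everything said between the first `record` and the final `send` goes out as one message.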

What's Still Typed

The keyboard hasn't disappeared entirely. Not everything runs through my voice system yet, so there are still tools that require typing. And there's a systems administration layer where the keyboard stays non-negotiable — when a server goes down and I need a direct terminal, I type. When Sentio itself is down, I type the Linux commands directly.

But the trajectory is clear. More and more of the workflow goes through voice. My estimate for the near future: most of my interactions with computers will be by voice.

The Next Version of Voice Messaging

Chat systems will soon offer a new kind of voice message: you record it, and the other person never hears it; instead, they receive a clean, processed transcript. This removes the psychological friction of having your actual voice heard by others and dramatically increases the speed at which information moves between people.

The evidence that this is coming is already everywhere: playback speed controls. Most major chat apps offer 1.5× and 2× playback. Nobody talks about this feature. Everyone uses it. We use it because voice messages feel slow, because we find ourselves waiting for the person to get to the point. If you can just read a clean transcript instead, that's obviously better. It's a no-brainer, and it's coming.

To My Fellow Keyboard Users

This is me speaking to fellow coders, writers, researchers — anyone who's spent years at a keyboard and thinks of it as the natural mode of talking to machines.

For me, bringing voice prompting into the workflow felt almost like breaking an addiction. The keyboard wasn't just a tool; it was the tool, the one I'd used for my entire adult working life. Getting past that required a conscious decision, and then a period of deliberate practice, and then one day the habit had simply shifted. It's not an instant transformation. It's a process. But once you start it, you won't go back. And just like an addiction, once you actually break it, you feel a sense of glorious relief.

Most tools already have a record button somewhere. Maybe you've never clicked it because you're used to typing. Click it. Send a voice prompt. Don't be shy.

This is the future of human-machine interfacing — until the neural connections arrive, which is a different conversation for another time. Until then:

Voice in, text out. Speak. Read.