We run AI coding agents in parallel. Five, sometimes six at once, each working on a different part of a codebase. The problem isn't getting them to do work. The problem is knowing what's happening without staring at six terminal windows.
Terminal TUIs help. Slack notifications help. But both demand you look at a screen. Walk to the kitchen to make coffee and you're blind. Step outside and you miss the build failing. We wanted something that worked like actual radio: always on, always in the background, always reaching you wherever you are in the house.
So we built one.
What WRIT-FM taught us
The idea didn't come from nowhere. WRIT-FM is an AI-powered 24/7 internet radio station with five distinct DJ personas, time-of-day mood shifts, and a talk-first format. It generates long-form radio segments with Claude, renders them with Kokoro TTS, and mixes everything through Liquidsoap into an Icecast stream. Real radio, made by machines.
We didn't need the full station. We needed the pipeline. Music playing, a voice that could interrupt with useful information, and the whole thing running on one box in the home lab. WRIT-FM proved the stack worked. We just needed to point it at a different problem.
The three-channel model
The system mixes three audio channels into a single stream:
Music plays continuously from a curated library. Random shuffle, crossfade between tracks. This is the ambient layer, the texture of the session. You don't listen to it. You hear it.
Voice is the announcement channel. When an agent finishes a task, hits an error, or gets stuck, a webhook fires. The brain translates the event into a sentence, Kokoro renders it to speech, and Liquidsoap ducks the music and plays the voice clip. Five seconds later, the music comes back. You heard what happened without looking at anything.
Tones are the subtlest channel. Short sound effects, under two seconds, layered on the music at 25% volume with no ducking. Rising chime for agent start, resolved chord for completion, brief dissonance for failure. We'll get into why these specific sounds work later. After a day, you stop consciously hearing them. You just sense whether the session is active or quiet.
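The three channels above can be sketched as a small routing table. This is an illustrative sketch, not the project's actual code: the event fields, tone file names, and the `route` helper are all hypothetical, but the behavior follows the description (loud events get a voice clip, quiet events get a 25%-gain tone, and a completion or failure can trigger both).

```python
# Hypothetical sketch of the three-channel routing described above.
# Event kinds, file paths, and field names are illustrative.
from dataclasses import dataclass

@dataclass
class AgentEvent:
    agent: str       # e.g. "eng1"
    kind: str        # "start", "done", "error", "stuck", "idle"
    detail: str = ""

# Events important enough to duck the music and speak.
SPOKEN = {"done", "error", "stuck"}

# Quiet events get a short tone layered over the music instead.
TONES = {
    "start": "tones/rising_chime.wav",
    "done": "tones/resolved_chord.wav",
    "error": "tones/dissonance.wav",
}

def route(event: AgentEvent) -> list[dict]:
    """Turn one agent event into zero or more playback actions."""
    actions = []
    if event.kind in SPOKEN:
        actions.append({"channel": "voice",
                        "text": f"{event.agent}: {event.detail}"})
    tone = TONES.get(event.kind)
    if tone:
        # Layered on the music at 25% volume, no ducking.
        actions.append({"channel": "tone", "file": tone, "gain": 0.25})
    return actions

print(route(AgentEvent("eng1", "done", "auth refactor landed")))
```

Note that "idle" maps to nothing at all: the absence of tones is itself a signal.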
The tracer bullet
We didn't build each component to completion before moving on. We fired a tracer bullet: the thinnest possible path from curl POST to hearing a voice on the stream. About 80 lines of Python and 15 lines of Liquidsoap config. No rate limiter, no WAV validation, no graceful shutdown. Just enough to prove the pipeline worked end to end.
That first test, hearing "tracer bullet test" spoken over ambient music through a browser tab, was the moment we knew the architecture was right. Everything after that was widening: proper config loading, event templates, error handling, the dashboard.
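For flavor, here is what a tracer bullet of that shape might look like, using only the standard library. This is a reconstruction in the spirit of the ~80-line original, not the original itself; the endpoint path and payload match the curl example below, and the Kokoro/Liquidsoap hand-off is stubbed out.

```python
# Tracer-bullet sketch: curl POST in, "voice clip queued" out.
# The real pipeline renders the text with Kokoro and drops a WAV
# where Liquidsoap's voice queue picks it up; stubbed here.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def synthesize_and_queue(text: str) -> str:
    # Stub for TTS rendering + queueing into the stream mixer.
    return f"queued: {text}"

class AnnounceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/announce":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = synthesize_and_queue(payload.get("detail", ""))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": result}).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8001), AnnounceHandler).serve_forever()
```

No rate limiter, no validation, no graceful shutdown: exactly the point of a tracer bullet.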
If you want to try it yourself:
git clone https://github.com/nmelo/radioagent.git
cd radioagent
docker compose up -d
Then send an announcement:
curl -X POST http://localhost:8001/announce \
-H 'Content-Type: application/json' \
-d '{"detail": "Hello from Radio Agent"}'
You should hear it within a few seconds.
Why personality matters for TTS
Kokoro TTS generates speech in about 50 milliseconds on a GPU. Intelligible, natural enough that you don't cringe. But template-based announcements like "eng1 finished: auth refactor done" sound robotic when spoken aloud. Flat intonation, no rhythm, no warmth. After ten of those in a row, you want to turn it off.
So we wrote a DJ skill: a Claude Code skill file that rewrites template text into something that sounds like a person said it. "eng1 landed the auth refactor. Single commit, no test breakage. That's a Tuesday well spent." Same facts, different delivery. The personality rules are strict: measured pace, short sentences, dry humor that earns itself from the material. Never corny, never morning-DJ energy, never exclamation points.
The rules matter more than the style. We borrowed this from WRIT-FM's persona system, where each DJ has a detailed list of things they never do. "Never say amazing. Never use filler. Never be condescending about musical knowledge." Personality comes from constraints.
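The shape of a constraint-first persona prompt is easy to sketch. The rule list below is illustrative, not the project's actual skill file, but it shows the pattern: the never-rules travel with every rewrite request.

```python
# Sketch of a constraint-first DJ prompt. Rules are illustrative;
# the real skill file lives in the repo.
NEVER_RULES = [
    "Never use exclamation points.",
    "Never say 'amazing'.",
    "Never do morning-DJ energy.",
    "Never be corny.",
]

STYLE = "Measured pace. Short sentences. Dry humor that earns itself."

def dj_prompt(template_text: str) -> str:
    """Wrap a template announcement in the persona constraints
    before handing it to the LLM for rewriting."""
    rules = "\n".join(f"- {r}" for r in NEVER_RULES)
    return (
        "Rewrite this status line as a radio DJ would read it.\n"
        f"Style: {STYLE}\n"
        f"Hard constraints:\n{rules}\n\n"
        f"Status: {template_text}"
    )

print(dj_prompt("eng1 finished: auth refactor done"))
```

The design choice worth copying: constraints are appended to every single request, not assumed to persist, so no amount of context drift lets the DJ slip into exclamation points.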
The dangling string
Mark Weiser and John Seely Brown published a paper in 1996 called "The Coming Age of Calm Technology." In it, they describe the dangling string: an eight-foot plastic cord connected to a motor that responds to network traffic. Busy network, the string whirls. Quiet network, small twitches. Nobody watches it. Everyone knows the state of the network.
"We use 'periphery' to name what we are attuned to without attending to explicitly. By placing things in the periphery we are able to attune to many more things than we could if everything had to be at the center." (Weiser & Brown, 1996)
Radio Agent is our dangling string. Music is the string at rest. Announcements are the string whirling. You're not watching it. You're attuned to it.
The thing Weiser got right: information that moves between the periphery and center of your attention is calming. Information that parks itself at the center is exhausting. A notification badge that sits there all day demanding attention is the opposite of calm. A voice that speaks for five seconds and then disappears? That's calm.
The ambient tones channel takes this further. Events like "agent started" and "agent idle" never get spoken announcements; read aloud, they would be constant noise. But a soft chime at 25% volume? That's the dangling string twitching. After thirty minutes, the real test is: "Did you notice the tones?" The ideal answer is "not really, but I knew eng1 was busy."
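The contrast between the two behaviors comes down to a pair of gain rules: voice ducks the music, tones never do. A toy sketch, with made-up gain values standing in for the actual mixer settings (only the 25% tone gain comes from the post):

```python
# Toy model of the mixer's gain rules. Values are linear amplitude
# multipliers; the 0.2 duck level is a made-up placeholder.
def mix_gains(voice_active: bool, tone_active: bool) -> dict:
    return {
        # Music ducks under speech, otherwise plays at full volume.
        "music": 0.2 if voice_active else 1.0,
        "voice": 1.0 if voice_active else 0.0,
        # Tones layer on top at 25% and never duck anything.
        "tone": 0.25 if tone_active else 0.0,
    }

print(mix_gains(voice_active=False, tone_active=True))
```

The asymmetry is the whole design: ducking forces the event to the center of attention for five seconds; layering leaves it in the periphery.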
What's next
The station works. Music plays, voices announce, tones chime, the dashboard shows it all. But honestly, it's still a notification system with good audio design. The next phase is making it feel like an actual radio station.
We want scheduled programming. Different music moods by time of day: gentle ambient in the morning, something with more rhythm in the afternoon, deep drone for late night sessions. The DJ personality already shifts slightly by time of day, but the music should follow.
We also want better voices. Kokoro is fast and gets the job done. But models like Fish Speech and Orpheus 3B have more expressive prosody and emotion tags. We evaluated them and the quality jump is real. The problem is latency: 50ms vs 5-10 seconds per clip. Hard to justify when the current voice already fits the <10s end-to-end SLA. We'll revisit as the models improve.
The long-term vision? A full AI radio station. Named shows with different hosts. Station IDs and imaging. Cross-show continuity where DJs reference each other. Not because we need any of that for agent awareness, but because the infrastructure is there and it turns out building a radio station is just a good time.
The whole thing is open source. Check it out on GitHub.