Case study
How I taught an AI to think like a musician
Producers don’t talk about their music in generalities. They say “the bass is sitting too low right here,” “the chorus feels crowded,” “this melody isn’t landing in the pocket.” Getting an AI to actually meet that standard — to reason about a specific eight-bar window of a specific track, not about music in the abstract — turned out to require more than a smart prompt.
This is a walkthrough of what I built, what broke, and what I learned. The main text tells the full story; the sections marked ▸ technical detail carry the engineering specifics.
The product
What StemSense does
Upload a track and the pipeline separates it into six stems — vocals, bass, drums, guitar, piano, and other — then runs analysis at every level: key, tempo, section boundaries, chord timelines, note events, register, rhythmic patterns. All of it happens in the background before the interface loads.
The core interaction is simple: draw a region over any stem in the timeline, ask a question. The system answers from measurements taken off that specific selection. Not a summary of the whole track, not music theory defaults — the actual notes in that bar, on that stem, in that window.
Generation works the same way. Ask for a bass line that fits a specific chorus and the system produces options that have already been scored against the track’s key, register, and rhythm — returned as playable audio, notation, and MIDI.
▸technical detail — analysis pipeline
Source separation uses Demucs (htdemucs_ft). All downstream analysis runs on separated stems, not the mix — this significantly improves tonal analysis quality by removing bleed from other instruments.
Feature extraction uses Essentia:
- Key/scale via three-algorithm consensus: EDMA, Krumhansl, Temperley
- Chord timeline with frame-accurate timestamps via HPCP chroma analysis
- Beat positions, meter, and per-beat loudness via beat tracking
- Per-stem note events carrying: time, duration, pitch, velocity, beat subdivision, and onset-to-beat offset in milliseconds
- Per-stem: onset rate, spectral centroid, energy contour, pitch range, voiced ratio
Vocal content uses Whisper for lyric transcription and CLAP (laion/larger_clap_music_and_speech) for semantic tagging — comparing audio against natural-language descriptors to surface stylistic context that MIR features alone can’t capture.
All of this runs asynchronously at upload time. Window-level analysis only runs at query time, on the selected region.
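To make the upload-time versus query-time split concrete, here is a minimal sketch of a per-stem note event and the window slice applied at query time. The field names follow the list above, but the class name, the `events_in_window` helper, and the meaning of `beat_subdivision` (0 for on the beat) are assumptions for illustration, not the actual StemSense schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteEvent:
    time: float            # onset time in seconds from track start
    duration: float        # note length in seconds
    pitch: int             # MIDI note number
    velocity: int          # 0-127
    beat_subdivision: int  # assumed convention: 0 = on the beat
    beat_offset_ms: float  # signed onset-to-beat offset in milliseconds


def events_in_window(events, start, end):
    """Keep only events whose onset falls inside [start, end) seconds."""
    return [e for e in events if start <= e.time < end]


# Hypothetical bass events computed once at upload time.
events = [
    NoteEvent(0.50, 0.25, 40, 96, 0, -3.0),
    NoteEvent(4.10, 0.25, 43, 90, 2, 5.5),
    NoteEvent(9.75, 0.50, 45, 88, 0, 1.2),
]

# At query time only the selected region is analysed.
window = events_in_window(events, 4.0, 8.0)
```

Everything downstream of the slice then sees only the events inside the selection, which is the property the rest of this case study depends on.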
The quality problem
Accurate isn't the same as useful
Once the pipeline was working, the first round of real responses exposed something unit tests can’t catch. The model was accurate. It just wasn’t useful.
Ask “do the bass and drums lock together?” and it would describe the chord progression. Ask about a rhythmic passage and it would lead with harmony. Select a single stem and it might decline entirely — treating the selection as too narrow to discuss. Select eight bars and it would generalize about the whole track. None of it was invented. Factually defensible, musically useless.
One trace made it undeniable. A user asked about the groove in the intro of Track C — a window where both stems had already been flagged empty by the analysis layer. The response came back with specific density figures, a tension value, and chord function analysis. The data it used was real, pulled from track-level summaries. It just had nothing to do with what was selected.
“The analyzed segment, spanning the ‘start’ and ‘intro’, is set in [key] at a moderate tempo of [BPM]… The bass line exhibits moderate density (2.4 onsets/s) and is characterized by a ‘high tension’ value, suggesting chromaticism or non-diatonic movement… The exact function of the [chord] (V or V/VI) is context-dependent…”
Same question, two system states: “Do the bass and drums lock together in this section?”
Without an interpretive frame · Track C intro · bass + drums · stems flagged empty
“The analyzed segment is set in [key] at a moderate tempo of [BPM]. The bass line exhibits moderate density and is characterized by a high tension value, suggesting chromaticism or non-diatonic movement.”
“The [chord] function is context-dependent and may indicate a secondary dominant or modal interchange relationship with the tonic. The progression moves through the following chords…”
With pulse frame active · Track A intro · bass + drums · groove window
The frame directive: "Rhythmic identity is the primary anchor. Lead with timing, groove, and how the parts interlock — not with harmonic content."
“The kick and bass are locked tight through this window — bass onsets land within a few milliseconds of the downbeat consistently, and both parts are sparse enough to leave room between hits.”
“The rhythmic feel is settled and deliberate, not busy. Nothing is rushing.”
▸technical detail — the failure taxonomy
The trace above reveals the core architectural issue: the LLM was receiving track-level feature summaries and responding as if they were window-level observations. The analysis layer had correctly flagged the window:
{
  "heuristic_id": "H004_cross_stem_lock_score",
  "status": "empty_slice",
  "score": null,
  "notes": "No audio/note events in selected slice"
}
I catalogued four failure modes:
- Out-of-window drift — track-level summaries passed as if window-specific, producing responses about the song at large when a specific 8–32 bar window was selected.
- Chord-motion overfocus — harmony-first responses regardless of whether the interesting feature was rhythmic, timbral, or structural.
- Single-stem rejection — the discuss gate was calibrated too conservatively; a single piano stem with rich harmonic and timing evidence over a 30-second window would be declined as “too sparse.”
- Broad-question abstention — “what’s happening here?” collapsed into generic disclaimers rather than a focused narrative overview.
The root in all four cases: the model received feature data but no structured guidance on what kind of musical moment this was and which features mattered for it.
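One way to keep an empty window from being narrated with track-level data is to filter heuristic results before they reach the model. The sketch below assumes each result is a dict shaped like the H004 example above; the `"ok"` status value and the `window_state` field are hypothetical names, since only `empty_slice` and `missing_input` appear in the traces.

```python
def usable_heuristics(results):
    """Drop results that carry no window-level evidence."""
    return [
        r for r in results
        if r.get("status") == "ok" and r.get("score") is not None
    ]


def build_context(results):
    """Build the model context, stating explicitly when the window is empty."""
    usable = usable_heuristics(results)
    if not usable:
        # Nothing measurable in the selection: say so instead of
        # falling back to whole-track summaries.
        return {"window_state": "empty", "heuristics": []}
    return {"window_state": "has_evidence", "heuristics": usable}


flagged = [{
    "heuristic_id": "H004_cross_stem_lock_score",
    "status": "empty_slice",
    "score": None,
    "notes": "No audio/note events in selected slice",
}]
context = build_context(flagged)
```

With a guard like this, the Track C intro would reach the model marked as empty rather than carrying density and tension figures from elsewhere in the song.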
The investigation
I ran an experiment, not a vibe check
Rather than iterate on prompts until things felt better, I treated it as a measurement problem. I wrote a hypothesis, defined a scoring rubric, picked five test tracks that each stress-tested a specific failure mode, annotated specific windows by hand before changing anything, and scored baseline outputs against that rubric first.
The tracks were chosen deliberately. Track A: groove and pocket without harmonic complexity. Track B: syncopated cross-stem interaction and phrase-level call-and-response. Track C: form and energy arc across a verse-to-chorus boundary. Track D: call-and-response in a live recording with real room ambience. Track E: compound meter in 6/8, with section boundaries that don’t fall on clean phrase lines — an original WIP.
The win criterion was concrete: objective metrics had to improve on at least three of five samples, the rubric composite had to improve on at least 60% of annotated windows, and neither hallucination rate nor scope violations could increase. “Feels better” without rubric movement didn’t count.
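The win criterion is simple enough to write down as a single check. This is a sketch with hypothetical parameter names; the thresholds are the ones stated above, and a delta is taken to mean "after minus baseline".

```python
def meets_win_criterion(samples_improved, windows_improved, windows_total,
                        hallucination_delta, scope_violation_delta):
    """Return True only if all three conditions of the win criterion hold."""
    # Objective metrics must improve on at least three of the five tracks.
    objective_ok = samples_improved >= 3
    # Rubric composite must improve on at least 60% of annotated windows.
    rubric_ok = windows_total > 0 and windows_improved / windows_total >= 0.60
    # Neither hallucination rate nor scope violations may increase.
    no_regression = hallucination_delta <= 0 and scope_violation_delta <= 0
    return objective_ok and rubric_ok and no_regression
```

Writing it as a conjunction makes the point of the criterion explicit: a run that improves the rubric but raises the hallucination rate still fails.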
▸technical detail — experiment design
Primary hypothesis: a deterministic narrative substrate — salience weighting, tension arc, interaction labels, strict window scope enforcement — will improve musician-like explanation quality compared to routing-only context.
Each run used three fixed prompts across all five tracks and all annotated windows: (1) “What is happening in this section?” (2) “Describe the musical narrative in this selected window.” (3) “How do the selected stems interact here?”
Objective metrics per run:
- Scope violation rate — events mentioned outside the selected window
- Chord-dominance ratio — harmony-only claims ÷ total claims
- Cross-stem interaction density — explicit relational claims per response
- Uncertainty discipline — uncertain cases acknowledged vs. guessed
- Narrative axis coverage — presence of tension-release, phrasing, and interaction angles
Human rubric (1–5 each): in-window fidelity, musician readability, tension/release coherence, melody/rhythm integration, cross-stem narrative coherence, foreground anchoring clarity.
Per-window annotation fields: foreground stem, tension arc type (grounded / searching / intensified / release / unresolved), arc peak timestamp, interaction type (support / contrast / counterline / call-response / doubling), groove label, tension level, and free-text notes. Each annotation carried a confidence score. The annotation schema served as the ground truth for all calibration evaluation.
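Two of the objective metrics reduce to simple ratios once a response has been broken into claims. The sketch below assumes a hypothetical claim format, a dict with a set of `dimensions` and an optional `time` in seconds; how responses are split into claims is not specified here and is outside the sketch.

```python
def chord_dominance_ratio(claims):
    """Harmony-only claims divided by total claims."""
    if not claims:
        return 0.0
    harmony_only = sum(1 for c in claims if c["dimensions"] == {"harmony"})
    return harmony_only / len(claims)


def scope_violation_rate(claims, start, end):
    """Share of time-stamped claims that point outside [start, end)."""
    timed = [c for c in claims if c.get("time") is not None]
    if not timed:
        return 0.0
    outside = sum(1 for c in timed if not (start <= c["time"] < end))
    return outside / len(timed)


# Hypothetical claims extracted from one response about a 10-20s window.
claims = [
    {"dimensions": {"harmony"}, "time": 12.0},
    {"dimensions": {"rhythm", "harmony"}, "time": 14.0},
    {"dimensions": {"rhythm"}, "time": 45.0},
    {"dimensions": {"harmony"}, "time": None},
]
dominance = chord_dominance_ratio(claims)
violations = scope_violation_rate(claims, 10.0, 20.0)
```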
First intervention
Routing: point the AI at what matters
The first real fix was a deterministic routing layer. Before the AI sees any data, a rules-based system classifies the intent behind the question and the character of the selected audio, then decides which musical dimensions are actually relevant.
Bass and drums with a groove question routes through rhythm evidence. A single melodic stem routes through pitch and phrase-shape. A broad “what’s happening here?” routes through a narrative overview path that accounts for which stem is in the foreground and what structural role the window plays.
Two things improved: the rubric moved, and the system became traceable. Every response came with a logged routing decision, so when something came out wrong there was a place to look.
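A rules-based router of this kind can be very small. The following is an illustrative sketch, not the production rule set: the keyword lists, route names, and the set of stems treated as melodic are all assumptions. It returns the reason alongside the route so that every decision can be logged.

```python
import re

RHYTHM_WORDS = {"groove", "lock", "pocket", "rhythm", "tight"}
MELODIC_STEMS = {"vocals", "guitar", "piano", "other"}


def route(question, stems):
    """Pick an evidence route from the question text and selected stems."""
    q = question.lower().replace("\u2019", "'")  # normalise curly apostrophes
    words = set(re.findall(r"[a-z']+", q))
    stems = set(stems)
    if "what's happening" in q or "what is happening" in q:
        decision = ("narrative_overview", "broad question")
    elif words & RHYTHM_WORDS and {"bass", "drums"} <= stems:
        decision = ("rhythm_evidence", "groove question on bass + drums")
    elif len(stems) == 1 and stems <= MELODIC_STEMS:
        decision = ("pitch_and_phrase_shape", "single melodic stem")
    else:
        decision = ("narrative_overview", "no specific rule matched")
    return {"route": decision[0], "reason": decision[1]}


decision = route("Do the bass and drums lock together in this section?",
                 ["bass", "drums"])
```

Because the rules are deterministic, the same question on the same selection always takes the same route, which is what makes a wrong answer traceable.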
▸technical detail — heuristics H001–H015
The router uses a focus taxonomy of six intent categories and 15 deterministic heuristics computed from the audio analysis at query time. Several heuristics surfaced significant data integrity problems during development.
H001 — Waltz/Compound Pulse
Detects compound-meter feel by checking bass emphasis on the downbeat and harmonic events on the remaining beats, weighted against a syncopation penalty. Gated to only activate when the detected meter is triple or compound triple — correctly skipped entirely for 4/4 tracks.
H003 — Offbeat Density (initially dead)
Returned missing_input: Too few onset events for every single window — 11/11 failures in the first pass. Root cause: the underlying rhythm analysis tool only dispatched results for isolated (non-overlapping) single-stem brackets. Any multi-stem selection never produced a valid quantized onset pattern. Fixed by computing offbeat proportion directly from beat-subdivision data already present on each per-window note event record, bypassing the failing dispatch path.
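The replacement computation can be sketched in a few lines. This assumes the beat-subdivision value on each note event is an integer where 0 means "on the beat", and uses a hypothetical minimum event count; neither detail is confirmed above.

```python
def offbeat_proportion(subdivisions, min_events=4):
    """Share of notes in the window that do not fall on a beat.

    subdivisions: one beat-subdivision index per note event,
    assumed to be 0 for on-beat notes and non-zero otherwise.
    Returns None when there are too few events to say anything.
    """
    if len(subdivisions) < min_events:
        return None
    off = sum(1 for s in subdivisions if s != 0)
    return off / len(subdivisions)


# Eight notes from a hypothetical two-stem selection, merged.
proportion = offbeat_proportion([0, 2, 0, 2, 1, 0, 2, 0])
```

Because it reads fields already attached to the windowed note events, this works the same for single-stem and multi-stem selections.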
H004 — Cross-Stem Lock Score (corrupted data)
Combines onset coincidence rate, mean timing offset between stems, and offset jitter into a single lock score. The raw data feeding the offset calculation was wrong during calibration: Track E first verse showed a mean inter-stem offset of +4033ms; Track D piano intro showed −11502ms. Offsets of four and eleven seconds cannot describe two stems playing together; a real timing offset between locked parts is measured in milliseconds. Root cause: when a stem had no pitched notes (drums always hit this path), the code fell back to full-track onset times from the upload-time analysis summary, a list covering the entire song rather than the selected window. Injecting timestamps from outside the window made coincidence rates meaningless. Fixed by windowing onset times to the selected region before computing offsets.
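The windowing fix can be illustrated with a short sketch that computes the three lock-score inputs from onset lists. The nearest-onset matching, the 30 ms coincidence tolerance, and the function names are assumptions; how the three components are weighted into one score is not given, so the sketch stops at the components.

```python
from statistics import mean, pstdev


def window_onsets(onsets, start, end):
    """Restrict a full-track onset list to the selected region."""
    return [t for t in onsets if start <= t < end]


def lock_stats(onsets_a, onsets_b, start, end, tolerance_ms=30.0):
    """Coincidence rate, mean offset, and jitter of stem A against stem B."""
    a = window_onsets(onsets_a, start, end)
    b = window_onsets(onsets_b, start, end)
    if not a or not b:
        return None  # empty slice: nothing to compare
    offsets_ms = []
    for t in a:
        nearest = min(b, key=lambda u: abs(u - t))
        offsets_ms.append((t - nearest) * 1000.0)
    coincident = sum(1 for o in offsets_ms if abs(o) <= tolerance_ms)
    return {
        "coincidence_rate": coincident / len(offsets_ms),
        "mean_offset_ms": mean(offsets_ms),
        "jitter_ms": pstdev(offsets_ms),
    }


bass = [10.01, 10.51, 11.01]
drums_full_track = [0.0, 5.0, 10.0, 10.5, 11.0, 60.0]  # whole-song onset list
stats = lock_stats(bass, drums_full_track, 10.0, 12.0)
```

Windowing both lists first means onsets from elsewhere in the song never enter the offset calculation, and a stem with nothing in the selection yields `None` rather than a spurious multi-second offset.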
H010 — Tonal Stability (redesigned twice)
The original version combined track-level key strength and track-level dissonance into a single score. Result: a constant near-identical value for every window on the same track regardless of what was selected. This created a tonal floor that won the frame competition by default whenever other signals were weak.
The redesign replaced both inputs with two window-level components: (1) pitch-class concentration — how focused the note mass is on a small set of pitch classes within the window, and (2) phrase-repetition similarity — whether the pitch-class distribution is consistent across musical phrases in the window. Multiplying the two together made the score high only when a window is both tonally focused and internally consistent — not just because professional recordings always have some pitch focus. Score range expanded from a flat 0.49–0.52 to genuine variance across windows.
A competing-signal dampening step was also added: when strong rhythmic or vocal signals are present, the tonal score is proportionally reduced. This was necessary because any focused musical phrase — even a rhythmic one — has concentrated pitch content in the technical sense.
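The shape of the redesigned score can be sketched as follows. The specific measures are assumptions chosen for illustration: concentration as the share of notes in the three most common pitch classes, phrase similarity as mean cosine similarity between consecutive phrase histograms, and dampening as a linear reduction by a competing-signal strength in the range 0 to 1.

```python
from collections import Counter
from math import sqrt


def pc_histogram(pitches):
    """Count notes per pitch class (0-11) from MIDI note numbers."""
    counts = Counter(p % 12 for p in pitches)
    return [counts.get(pc, 0) for pc in range(12)]


def concentration(pitches, top_k=3):
    """Share of notes that fall in the top_k most common pitch classes."""
    hist = sorted(pc_histogram(pitches), reverse=True)
    total = sum(hist)
    return sum(hist[:top_k]) / total if total else 0.0


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def phrase_similarity(phrases):
    """Mean similarity of consecutive phrases; 0.0 with fewer than two."""
    hists = [pc_histogram(p) for p in phrases if p]
    pairs = list(zip(hists, hists[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def tonal_stability(phrases, competing=0.0):
    """Concentration times phrase similarity, reduced by competing signals."""
    all_pitches = [p for phrase in phrases for p in phrase]
    raw = concentration(all_pitches) * phrase_similarity(phrases)
    return raw * (1.0 - min(max(competing, 0.0), 1.0))


stable = [[60, 64, 67, 60], [60, 64, 67, 64]]    # two C major phrases
shifting = [[60, 64, 67, 60], [61, 63, 66, 68]]  # second phrase shares no pitch class
```

Because the two components are multiplied, a window that is concentrated overall but changes pitch content from phrase to phrase scores low, which is the behaviour the redesign was after.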
H014 — Call/Response Clarity (structural limitation)
This heuristic measured note-level alternation between stems — event-by-event stem switches in a merged onset timeline. The problem: genuine call-and-response is phrase-level alternation. Two stems trading four-bar phrases produce a low alternation score because within each phrase all events are from the same stem. Two stems with interleaved note onsets (drums + bass playing simultaneously) produce a high score despite having no call/response relationship. Track B verse-chorus (genuine call/response, vocal over keyboard) and Track C verse (not call/response, but vocal phrasing happens to interleave with bass) scored nearly identically. The heuristic was measuring note-level interleaving, not phrase-level handoff. The interaction frame was removed from the selectable set pending a phrase-level redesign.
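The limitation is easy to reproduce with a toy version of note-level alternation. The sketch below is a simplified stand-in for the heuristic, scoring the fraction of consecutive events in a merged timeline that switch stem, applied to two hypothetical patterns.

```python
def alternation_score(events):
    """Fraction of consecutive events in the merged timeline that switch stem.

    events: (onset_time_seconds, stem_name) pairs in any order.
    """
    ordered = [stem for _, stem in sorted(events)]
    if len(ordered) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return switches / (len(ordered) - 1)


# Genuine call-and-response: eight vocal notes, then eight keyboard notes.
trading = ([(i * 0.5, "vocals") for i in range(8)]
           + [(4.0 + i * 0.5, "keys") for i in range(8)])

# No call-and-response: drums and bass playing together, onsets interleaved.
interleaved = ([(i * 0.5, "drums") for i in range(8)]
               + [(i * 0.5 + 0.25, "bass") for i in range(8)])

trading_score = alternation_score(trading)
interleaved_score = alternation_score(interleaved)
```

The phrase-trading pattern has a single stem switch across sixteen events, while the interleaved groove switches on every event, so the measure ranks them in the opposite order from a listener.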
The deeper gap
Routing wasn't enough
The rubric composite improved. But the chord-motion failure persisted — even when the router correctly identified a rhythmically-driven passage, the model still led with harmony.
There’s a real difference between telling the AI what evidence to look at and telling it what kind of musical moment this is. A waltz with active chord changes should be narrated as style-consistent motion in triple meter — not as harmonic instability. A funk groove with the same harmonic content should be narrated as background color behind the rhythmic pocket. The evidence is identical. The interpretive frame is different.
Heuristics don’t communicate that. A frame does.
The frame selector
Giving the AI a point of view
The frame selector maps heuristic outputs to one of five interpretive frames and injects a short directive before the AI generates anything — not just data, but a specific lens for reading it. For the waltz problem, that directive was: “Do not interpret harmonic motion as instability unless timing evidence also weakens.”
The five interpretive frames, in order: pulse, tonal, form, foreground, micro_event
Rhythmic identity is the primary anchor.
Lead with timing, groove, and how the parts interlock — not with harmonic content.
"The kick and bass are locked tight here — bass lands consistently within milliseconds of the downbeat."
signal sources
Compound-pulse signature, backbeat ratio, cross-stem onset lock, microtiming tightness
Harmonic and melodic character define the moment.
Lead with key center, chord color, and how voices move — not with rhythm or energy.
"The piano stays close to the tonic cluster through this whole phrase — very stable, no harmonic tension."
signal sources
Pitch-class concentration, phrase-to-phrase tonal consistency, dampened when strong rhythmic signals present
This is a structural moment or transition.
Lead with section function — what came before, what changes here, and why it matters structurally.
"This is the verse-to-chorus break — everything opens up: more stems enter, energy jumps, rhythm densifies."
signal sources
Proximity to section boundary, section-to-section contrast in energy, density, and rhythmic character
A lead voice is the interpretive focus.
Lead with the lead — what it is doing, how it moves, how the other stems support or contrast it.
"The vocal is phrasing loosely against the beat here — landing slightly after the 1 each time for a laid-back feel."
signal sources
Vocal rhythm alignment to beat grid, secondary tonal contribution — only active when vocal data is present
A short, zoomed-in musical gesture.
Lead with the specific notes — what was chosen and why those choices matter in context.
"That passing tone on beat 3 — the F# over an Am chord — is carrying most of the tension in this two-bar figure."
signal sources
Window duration; score decays from high (very short) to low (approaching 30-second limit)
Calibrating against 12 annotated windows from the test tracks revealed how badly the initial version behaved. Starting accuracy was near zero. A single signal — a section-alignment detector — was dominating every evaluation window because every window had been drawn at a section boundary. It scored at maximum on nearly everything and, carrying too much weight, pushed the “form” frame above the winning threshold before any other signal had a chance.
When I reduced that signal, a second problem surfaced: a different heuristic had been returning a constant value for every window on the same track, winning by default whenever the now-weakened structural signal wasn’t dominant. Two independent failures, each masking the other.
The fix was to trace every heuristic back to its data source and confirm it was actually operating on the selected window — not on the full track. Once the data integrity problems were resolved, the system had real discriminating power. Final accuracy: 75% on positive cases, 100% on negative cases. The remaining mismatches are documented — mostly short windows that don’t contain enough signal to determine the frame unambiguously, which is a known and accepted limitation.
▸technical detail — calibration version history
| Change | Positive accuracy |
|---|---|
| v0 baseline — structural signal over-weighted, tonal floor active | ~0% |
| v1 — structural signal weight reduced | 27% (3/11) |
| v2 — short-window suppression added | 9% (regression from 27%) |
| H003 + H004 data integrity fixes | 9% |
| H013 energy data corrected (section energy now computed correctly) | 9% |
| H010 first pass: track-level floor replaced with window-level concentration | 9% — new floor emerged |
| H010 second pass: phrase-repetition component added | 18% |
| H010 third pass: competing-signal dampening applied | 36% |
| Short-window suppression threshold corrected | 45% |
| Manifest trimmed to ≤30s windows; interaction frame removed | 55–73% |
| Track A intro annotation corrected (annotation error, not system error) | 75% (9/12) |
▸technical detail — frame definitions and specific window examples
The five frames and their contributing signals:
| Frame | Primary signal sources |
|---|---|
| pulse | Compound-pulse signature, backbeat detection, cross-stem onset lock, microtiming tightness |
| tonal | Pitch-class concentration × phrase similarity, dampened by competing rhythmic/vocal signals |
| form | Proximity to detected section boundaries, section-to-section contrast (energy + density + rhythm delta) |
| foreground | Vocal rhythm alignment, secondary tonal contribution — only active when vocal data is present |
| micro_event | Window duration — score decays from high (very short windows) to low (approaching 30s limit) |
Specific window outcomes from calibration:
Track E — piano intro (13s) PASS
Human: tonal · Selector: tonal. Solo piano with no competing rhythmic or vocal signals. Tonal stability score very high (highly concentrated, consistent pitch content across phrases). Clean selection.
Track C — verse-to-chorus boundary (20s) PASS
Human: form · Selector: form. Section boundary falls roughly mid-window. Structural signal fired at maximum; section contrast score low but present after the energy data fix. Without the energy fix, contrast scored near zero because the verse→chorus energy jump — primarily a loudness/instrument-density shift — was not captured by density and rhythm metrics alone.
Track E — late chorus (30s) FAIL (accepted ambiguity)
Human: pulse · Selector: foreground. The 6/8 waltz feel was detectable from the compound-pulse heuristic, but the vocal alignment score pushed foreground above pulse. Root cause: the heuristics cannot perceive “rhythm asserting against vocal” — the listener’s sense that the rhythmic feel is the dominant character despite an active vocal. That perceptual relationship requires phrase-level analysis the current system doesn’t have.
Track D — piano intro (25s, live recording) FAIL (edge case)
Human: pulse · Selector: tonal. Tonal and pulse frame scores separated by a narrow margin — effectively a coin flip. Root cause: the live recording’s room ambience caused tonal stability to read very high (diffuse room sound creates consistent, concentrated pitch-class content that the pitch concentration measure can’t distinguish from intentional harmony). Documented as a known live-recording edge case.
Discuss mode pipeline, in order:
- User selection: stem name · start time · end time
- Analysis tools: rhythm, harmony, energy, motion, all computed against the selected window only
- Heuristics H001–H015: deterministic scoring (compound pulse, tonal stability, section contrast, vocal rhythm, cross-stem lock, and more)
- Frame selector: weighted scoring across heuristics → dominant frame + confidence + cautions
- Interpretive lens: short structured directive injected into LLM context before any text is generated
- LLM narrator: analyst call produces structured findings in plain English · translator call produces the response
- Response: grounded in the selected window · framed by the dominant musical character
The frame selector’s output includes: dominant frame, optional secondary frame, an interpretive lens directive (a short plain-English instruction for the LLM), a cautions list (e.g. “do not interpret harmonic motion as instability unless timing evidence also weakens”), and a confidence score. Windows exceeding 30 seconds return frame: unknown with a window_too_long caution rather than a low-confidence guess.
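A weighted selector with this output shape might look like the sketch below. The weights, the 0.4 threshold, the linear decay for micro_event, the use of the score margin as confidence, and the `low_signal` caution are all assumptions made for illustration; only the frame names, the 30-second limit, and the `window_too_long` caution come from the description above.

```python
MAX_WINDOW_S = 30.0

# Hypothetical weights per frame over heuristic scores in [0, 1].
FRAME_WEIGHTS = {
    "pulse": {"compound_pulse": 0.3, "backbeat": 0.2,
              "cross_stem_lock": 0.3, "microtiming": 0.2},
    "tonal": {"tonal_stability": 1.0},
    "form": {"boundary_proximity": 0.5, "section_contrast": 0.5},
    "foreground": {"vocal_rhythm": 0.7, "tonal_stability": 0.3},
}


def unknown(caution):
    return {"frame": "unknown", "secondary": None,
            "confidence": 0.0, "cautions": [caution]}


def select_frame(scores, window_s, threshold=0.4):
    """Pick the dominant frame, or 'unknown' with a caution."""
    if window_s > MAX_WINDOW_S:
        return unknown("window_too_long")
    totals = {}
    for frame, weights in FRAME_WEIGHTS.items():
        # The foreground frame is only active when vocal data is present.
        if frame == "foreground" and scores.get("vocal_rhythm") is None:
            continue
        totals[frame] = sum(w * (scores.get(name) or 0.0)
                            for name, w in weights.items())
    # Shorter windows score higher; assumed to decay linearly to 0 at 30s.
    totals["micro_event"] = max(0.0, 1.0 - window_s / MAX_WINDOW_S)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    if best_score < threshold:
        return unknown("low_signal")
    return {"frame": best, "secondary": second,
            "confidence": round(best_score - second_score, 3), "cautions": []}


groove_scores = {"compound_pulse": 0.1, "backbeat": 0.9, "cross_stem_lock": 0.9,
                 "microtiming": 0.8, "tonal_stability": 0.3,
                 "boundary_proximity": 0.2, "section_contrast": 0.1,
                 "vocal_rhythm": None}
result = select_frame(groove_scores, window_s=24.0)
```

Returning `unknown` with a caution for over-long windows, rather than a weak guess, keeps the narrator from being handed a lens the evidence does not support.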
The discuss architecture
Two agents, one job each
The frame selector solved what to say. It didn’t solve how to say it.
The original system fed heuristic scores and analysis data directly to the conversational LLM, which was expected to both interpret the data scientifically and communicate it naturally. I started calling the failure mode “talks about tool outputs” — responses that read like API output instead of musical commentary.
The fix was to split the job. A first model call — the analyst — takes all the evidence and produces a structured findings object in plain English. A second — the translator — takes only that object and writes the response. The analyst never speaks to the user. The translator never sees raw data.
▸technical detail — the analyst/translator boundary
The boundary object between agents is a typed findings structure. The translator receives this and nothing else from the analysis layer. A critical constraint on the analyst’s output: the plain-English summary field is limited to 2–3 sentences with no numbers, no internal identifiers, and no tool output names. Everything measurable stays in structured sub-objects; the summary is purely interpretive.
An example summary field:
“The bass and kick are loosely locked — bass is consistently rushing by ~15ms, which sits at the perceptible threshold. The chord progression is stable and tonal-center-focused with low tension. Melodic register is mid-range and moderately busy.”
The structured findings cover five musical dimensions: groove (lock quality, timing offset, feel label), harmony (key, chord motion, tension level and points, modal function), melody (contour, phrase structure, register, pitch range), texture (register relationship between stems, lead stem, masking risk), and dynamics (shape, energy level, transient character). The translator can access any of these by field but only the summary is required reading — the translator adapts its level of detail based on what the user asked.
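As a rough illustration of the boundary object, the sketch below models the findings as a dataclass with one dict per musical dimension and checks two of the summary constraints, sentence count and internal identifiers, before anything is handed to the translator. The class, the function names, and the identifier pattern are hypothetical; the real structure is described above only in prose.

```python
import re
from dataclasses import asdict, dataclass, field

# Matches heuristic identifiers such as H004 or H004_cross_stem_lock_score.
INTERNAL_ID = re.compile(r"\bH\d{3}(?:_\w+)?\b")


@dataclass
class Findings:
    summary: str
    groove: dict = field(default_factory=dict)
    harmony: dict = field(default_factory=dict)
    melody: dict = field(default_factory=dict)
    texture: dict = field(default_factory=dict)
    dynamics: dict = field(default_factory=dict)


def summary_problems(summary):
    """Return a list of constraint violations found in the summary text."""
    problems = []
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not 2 <= len(sentences) <= 3:
        problems.append("summary must be 2-3 sentences")
    if INTERNAL_ID.search(summary):
        problems.append("summary mentions an internal identifier")
    return problems


def translator_payload(findings):
    """The only object the translator receives from the analysis layer."""
    problems = summary_problems(findings.summary)
    if problems:
        raise ValueError("; ".join(problems))
    return asdict(findings)


findings = Findings(
    summary=("The bass and kick are loosely locked. The harmony is stable. "
             "The melody sits mid-range."),
    groove={"feel": "loose"},
)
payload = translator_payload(findings)
```

Validating at the boundary means a summary that leaks tool names is rejected before the translator can echo it to the user.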
Generation
The same lesson applied to output
The generation side had the same problem in a different form. A single model call received heuristic scores and the user’s request and was expected to both understand the musical context and produce fitting ideas in one pass. The outputs were inconsistent — wrong note counts, rhythms not anchored to the window, melodies that didn’t fit what was actually playing in those bars.
Same architecture, same fix. A deterministic layer assembles a structured directive — what to preserve, what to change, what constraints apply — and a separate model works from that directive alone. The interpreter doesn’t generate. The generator doesn’t interpret.
▸technical detail — generation directive and create mode pipeline
The generation directive is assembled deterministically — no model call. Heuristic scores combine with intent keywords from the user’s request to produce a set of generation hints. The mapping is rule-based: a low offbeat density score combined with a “funky” or “groove” keyword produces a hint like “add notes at offbeat positions; 16th-note subdivision preferred.” A high register-overlap score combined with “stand out” produces a hint to push the target stem into a higher register range. The generator LLM receives these hints rather than raw scores.
The directive also carries: the generation route (surgical fix / melodic rewrite / free composition / density change / register shift / harmonic rewrite / borrow from source / compose over source stems / rhythm transfer), the current notes in the window in the same schema the generator outputs — giving the generator a concrete reference state regardless of how much is being changed — and hard constraints including the key, allowed pitch classes, and a beat-count ceiling the generator cannot exceed.
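The rule-based mapping and the hard constraints can be sketched together. The score names, the 0.3 and 0.7 thresholds, the note tuple layout, and the reading of the beat-count ceiling as "no note may end past this beat" are assumptions for illustration; the hint wording follows the examples above.

```python
def generation_hints(scores, request):
    """Turn heuristic scores plus intent keywords into plain-text hints."""
    text = request.lower()
    hints = []
    if (scores.get("offbeat_density", 1.0) < 0.3
            and any(k in text for k in ("funky", "groove"))):
        hints.append("add notes at offbeat positions; "
                     "16th-note subdivision preferred")
    if scores.get("register_overlap", 0.0) > 0.7 and "stand out" in text:
        hints.append("move the target stem into a higher register range")
    return hints


def constraint_violations(notes, allowed_pitch_classes, max_beats):
    """Return generated notes that break the hard constraints.

    notes: (midi_pitch, start_beat, length_beats) tuples.
    """
    return [
        n for n in notes
        if n[0] % 12 not in allowed_pitch_classes or n[1] + n[2] > max_beats
    ]


hints = generation_hints({"offbeat_density": 0.1}, "Make the bass funky")

A_MINOR = {9, 11, 0, 2, 4, 5, 7}  # pitch classes of A natural minor
generated = [(57, 0, 1), (58, 1, 1), (60, 7.5, 1)]
bad = constraint_violations(generated, A_MINOR, max_beats=8)
```

Checking the generator's output against the same constraints it was given catches out-of-key notes and overruns deterministically instead of relying on the model to comply.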
The original bug driving this redesign: the intent capture layer used a single field that was doing two jobs simultaneously — acting as a mode gate (does this trigger note generation?) and describing the conversation direction (what is the user trying to achieve?). The LLM would sometimes classify a discussion question as a generation request if the phrasing sounded like “what if.” This produced MIDI output when the user wanted analysis. The fix: mode comes from the UI only. Intent capture now produces typed structured output, not a free-text classification.
What I learned
LLM quality is an upstream problem
The most useful thing this process taught me isn’t specific to music.
When an AI-powered product produces bad output, the instinct is to rewrite the prompt. That almost never fixes a structural failure. Every meaningful improvement here came from changing what the model received — not how I asked it to respond. Routing, frame selection, the two-agent split for analysis and for generation. All pre-model. The model itself didn’t change across any of them.
What made these problems diagnosable wasn’t engineering knowledge. It was being able to say “this response is wrong because it’s describing normal harmonic motion in a waltz as instability” and knowing exactly why that’s wrong musically. Domain knowledge built the rubric. The rubric made the fix possible. That’s a pattern worth holding onto.
▸technical detail — the track-level vs. window-level problem
The recurring architectural failure had a single root cause: track-level data presented as if it were window-level observations. It appeared independently in three heuristics:
- H002 (backbeat) — all three Track C windows returned an identical value, including a quiet intro with no drums in the selection. The score was the drums stem’s overall track regularity, not the window-level beat.
- H009 (microtiming) — 9 out of 11 positive calibration windows returned values within a 0.004-point range, effectively identical. The fallback was using the stem’s pre-computed beat regularity from upload-time analysis. Professional recordings are consistently tight, but a value that is identical across all windows provides no discriminating signal between a rhythmically anchored section and any other.
- Original H010 (tonal stability) — constant near-identical values across every window of the same track. Key strength and dissonance were properties of the track overall, not of the window.
The diagnosis in one sentence from the design notes: “The two heuristics with the most form/tonal dominance are the two that measure track-level properties, not window-level properties.”
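This class of bug can be caught mechanically by checking whether a heuristic's score varies across windows of the same track. The sketch below is one possible diagnostic, with a hypothetical record format and thresholds; the 0.005 epsilon is chosen to flag ranges like the 0.004-point spread mentioned above.

```python
from collections import defaultdict


def flat_heuristics(records, min_windows=3, epsilon=0.005):
    """Find (track, heuristic) pairs whose scores barely vary across windows.

    records: (track, heuristic_id, score) tuples, one per evaluated window.
    """
    grouped = defaultdict(list)
    for track, heuristic_id, score in records:
        if score is not None:
            grouped[(track, heuristic_id)].append(score)
    return sorted(
        key for key, values in grouped.items()
        if len(values) >= min_windows and max(values) - min(values) < epsilon
    )


# Hypothetical calibration output for three windows of one track.
records = [
    ("Track C", "H002", 0.61), ("Track C", "H002", 0.61), ("Track C", "H002", 0.61),
    ("Track C", "H004", 0.20), ("Track C", "H004", 0.70), ("Track C", "H004", 0.90),
]
suspects = flat_heuristics(records)
```

A heuristic that shows up here is not necessarily wrong, but it is a strong hint that it is reading a track-level value rather than the selected window.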
From a research framing in the literature review: “The likely novel pieces in your system are: applying evidence-fusion and frame-selection specifically to stem-aware analysis, where roles and cross-stem interactions are central; using the meta-layer to drive a conversational LLM that produces human-readable musical narrative, rather than just scalar predictions or tags; treating ‘interpretive frames’ as first-class objects in routing and prompt construction, effectively turning MIR insights into a controllable narrative UX.”
What's next
Window-level stem roles
The frame selector knows what kind of musical moment is happening. What it doesn’t yet know is what role each stem is actually playing inside it.
Right now, drums is always “rhythm” and bass is always “bass” — assigned statically at upload time, regardless of what’s actually selected. In a primarily tonal passage, the piano might be carrying the melodic foreground while the guitar plays a rhythmic pad role. The system can’t say that yet, and the narration reflects it. Window-level stem role classification is the next layer — computing roles from what’s actually in the selection rather than from defaults set at upload time.
▸technical detail — stem role classifier design
The current static assignments: drums is always rhythm, bass is always bass, vocals are always melody (unless semantic tags indicate backing choir). The “other” stem uses track-level onset rate, spectral centroid, and note count to pick between melody, harmony, and rhythm. Confidence values are hardcoded, not data-derived.
Three gaps the next iteration addresses:
- Phrase contiguity — distinguishing a melodic lead (sustained phrase-grouped notes) from a rhythmic fill (isolated events) within the window. Requires segmenting notes by pause duration and testing whether the resulting groups have melodic contour.
- Relative register — each stem’s pitch position relative to the other selected stems in the window. A piano sitting above the other selected stems is likely in a lead role; the same piano sitting below them is likely in an accompaniment role.
- Cross-stem beat correlation — computing the actual onset offset between two stems (“bass enters consistently 12ms behind the kick”) rather than just whether they coincide. This requires beat-tracking timestamps as a phase anchor — which also unblocks the microtiming heuristic that currently falls back to track-level data.
The tools to compute all three already exist in the codebase. The work is wiring them into a window-level classification pass and integrating the result into the frame selector output.
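Two of the three gaps can be sketched with very little code. The following is a hypothetical illustration, not the planned classifier: phrases are split wherever the silence between notes exceeds an assumed 0.5-second gap, and relative register is reduced to ranking stems by median pitch within the window. Onsets are assumed to be sorted by time.

```python
from statistics import median


def split_phrases(onsets, durations, max_gap=0.5):
    """Group note indices into phrases separated by pauses longer than max_gap."""
    phrases, current = [], []
    previous_end = None
    for index, (onset, duration) in enumerate(zip(onsets, durations)):
        if previous_end is not None and onset - previous_end > max_gap:
            phrases.append(current)
            current = []
        current.append(index)
        previous_end = onset + duration
    if current:
        phrases.append(current)
    return phrases


def relative_register(stem_pitches):
    """Rank stems by median MIDI pitch within the window, highest first."""
    medians = {stem: median(p) for stem, p in stem_pitches.items() if p}
    return sorted(medians, key=medians.get, reverse=True)


phrases = split_phrases([0.0, 0.5, 1.0, 3.0, 3.5], [0.4, 0.4, 0.4, 0.4, 0.4])
ranking = relative_register({
    "piano": [72, 76, 79],
    "guitar": [52, 55, 59],
    "bass": [36, 40],
})
```

Both functions take only notes from the selected window, so the resulting roles can change from one selection to the next instead of being fixed at upload time.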