I’ve always been the kind of person who pauses YouTube videos every thirty seconds to Google something. A math proof doesn’t make sense? Pause, tab over to ChatGPT, paste the timestamp context, lose my place in the video, get distracted by other videos, forget what I was watching. Rinse and repeat.
I wanted an AI tutor that lived *inside* the video—something that knew exactly what was on screen at 04:23, could hear the professor’s explanation in real-time during a livestream, and answer questions without interrupting the flow. Not a sidebar. Not a separate app. A ghost in the player.
That’s how the Yututor project was born. It’s a Chrome extension that overlays a translucent chat interface directly on YouTube videos, powered by a dual-mode architecture that handles both pre-recorded lectures and livestreams. Here’s how I built it, broke it, and rebuilt it.
I started with a naive assumption: I’ll just use `chrome.tabCapture` to grab audio from YouTube, send it to Whisper, and stream the transcript to an LLM. Simple, right?
Then I hit Manifest V3.
Chrome’s extension architecture is designed to be ephemeral. Service Workers—the background scripts that power modern extensions—terminate after 30 seconds of inactivity. They can’t maintain persistent WebSocket connections. They can’t directly access the DOM of the page they’re observing. And `tabCapture`? It requires active user consent every single time you want to capture a tab, unless you’re willing to get into the weeds with `chrome.offscreen` documents.
I spent three days trying to maintain a WebSocket connection from a Service Worker to my vLLM backend. The connection would drop exactly when the user needed it most—during a long, silent pause in a lecture. The solution was [Chrome’s Offscreen Documents API](https://developer.chrome.com/docs/extensions/reference/api/offscreen), a relatively new escape hatch that lets you spin up a hidden, headless page to run long-lived scripts.
```typescript
// background.ts
chrome.offscreen.createDocument({
  url: 'offscreen.html',
  reasons: ['WEB_RTC', 'USER_MEDIA'],
  justification: 'Maintain WebSocket connection for real-time transcription'
});
```

This offscreen document became my “server” inside the browser. It held the WebSocket connection to my vLLM Realtime API instance, buffered audio chunks, and communicated with the Service Worker via message passing. It felt hacky. It worked.
YouTube is two different platforms wearing the same skin. Pre-recorded videos have accessible transcripts (either auto-generated or uploaded). Livestreams have ephemeral HLS streams with no text fallback. Building for both required fundamentally different ingestion pipelines.
### VOD Mode (The Easy Path)
For recorded videos, I intercept YouTube’s `ytInitialPlayerResponse` object—a massive JSON blob that YouTube injects into the page containing video metadata and caption tracks. I extract the transcript URL, fetch the XML, parse it into a searchable array with millisecond-accurate timestamps, and store it in IndexedDB. When the user asks “What did he say about eigenvectors?”, I search the local cache using a sliding window approach, pulling the 30 seconds of context surrounding the current playback time.
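The sliding-window lookup is simple enough to sketch. This is illustrative only: the `Cue` shape and function names below are my assumptions, not YouTube's actual caption schema.

```typescript
// transcript.ts -- illustrative sketch; the Cue shape is an assumption,
// not YouTube's actual caption format.
interface Cue {
  startMs: number; // cue start time in milliseconds
  durMs: number;   // cue duration in milliseconds
  text: string;
}

// Return the text of every cue overlapping a window of `windowMs`
// centred on the current playback position.
function contextWindow(cues: Cue[], currentTimeMs: number, windowMs = 30_000): string {
  const lo = currentTimeMs - windowMs / 2;
  const hi = currentTimeMs + windowMs / 2;
  return cues
    .filter(c => c.startMs + c.durMs >= lo && c.startMs <= hi)
    .map(c => c.text)
    .join(' ');
}
```

Because the cues are cached locally in IndexedDB, this lookup costs nothing at query time: no network round trip, just an array scan over a few thousand entries.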
### Live Mode (The Hard Path)
Livestreams have no transcripts. I had to capture raw audio, transcribe it in real-time, and buffer the text for on-demand retrieval. This meant using `chrome.tabCapture` to get a `MediaStream`, piping it through the Web Audio API to resample to 16kHz PCM (Whisper’s sweet spot), and streaming chunks to my backend.
The catch: YouTube’s livestreams often have 10-30 seconds of latency. If a student asks a question based on something just said, the audio I’m capturing is technically “in the past” relative to the visual feed. I solved this by maintaining a circular buffer of the last 60 seconds of transcription, synchronized against the video’s `currentTime` property. When a query fires, I retrieve the text chunk closest to the visual timestamp, not the audio capture time.
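A minimal sketch of that rolling buffer, keyed on the video's `currentTime` rather than wall-clock capture time (class and field names here are mine, not the extension's actual API):

```typescript
// liveBuffer.ts -- illustrative sketch; names are assumptions.
interface Segment {
  videoTime: number; // video.currentTime (seconds) when the words were spoken
  text: string;
}

class TranscriptRing {
  private segments: Segment[] = [];
  constructor(private horizonSec = 60) {}

  push(seg: Segment): void {
    this.segments.push(seg);
    // Evict anything older than the retention horizon.
    const cutoff = seg.videoTime - this.horizonSec;
    this.segments = this.segments.filter(s => s.videoTime >= cutoff);
  }

  // Retrieve the segment closest to the *visual* timestamp the viewer
  // asked about, not the audio capture time.
  closestTo(videoTime: number): Segment | undefined {
    let best: Segment | undefined;
    for (const s of this.segments) {
      if (!best || Math.abs(s.videoTime - videoTime) < Math.abs(best.videoTime - videoTime)) {
        best = s;
      }
    }
    return best;
  }
}
```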
```typescript
// AudioWorklet processor for live capture
class Resampler extends AudioWorkletProcessor {
  process(inputs: Float32Array[][]): boolean {
    const input = inputs[0][0];
    // Downsample 48kHz → 16kHz, convert to Int16 for Whisper
    this.port.postMessage(this.downsample(input, 3));
    return true;
  }

  // Naive decimation: keep every `factor`-th sample, scale to 16-bit PCM.
  // (A production resampler would low-pass filter first to avoid aliasing.)
  private downsample(input: Float32Array, factor: number): Int16Array {
    const out = new Int16Array(Math.floor(input.length / factor));
    for (let i = 0; i < out.length; i++) {
      out[i] = Math.max(-1, Math.min(1, input[i * factor])) * 0x7fff;
    }
    return out;
  }
}
registerProcessor('resampler', Resampler);
```

The biggest UX challenge was overlaying an interface without breaking YouTube. YouTube’s CSS is a battle-tested fortress of specificity wars. Injecting a vanilla `div` meant getting styled into oblivion by YouTube’s aggressive reset sheets.
The solution was the [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM). I created a closed shadow root attached to a host element injected at the root of the page. Inside this shadow boundary, my Tailwind CSS classes lived in isolation, immune to YouTube’s global styles and safe from their JavaScript event listeners.
```typescript
const host = document.createElement('div');
host.id = 'yututor-host';
document.body.appendChild(host);

const shadow = host.attachShadow({ mode: 'closed' });
shadow.innerHTML = `
  <style>${tailwindStyles}</style>
  <div class="fixed bottom-4 right-4 w-96 ...">
    <!-- Chat UI here -->
  </div>
`;
```

I also had to handle YouTube’s Single Page Application (SPA) navigation. YouTube doesn’t reload the page when you click a new video—it swaps the content dynamically. My extension had to detect these navigation events using the `History API` and `MutationObserver`, tearing down and rebuilding the shadow DOM context for each new video without leaking memory.
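One way to sketch that teardown-and-rebuild cycle is below. Note the hedges: `yt-navigate-finish` is an event YouTube has historically dispatched on SPA navigations, but it is undocumented and could change at any time (hence the `popstate` fallback), and the overlay hooks are placeholders for illustration.

```typescript
// navigation.ts -- sketch; `yt-navigate-finish` is undocumented and may
// change. teardownOverlay/mountOverlay are placeholder names.
function teardownOverlay(): void { /* remove shadow host, close sockets */ }
function mountOverlay(videoId: string): void { /* rebuild UI for videoId */ }

function videoIdFromUrl(href: string): string | null {
  return new URL(href).searchParams.get('v');
}

let currentVideoId: string | null = null;

function onNavigate(): void {
  const id = videoIdFromUrl(location.href);
  if (id && id !== currentVideoId) {
    currentVideoId = id;
    teardownOverlay(); // dispose the old shadow DOM context first
    mountOverlay(id);  // then build a fresh one for the new video
  }
}

function watchNavigation(): void {
  // YouTube's own SPA navigation signal, plus the History API as a fallback.
  window.addEventListener('yt-navigate-finish', onNavigate);
  window.addEventListener('popstate', onNavigate);
}
```

Comparing video IDs, rather than firing on every event, keeps the rebuild idempotent when YouTube dispatches duplicate navigation events.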
## Multimodal Context: When Audio Isn’t Enough
Math lectures don’t live in audio alone. They live in the whiteboard. If a user asks “Can you explain that equation?” at 12:34, the AI needs to *see* the equation.
I implemented a frame extraction pipeline using the HTML5 Canvas API. When a query is triggered, the extension:
1. Pauses the video momentarily (optional, user-configurable)
2. Draws the current video frame to an offscreen canvas
3. Converts the canvas to a base64 PNG
4. Sends it alongside the transcript context to the vLLM backend
This required careful handling of CORS restrictions—YouTube’s video elements are often `cross-origin`, making canvas reads throw security errors. I had to ensure the video element had the `crossOrigin="anonymous"` attribute set before any frame capture attempts.
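The frame-grab step itself is short. A sketch, with function names of my own invention:

```typescript
// frame.ts -- illustrative sketch; function names are mine.
// Strip the data-URL header so only the raw base64 payload goes over the wire.
function stripDataUrlPrefix(dataUrl: string): string {
  const comma = dataUrl.indexOf(',');
  return comma >= 0 ? dataUrl.slice(comma + 1) : dataUrl;
}

function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d');
  if (!ctx) throw new Error('2D context unavailable');
  // drawImage taints the canvas (and toDataURL throws a SecurityError)
  // if the video is cross-origin without CORS headers.
  ctx.drawImage(video, 0, 0);
  return stripDataUrlPrefix(canvas.toDataURL('image/png'));
}
```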
vLLM’s Realtime API supports WebSocket streaming, sending token deltas as they’re generated. I wanted the UI to feel like ChatGPT—text appearing word-by-word rather than waiting for the full response.
In Svelte, this meant managing a reactive store that appends chunks as they arrive:
```typescript
// store.ts
import { writable } from 'svelte/store';
export const responseStream = writable('');

// In the WebSocket handler
ws.onmessage = (event) => {
  const delta = JSON.parse(event.data).choices[0].delta.content ?? '';
  responseStream.update(text => text + delta);
};
```

The challenge was handling the “jitter” of network latency. Audio transcription might lag 2-3 seconds behind the video, but the LLM response should reference the correct timestamp. I tag every context packet with the video’s `currentTime` at the moment of query initiation, not at the moment of transcription completion. This way, even if the pipeline is asynchronous, the AI’s answer is anchored to the correct visual moment.
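The anchoring trick boils down to snapshotting the timestamp before any async work begins. A minimal sketch, where the packet shape and function names are assumptions of mine:

```typescript
// context.ts -- illustrative; packet shape and names are assumptions.
interface ContextPacket {
  videoTime: number;  // video.currentTime at the moment the user asked
  transcript: string; // transcript text, which may arrive seconds later
  question: string;
}

async function queryWithAnchor(
  getVideoTime: () => number,
  getTranscript: () => Promise<string>, // transcription may lag 2-3 s
  question: string
): Promise<ContextPacket> {
  const videoTime = getVideoTime();         // snapshot BEFORE any async work
  const transcript = await getTranscript(); // the slow pipeline runs after
  return { videoTime, transcript, question };
}
```

Even if the playhead moves on while transcription catches up, the packet carries the timestamp of the moment the question was actually asked.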
## What I Learned (And What I’d Do Differently)
**Service Workers are not servers.** Treating them like long-running processes is a recipe for dropped connections and frustrated users. The Offscreen Documents API is still bleeding-edge and poorly documented, but it’s essential for any extension doing real-time media processing.
**The YouTube DOM is a moving target.** My selector for the video player broke three times during development as YouTube A/B tested new class names. I’ve since moved to semantic targeting—looking for `video` tags and specific ARIA attributes rather than minified class hashes.
**Latency matters more than accuracy.** In a tutoring context, a “good enough” answer delivered in 500ms is better than a perfect answer in 5 seconds. I optimized the context window to only send the last 30 seconds of transcript plus the current frame, rather than the full video history. The LLM has fewer tokens to process, and the user gets faster feedback.
Yututor is far from a polished product, but it works. I plan to keep building it out, and one day I may release it on the Chrome Web Store. I'm holding back the source code for now in case I do, but the results are promising and I'm excited about where it's headed.
I’m currently experimenting with WebGPU, which would let the extension run inference entirely offline for privacy-conscious users. I’m also exploring “Study Mode”—automatically generated Anki cards based on video chapters, extracted via the same transcript pipeline.
Yututor started as a tool to stop me from alt-tabbing. It ended up being a lesson in browser architecture, real-time systems, and the beautiful hackery required to bend a Chrome extension into a desktop-class application. If you’ve ever wanted to build something that lives in the cracks of the web—between the page and the browser—don’t let Manifest V3 scare you. The escape hatches are there. You just have to find them.