Turning a 47-minute briefing into seven 30-second clips.
An AI-assisted media pipeline that transforms long-form news video into short, vertical, social-ready clips. Transcript intelligence finds the story. Computer vision keeps the right subject in a 9:16 frame. Producers stay in the loop where it matters.
- Role: Backend pipeline + orchestration
- Team: Repurposing pipeline
- Stack: FastAPI, Celery, FFmpeg, Whisper, Remotion
- Surface: Producer dashboard, async APIs
- Status: Shipping
- Sanitized: Yes
One long broadcast becomes a set of standalone clips with their own thumbnails, captions, and metadata.
The pipeline normalizes the source, transcribes with word level timing, segments the transcript into news chapters, finds strong hooks, scores candidates for editorial usefulness, snaps cuts to real word boundaries, and reframes the result for mobile. Each stage is a separate job, observable on its own, retryable on its own, and easy to evolve without rewriting the whole flow.
- Long form to short form
- Transcript-aware selection
- Chapter segmentation
- Hook detection
- Word-boundary cuts
- Vertical reframing
- Thumbnail generation
- Caption styling
- Compilation formats
News editing repeats the same slow, manual loop for every broadcast.
A producer scrubs the full recording, identifies strong moments, crops to vertical, generates captions, picks a thumbnail, and exports several deliverables. Then they do it again for the next broadcast. The pipeline needed to assist with all of that while keeping human review possible where it mattered.
One editor, one clip, the same loop on every broadcast.
- Scrubbing through 45-minute recordings for hooks.
- Manual cropping for vertical.
- Captions and thumbnails done by hand.
- No easy way to package recaps or analysis.
One editor per story, with the pipeline doing the heavy lifting.
- Transcript-driven hook detection per chapter.
- Word-boundary-aware cut points.
- Scene-aware vertical reframing.
- Multiple compilation formats from the same source.
Multi stage pipeline design
Modeled ingestion, transcription, analysis, vision, and rendering as separate jobs with explicit handoffs.
Asynchronous orchestration
Coordinated long running media jobs through Celery so the upload API stays responsive and the worker fans out parallel work.
Transcript intelligence
Multi pass chaptering, hook detection, completeness checks, and scoring on top of word level transcripts.
Vision driven reframing
Face, person, motion, and scene signals merged into a smoothed crop track for vertical news output.
Editorial output formats
Standalone clips plus recap, political, and analysis compilation modes off the same pipeline primitives.
Producer review surface
Upload, progress, clip preview, metadata regeneration, subtitle styles, render, and download in the dashboard.
Responsive uploads, with the expensive work handed off to background workers.
The dashboard uploads source media to a FastAPI surface that returns a task id immediately. Celery workers run transcoding, pause removal, audio extraction, transcription, transcript analysis, vision, and render preparation in the background. The frontend polls for stage level progress while the worker fans out parallel chapter analysis and per clip rendering.
Producer dashboard
- Source upload
- Stage progress polling
- Clip preview
- Metadata regen + render
- Subtitle style picker
Clip Engine service
- Upload API (task id)
- Worker pool
- Stage-level status
- Artifact storage
- Render manifests
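The submit-then-poll contract described above can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the production code: a thread pool plays the role of the Celery worker, a dictionary plays the role of the status store, and `JobBoard`, the stage identifiers, and the status fields are all hypothetical names.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from threading import Lock

# Stage names follow the prose above; the identifiers are illustrative.
STAGES = ["transcode", "trim_pauses", "extract_audio", "transcribe",
          "analyze_transcript", "vision", "prepare_render"]


class JobBoard:
    """In-memory stand-in for a task queue plus a status store (assumption)."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._lock = Lock()
        self._status: dict = {}
        self._futures: dict = {}

    def submit(self, source_path: str) -> str:
        """Register the job and return a task id without waiting for the work."""
        task_id = uuid.uuid4().hex
        with self._lock:
            self._status[task_id] = {"state": "queued", "stage": None, "done": []}
        future: Future = self._pool.submit(self._run, task_id, source_path)
        self._futures[task_id] = future
        return task_id

    def _run(self, task_id: str, source_path: str) -> None:
        for stage in STAGES:
            self._update(task_id, state="running", stage=stage)
            # The real media work for `stage` would run here.
            with self._lock:
                self._status[task_id]["done"].append(stage)
        self._update(task_id, state="finished", stage=None)

    def _update(self, task_id: str, **fields) -> None:
        with self._lock:
            self._status[task_id].update(fields)

    def status(self, task_id: str) -> dict:
        """Snapshot that a polling endpoint could return to the dashboard."""
        with self._lock:
            snapshot = dict(self._status[task_id])
            snapshot["done"] = list(snapshot["done"])
            return snapshot

    def wait(self, task_id: str) -> None:
        """Block until the background job completes (useful in tests)."""
        self._futures[task_id].result()
```

The point of the shape is that `submit` never blocks on media work, so the caller gets an id back immediately and learns everything else through `status`.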
Every later stage trusts the ingest contract.
Normalizing the codec, removing dead air, and producing word level timing up front means transcript analysis, vision, and rendering can all rely on consistent timing and a known media format. Each stage publishes human readable progress so the dashboard knows where the job is.
- 01 Normalize: Transcode to H.264 / AAC for predictable downstream handling.
- 02 Trim pauses: Strip long silences before transcription to tighten timing.
- 03 Extract audio: Pull a clean audio track for the speech model.
- 04 Transcribe: Local Whisper with word-level timing.
- 05 Hand off: Hand the timestamped transcript to the analysis layer.
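As a rough illustration of the normalize and extract-audio steps, the sketch below only builds FFmpeg argument lists and does not run them. The function names, the file layout, and the mono 16 kHz WAV target are assumptions (that format is a common input for speech models); pause trimming is left out because its settings are not described here.

```python
from pathlib import Path


def normalize_cmd(src: Path, dst: Path) -> list:
    """Transcode to H.264 video and AAC audio."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-c:v", "libx264", "-c:a", "aac", str(dst)]


def extract_audio_cmd(src: Path, dst: Path) -> list:
    """Drop the video stream and write mono 16 kHz PCM audio (assumed target)."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-vn", "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", str(dst)]


def plan_ingest(src: Path, normalized: Path, audio: Path) -> list:
    """Return (stage name, command) pairs in execution order.

    Audio is extracted from the normalized file, so later stages share
    one timeline and one known media format.
    """
    return [
        ("normalize", normalize_cmd(src, normalized)),
        ("extract_audio", extract_audio_cmd(normalized, audio)),
    ]
```

Each command could be handed to `subprocess.run(cmd, check=True)` by the worker, with the stage name reported as progress before the call.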
Not every exciting sentence is a standalone story.
Instead of one giant prompt, the pipeline runs several smaller passes against the timestamped transcript. Each pass has a narrow job, a structured output, and an explicit failure mode. Weak or incomplete clips get rejected before any expensive media work runs.
Five passes, narrow outputs:
- chapter split
- completeness check
- hook detection
- boundary fitting
- candidate scoring
- 01 Chapter split: Long transcript becomes a list of story chapters with start and end timestamps. (47:22 transcript → 9 chapters)
- 02 Completeness: Setup-only and ending-only chapters get dropped before further work. (9 chapters → 6 eligible)
- 03 Hook detection: The strongest opening line of each eligible chapter is identified. (6 eligible → 6 hooks)
- 04 Boundary fitting: Start and end are snapped to nearby word boundaries with safe padding. (6 hooks → 6 candidates)
- 05 Scoring: Hook strength, context, arc, ending, and publishability rolled into a rank. (6 candidates → top 4)
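The scoring step can be pictured as a small ranking function. The sketch below is illustrative only: the five criteria come from the list above, but the 0 to 1 scale, the equal weighting, the `min_rank` cutoff, and all names are assumptions, not the production scoring logic.

```python
from dataclasses import dataclass

# Criteria named in the scoring step; scale and weighting are assumptions.
CRITERIA = ("hook", "context", "arc", "ending", "publishability")


@dataclass(frozen=True)
class Candidate:
    chapter: int
    scores: dict  # criterion name -> value in [0, 1]


def rank(candidate: Candidate) -> float:
    """Equal-weight mean over the criteria; a missing criterion counts as 0."""
    total = sum(candidate.scores.get(name, 0.0) for name in CRITERIA)
    return total / len(CRITERIA)


def top_candidates(candidates: list, k: int = 4, min_rank: float = 0.5) -> list:
    """Drop weak candidates, then keep the k best in descending rank order."""
    kept = [c for c in candidates if rank(c) >= min_rank]
    return sorted(kept, key=rank, reverse=True)[:k]
```

Filtering before sorting keeps a weak candidate out even when fewer than `k` strong ones exist, which matches the idea of rejecting clips before any media work runs.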
A wide news frame, kept steady on the subject.
News footage cuts between anchors, b-roll, guests, split screens, graphics, and wipe transitions. The vertical layer reads several signals at once and produces a smoothed crop track that stays on the right subject without darting around during transitions.
- Face and person detection (MediaPipe)
- Focal point tracking across frames
- Motion pre-filter on static frames
- Rolling average smoothing
- Scene classification (anchor · b-roll · split)
- Wipe transition handling
- Last known fallback on miss
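Two of the signals in this list, rolling average smoothing and the last-known fallback, are simple enough to sketch. The code assumes one normalized horizontal focal point per frame (`None` on a detection miss), a frame-center default before the first detection, and hypothetical function names; detection, scene classification, and wipe handling are out of scope here.

```python
from collections import deque


def smooth_crop_track(centers: list, window: int = 5, default: float = 0.5) -> list:
    """Smooth per-frame focal x positions (0..1) with a rolling average.

    A miss (None) reuses the last detected position; before any detection
    the frame center is assumed.
    """
    recent = deque(maxlen=window)
    last_known = None
    track = []
    for detected in centers:
        if detected is not None:
            last_known = detected
        target = last_known if last_known is not None else default
        recent.append(target)
        track.append(sum(recent) / len(recent))
    return track


def crop_left(center: float, src_w: int, src_h: int) -> int:
    """Left edge in pixels of a 9:16 crop centered on `center`, kept in frame."""
    crop_w = src_h * 9 // 16
    left = int(center * src_w - crop_w / 2)
    return max(0, min(left, src_w - crop_w))
```

A larger `window` gives a steadier frame at the cost of slower reaction when the subject really does move, so the value is a tuning choice rather than a constant.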
One pipeline, several editorial lenses.
Standalone clipping is the default. The same primitives also drive recap, political, and analysis modes by changing the question the pipeline asks. Which clips belong to the same developing story? What sequence gives the viewer enough context? What is a factual update versus reaction versus analysis?
Standalone clips
Default. Strongest 20 to 40 second moments from each story chapter, ranked by hook quality, context, and ending.
Recap
What happened so far. Chronological summary of an ongoing story across multiple broadcasts. Catches the viewer up before the latest update.
Political
Topic-based. A compilation built around a single political topic. Pulls statements, reactions, and related developments into one package.
Analysis
Explanatory. Interpretive clips with explanatory value. Cause and effect framing, expert commentary, and broader context.
Async upload, async worker
The API returns a task id immediately. The Celery worker runs heavy media work in the background while the dashboard polls for stage level state.
Multi pass AI filtering
Chaptering, hook detection, completeness, scoring, and metadata all happen as separate passes. Easier to debug, easier to throttle.
Word boundary aware cuts
LLM picked timestamps get snapped to nearby word starts and ends so clips never open or close mid syllable.
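A minimal sketch of the snapping idea, assuming the transcript is a time-sorted list of words with start and end times in seconds. `Word`, `snap_cut`, and the default padding are hypothetical; the sketch also does not stop the padding from reaching into a neighboring word, which a fuller version would need to handle.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def _nearest(sorted_values: list, t: float) -> float:
    """Return the value in a sorted, non-empty list that is closest to t."""
    i = bisect_left(sorted_values, t)
    neighbors = sorted_values[max(i - 1, 0): i + 1]
    return min(neighbors, key=lambda v: abs(v - t))


def snap_cut(words: list, start: float, end: float, pad: float = 0.05) -> tuple:
    """Snap a proposed (start, end) to the nearest word start and word end.

    `words` must be sorted by time. A small pad is added on both sides so
    the first and last words are not clipped.
    """
    snapped_start = _nearest([w.start for w in words], start)
    snapped_end = _nearest([w.end for w in words], end)
    if snapped_end <= snapped_start:
        raise ValueError("cut collapsed after snapping")
    return max(snapped_start - pad, 0.0), snapped_end + pad
```

Snapping the start against word starts and the end against word ends, rather than against one merged list of boundaries, is what keeps a cut from opening on the tail of the previous word.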
Scene aware vertical track
Faces, persons, motion, and scene class fuse into a smoothed crop track that handles anchors, b-roll, split screens, and wipe transitions.
Render ready manifests
The worker writes structured Remotion props per clip: source url, duration, captions, tracking, frame features, subtitle style.
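One way to picture a per-clip manifest is a small dataclass serialized to JSON. The fields mirror the list above, but every field name, type, and default here is an assumption for illustration; the actual prop schema is not part of this page.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ClipManifest:
    """Per-clip render inputs; field names are illustrative, not the real schema."""
    source_url: str
    duration_s: float
    captions: list = field(default_factory=list)        # e.g. word timing entries
    tracking: list = field(default_factory=list)        # e.g. per-frame crop centers
    frame_features: dict = field(default_factory=dict)  # e.g. scene class per range
    subtitle_style: str = "default"


def to_props_json(manifest: ClipManifest) -> str:
    """Serialize the manifest to a JSON string a renderer could take as props."""
    return json.dumps(asdict(manifest), sort_keys=True)
```

Writing the manifest as plain data keeps the renderer decoupled from the worker: a preview and a final render can read the same file.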
Reusable artifacts
Pre-cut clips, vertical variants, word timings, tracking data, thumbnails, and metadata are stored as discrete pieces.
Stage level progress
Every major step reports a human readable state so stalled jobs are easy to identify.
Retryable units of work
Pipeline stages are scoped so failures don't poison the whole run.
Parallel chapter analysis
Chapters fan out so long videos don't bottleneck on one story at a time.
Artifact reuse
Completed outputs are stored as discrete pieces and reused across previews and renders.
JSON parsing fallbacks
Imperfect LLM responses degrade through default structures, score thresholds, and structural filters.
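The fallback chain might look like the sketch below: try a strict parse, then try the outermost `{...}` span in the text, then fall back to a default structure, and apply a score threshold afterwards. The default fields, the threshold value, and the function names are assumptions rather than the production rules.

```python
import json
import re

# Assumed default structure for a pass that returns a hook and a score.
DEFAULT_RESULT = {"hook": None, "score": 0.0}


def _outer_object(raw: str):
    """Return the text from the first '{' to the last '}', or None."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    return match.group(0) if match else None


def parse_llm_json(raw: str, default: dict = DEFAULT_RESULT) -> dict:
    """Parse a model response into a dict, degrading to `default` on failure."""
    for candidate in (raw, _outer_object(raw)):
        if candidate is None:
            continue
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            return {**default, **data}  # missing keys are filled from default
    return dict(default)


def passes_threshold(parsed: dict, min_score: float = 0.6) -> bool:
    """Structural filter plus score threshold: needs a hook and a numeric score."""
    score = parsed.get("score")
    has_score = isinstance(score, (int, float)) and score >= min_score
    return bool(parsed.get("hook")) and has_score
```

Because the default carries a score of 0.0 and no hook, an unparseable response fails the threshold on its own, so a bad reply drops a candidate instead of failing the job.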
Length safeguards
Very long candidates get capped. Very short ones get skipped before render time.
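The length rule reduces to a small guard. In this sketch the 10 and 60 second limits are placeholder values, the function name is hypothetical, and capping by trimming the end is an assumption; a capped end would still need to be snapped back to a word boundary.

```python
def apply_length_guard(start: float, end: float,
                       min_s: float = 10.0, max_s: float = 60.0):
    """Return a usable (start, end) in seconds, or None when the clip is too short.

    Limits are placeholders. Overlong candidates are capped by trimming the end.
    """
    duration = end - start
    if duration < min_s:
        return None  # skipped before any render work
    if duration > max_s:
        return (start, start + max_s)
    return (start, end)
```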
Cleanup of intermediates
Extracted audio, transcoded intermediates, and render props are cleaned up after processing.
No proprietary prompts, infrastructure, or scoring logic.
Internal repository paths, customer identifiers, API keys, service credentials, deployment topology, proprietary scoring prompts, and company-private naming are intentionally omitted. Source media and generated artifacts are referenced through controlled storage in production. This page focuses on the architecture and engineering patterns rather than any production-specific configuration.
A scalable foundation for AI assisted news clipping.
The pipeline turns long-form news into short-form assets without removing the editor from the loop. It coordinates transcription, LLM reasoning, computer vision, and rendering as separate stages, each one observable and replaceable. Those primitives also drive recap, political, and analysis compilations from the same source material.