2024 · Gen AI · Video pipeline
Clip Engine

Turning a 47 minute briefing into seven 30 second clips.

An AI assisted media pipeline that transforms long form news video into short, vertical, social ready clips. Transcript intelligence finds the story. Computer vision keeps the right subject in a 9:16 frame. Producers stay in the loop where it matters.

Role: Backend pipeline + orchestration
Team: Repurposing pipeline
Stack: FastAPI, Celery, FFmpeg, Whisper, Remotion
Surface: Producer dashboard, async APIs
Status: Shipping
Sanitized: Yes
01 Overview

One long broadcast becomes a set of standalone clips with their own thumbnails, captions, and metadata.

The pipeline normalizes the source, transcribes with word level timing, segments the transcript into news chapters, finds strong hooks, scores candidates for editorial usefulness, snaps cuts to real word boundaries, and reframes the result for mobile. Each stage is a separate job, observable on its own, retryable on its own, and easy to evolve without rewriting the whole flow.

Pipeline capabilities (9)
  • Long form to short form
  • Transcript-aware selection
  • Chapter segmentation
  • Hook detection
  • Word-boundary cuts
  • Vertical reframing
  • Thumbnail generation
  • Caption styling
  • Compilation formats
02 Product problem

News editing repeats the same slow, manual loop for every broadcast.

A producer scrubs the full recording, identifies strong moments, crops to vertical, generates captions, picks a thumbnail, and exports several deliverables. Then they do it again for the next broadcast. The pipeline needed to assist with all of that while keeping human review possible where it mattered.

Before

One editor, one clip, the same loop on every broadcast.

  • Scrubbing through 45 minute recordings for hooks.
  • Manual cropping for vertical.
  • Captions and thumbnails done by hand.
  • No easy way to package recaps or analysis.
After

One editor per story, with the pipeline doing the heavy lifting.

  • Transcript driven hook detection per chapter.
  • Word boundary aware cut points.
  • Scene aware vertical reframing.
  • Multiple compilation formats from the same source.
03 My role

01 · Multi stage pipeline design

Modeled ingestion, transcription, analysis, vision, and rendering as separate jobs with explicit handoffs.

02 · Asynchronous orchestration

Coordinated long running media jobs through Celery so the upload API stays responsive and the worker fans out parallel work.

03 · Transcript intelligence

Multi pass chaptering, hook detection, completeness checks, and scoring on top of word level transcripts.

04 · Vision driven reframing

Face, person, motion, and scene signals merged into a smoothed crop track for vertical news output.

05 · Editorial output formats

Standalone clips plus recap, political, and analysis compilation modes off the same pipeline primitives.

06 · Producer review surface

Upload, progress, clip preview, metadata regeneration, subtitle styles, render, and download in the dashboard.

04 System design

Responsive uploads, with the expensive work handed off to background workers.

The dashboard uploads source media to a FastAPI surface that returns a task id immediately. Celery workers run transcoding, pause removal, audio extraction, transcription, transcript analysis, vision, and render preparation in the background. The frontend polls for stage level progress while the worker fans out parallel chapter analysis and per clip rendering.
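The submit-then-poll contract described above can be sketched without any framework. This is a minimal illustration, not the production code: a thread pool stands in for the Celery worker pool, an in-memory dict stands in for the Redis-backed status store, and the names `submit_job`, `get_status`, and `STAGES` are hypothetical.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stage names mirror the stages listed above; treating them as one strict
# sequence is an assumption made for this sketch.
STAGES = [
    "transcoding", "pause removal", "audio extraction", "transcription",
    "transcript analysis", "vision", "render preparation",
]

_jobs: dict[str, dict] = {}                 # stand-in for a shared status store
_pool = ThreadPoolExecutor(max_workers=2)   # stand-in for the worker pool


def _run_pipeline(task_id: str) -> None:
    """Walk the stages in order, publishing the current stage as it goes."""
    for stage in STAGES:
        _jobs[task_id].update(state="running", stage=stage)
        time.sleep(0.01)                    # placeholder for real media work
    _jobs[task_id].update(state="done", stage=None)


def submit_job(source_path: str) -> str:
    """What an upload handler would do: register, dispatch, return at once."""
    task_id = uuid.uuid4().hex[:8]
    _jobs[task_id] = {"state": "queued", "stage": None, "source": source_path}
    _pool.submit(_run_pipeline, task_id)
    return task_id


def get_status(task_id: str) -> dict:
    """What a polling endpoint would return to the dashboard."""
    job = _jobs[task_id]
    return {"task_id": task_id, "state": job["state"], "stage": job["stage"]}


if __name__ == "__main__":
    tid = submit_job("briefing.mp4")
    while (status := get_status(tid))["state"] != "done":
        print(status)
        time.sleep(0.02)
```

The point of the shape is that `submit_job` never waits on media work, so the caller gets an id back in constant time and all progress flows through the status lookup.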

Entry surface: FastAPI upload API
Orchestrator: Celery + Redis
Render layer: Remotion compositions
Request flow: async by default

Producer dashboard (surface: Next.js)

  • Source upload
  • Stage progress polling
  • Clip preview
  • Metadata regen + render
  • Subtitle style picker

Clip Engine service (backend: FastAPI + Celery)

  • Upload API (task id)
  • Worker pool
  • Stage-level status
  • Artifact storage
  • Render manifests

Upload + dispatch: Source video, configuration, task id returned immediately.
Stage progress: transcode · pauses · audio · transcribe · analyze · render.
Inside the worker (tools + models): FFmpeg, Whisper, Gemini, MediaPipe, Remotion
05 Ingestion flow

Every later stage trusts the ingest contract.

Normalizing the codec, removing dead air, and producing word level timing up front means transcript analysis, vision, and rendering can all rely on consistent timing and a known media format. Each stage publishes human readable progress so the dashboard knows where the job is.

  1. Normalize: Transcode to H.264 / AAC for predictable downstream handling.
  2. Trim pauses: Strip long silences before transcription to tighten timing.
  3. Extract audio: Pull a clean audio track for the speech model.
  4. Transcribe: Local Whisper with word level timing.
  5. Hand off: Hand the timestamped transcript to the analysis layer.

User facing progress: The dashboard sees transcoding, pause removal, audio extraction, transcription, transcript analysis, and clip rendering as discrete states. No long opaque wait.
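The normalize and audio extraction steps map onto two FFmpeg invocations. The sketch below only builds the argument lists; the function names are hypothetical, the mono 16 kHz WAV target is an assumption about what the speech model is fed, and pause trimming is left out because the write-up does not say how it is done.

```python
from pathlib import Path


def normalize_cmd(src: Path, dst: Path) -> list[str]:
    """FFmpeg arguments to transcode any input to H.264 video + AAC audio."""
    return [
        "ffmpeg", "-y",              # overwrite the output if it exists
        "-i", str(src),
        "-c:v", "libx264",           # H.264 video
        "-c:a", "aac",               # AAC audio
        "-movflags", "+faststart",   # put the MP4 index up front
        str(dst),
    ]


def extract_audio_cmd(src: Path, dst: Path) -> list[str]:
    """FFmpeg arguments to pull a mono 16 kHz WAV track (assumed format)."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                       # drop the video stream
        "-ac", "1",                  # mono
        "-ar", "16000",              # 16 kHz sample rate
        "-c:a", "pcm_s16le",         # uncompressed 16-bit PCM
        str(dst),
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)` inside a worker task, which keeps the command easy to log and to retry on its own.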
06 Transcript intelligence

Not every exciting sentence is a standalone story.

Instead of one giant prompt, the pipeline runs several smaller passes against the timestamped transcript. Each pass has a narrow job, a structured output, and an explicit failure mode. Weak or incomplete clips get rejected before any expensive media work runs.

Five passes, narrow outputs:

  • chapter split
  • completeness check
  • hook detection
  • boundary fitting
  • candidate scoring
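The deterministic end of that chain, rejecting incomplete chapters and ranking what is left, might look like the sketch below. The LLM passes are represented only by the fields they would fill in. The `Chapter` shape, the flag names, and the weights are all hypothetical; the real scoring logic is intentionally not part of this page.

```python
from dataclasses import dataclass


@dataclass
class Chapter:
    title: str
    start: float               # seconds
    end: float
    has_setup: bool            # would be filled by the completeness pass
    has_payoff: bool
    scores: dict[str, float]   # per-criterion scores in [0, 1]


# Hypothetical weights over the five criteria named in the scoring pass.
WEIGHTS = {
    "hook": 0.30, "context": 0.20, "arc": 0.20,
    "ending": 0.15, "publishability": 0.15,
}


def is_complete(ch: Chapter) -> bool:
    """Completeness pass: drop setup-only and ending-only chapters."""
    return ch.has_setup and ch.has_payoff


def rank_score(ch: Chapter) -> float:
    """Scoring pass: weighted sum; a missing criterion counts as 0."""
    return sum(w * ch.scores.get(name, 0.0) for name, w in WEIGHTS.items())


def select_candidates(chapters: list[Chapter], top_k: int) -> list[Chapter]:
    """Reject incomplete chapters first, then keep the top_k by score."""
    eligible = [ch for ch in chapters if is_complete(ch)]
    return sorted(eligible, key=rank_score, reverse=True)[:top_k]
```

Filtering before scoring is the design choice that matters here: a chapter with a strong hook but no payoff never reaches the ranking step, so it cannot crowd out a complete story.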
Multi pass analysis (LLM + deterministic snap)

  1. Chapter split: Long transcript becomes a list of story chapters with start and end timestamps. (47:22 transcript → 9 chapters)
  2. Completeness: Setup-only and ending-only chapters get dropped before further work. (9 chapters → 6 eligible)
  3. Hook detection: The strongest opening line of each eligible chapter is identified. (6 eligible → 6 hooks)
  4. Boundary fitting: Start and end are snapped to nearby word boundaries with safe padding. (6 hooks → 6 candidates)
  5. Scoring: Hook strength, context, arc, ending, and publishability rolled into a rank. (6 candidates → top 4)

Word boundary snap (lossless padding)

Transcript: "and the Q3 numbers exceeded all our targets by a wide margin"
LLM picked: 0:14.21 → 0:18.40
Snapped: 0:14.07 → 0:18.62
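One way to implement the snap is to widen the LLM range until it covers every word it touches, then add padding that is not allowed to reach into a neighbouring word. This is a sketch under that assumption, not the production rule. The `Word` shape, the 0.05 s padding, and the word timings in the usage example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float


def snap_to_words(words: list[Word], t0: float, t1: float,
                  pad: float = 0.05) -> tuple[float, float]:
    """Snap (t0, t1) outward to whole words, with padding kept in the gaps."""
    # Indices of every word that overlaps the requested range.
    hits = [i for i, w in enumerate(words) if w.end > t0 and w.start < t1]
    if not hits:
        raise ValueError("range does not overlap any word")
    first, last = hits[0], hits[-1]
    # Padding may extend into silence but never into the neighbouring word.
    floor = words[first - 1].end if first > 0 else 0.0
    ceiling = words[last + 1].start if last + 1 < len(words) else float("inf")
    start = max(floor, words[first].start - pad)
    end = min(ceiling, words[last].end + pad)
    return round(start, 2), round(end, 2)


# Hypothetical word timings for illustration only.
WORDS = [
    Word("and", 13.80, 14.02), Word("the", 14.12, 14.30),
    Word("Q3", 14.35, 14.80), Word("numbers", 14.85, 15.40),
    Word("exceeded", 15.45, 16.10), Word("targets", 16.15, 16.90),
]

if __name__ == "__main__":
    # A cut that lands inside "the" and inside "numbers" gets widened
    # to cover both words in full.
    print(snap_to_words(WORDS, 14.21, 15.10))
```

Because the snap is deterministic, it can run after every LLM pass without another model call, and the same input always produces the same cut.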
07 Vertical reframing

A wide news frame, kept steady on the subject.

News footage cuts between anchors, b-roll, guests, split screens, graphics, and wipe transitions. The vertical layer reads several signals at once and produces a smoothed crop track that stays on the right subject without darting around during transitions.

  • Face and person detection (MediaPipe)
  • Focal point tracking across frames
  • Motion pre-filter on static frames
  • Rolling average smoothing
  • Scene classification (anchor · b-roll · split)
  • Wipe transition handling
  • Last known fallback on miss
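Two of the items above, rolling average smoothing and the last known fallback, fit in a few lines. The sketch assumes a detector has already produced one focal x-center per frame (or `None` on a miss); detection, scene classification, and wipe handling are not shown, and the function names and window size are hypothetical.

```python
from collections import deque


def crop_width(src_h: int) -> int:
    """Width of a full-height 9:16 window inside the source frame."""
    return round(src_h * 9 / 16)


def smooth_track(centers: list, src_w: int, src_h: int,
                 window: int = 5) -> list[int]:
    """Turn per-frame focal x-centers into left offsets of the crop window.

    A miss (None) reuses the last known center, a rolling mean over the
    last `window` frames damps jitter, and the result is clamped so the
    crop window never leaves the frame.
    """
    w = crop_width(src_h)
    recent = deque(maxlen=window)
    last = src_w / 2                       # centered until a subject is seen
    lefts = []
    for center in centers:
        last = center if center is not None else last
        recent.append(last)
        mean = sum(recent) / len(recent)
        left = min(max(mean - w / 2, 0), src_w - w)
        lefts.append(round(left))
    return lefts


if __name__ == "__main__":
    # 1920x1080 source, one missed detection in the middle.
    print(smooth_track([960, None, 1000], 1920, 1080, window=2))
```

A real track would probably also reset the window at hard scene cuts, so the crop jumps with the edit instead of sliding across it; that reset is left out here.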
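[Figure: 16:9 source → 9:16 output, live track. A 9:16 crop window follows a detected face (confidence 0.94) in a b-roll frame captioned "Q3 numbers exceeded targets". A timeline starting at 0:00 marks scene cuts, labeled anchor, b-roll, split, and wipe.]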
08 Editorial formats

One pipeline, several editorial lenses.

Standalone clipping is the default. The same primitives also drive recap, political, and analysis modes by changing the question the pipeline asks. Which clips belong to the same developing story? What sequence gives the viewer enough context? What is factual update versus reaction versus analysis?

[Figure: four example clips of 0:28, 0:30, 0:32, and 0:34]

Standalone clips (default)

Strongest 20 to 40 second moments from each story chapter, ranked by hook quality, context, and ending.

[Figure: story timeline. Mar 02 Initial report · Mar 04 Government response · Mar 07 Field update · Today Latest development]

Recap (what happened so far)

Chronological summary of an ongoing story across multiple broadcasts. Catches the viewer up before the latest update.

Political (topic-based)

Theme based compilation around a political topic. Pulls statements, reactions, and related developments into one package.

[Figure: explanatory weight per clip: 0.35, 0.65, 0.92, 0.55, 0.78 · top 5 picked]

Analysis (explanatory)

Interpretive clips with explanatory value. Cause and effect framing, expert commentary, and broader context.

09 Technical highlights

POST /jobs → 200 task_8c1
GET /jobs/8c1 → transcribing 64%
GET /jobs/8c1 → analyzing 3 / 6
GET /jobs/8c1 → done · 4 clips

Async upload, async worker

The API returns a task id immediately. The Celery worker runs heavy media work in the background while the dashboard polls for stage level state.

chapter → complete → hook → snap → score

Multi pass AI filtering

Chaptering, hook detection, completeness, scoring, and metadata all happen as separate passes. Easier to debug, easier to throttle.

llm boundary → snapped to word

Word boundary aware cuts

LLM picked timestamps get snapped to nearby word starts and ends so clips never open or close mid word.

[Figure: raw focus track vs. smoothed track across a sequence of scenes: anchor, wipe, b-roll, split, anchor, b-roll]

Scene aware vertical track

Faces, persons, motion, and scene class fuse into a smoothed crop track that handles anchors, b-roll, split, and wipes.

render.props.json (remotion)
  source: https://…/clip_3.mp4
  duration: 30.42
  subtitles: [{w, t0, t1}…]
  tracking: {x, y, scale}
  style: ticker.dark

Render ready manifests

The worker writes structured Remotion props per clip: source url, duration, captions, tracking, frame features, subtitle style.
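A manifest with the fields named above could be produced from plain dataclasses. This is a sketch, not the actual schema: the class names and the `clip_3.mp4` placeholder path are hypothetical, and frame features are omitted because their shape is not described here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Caption:
    w: str       # word text
    t0: float    # start, in seconds
    t1: float    # end, in seconds


@dataclass
class RenderProps:
    source: str                    # location of the pre-cut clip
    duration: float                # seconds
    subtitles: list[Caption] = field(default_factory=list)
    tracking: dict = field(default_factory=dict)   # {x, y, scale}
    style: str = "ticker.dark"

    def to_json(self) -> str:
        """Serialize to the JSON a render step would receive as props."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    props = RenderProps(
        source="clip_3.mp4",       # placeholder, not a real location
        duration=30.42,
        subtitles=[Caption("Q3", 0.0, 0.4)],
        tracking={"x": 0.5, "y": 0.5, "scale": 1.0},
    )
    print(props.to_json())
```

Writing the manifest as data keeps the renderer free of pipeline logic: it only needs to read props, and the same file doubles as a debugging record for the clip.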

clip.mp4 · vert.mp4 · words.json · track.json · thumb.jpg · meta.json

Reusable artifacts

Pre-cut clips, vertical variants, word timings, tracking data, thumbnails, and metadata are stored as discrete pieces.

10 Reliability patterns
01

Stage level progress

Every major step reports a human readable state so stalled jobs are easy to identify.

02

Retryable units of work

Pipeline stages are scoped so failures don't poison the whole run.

03

Parallel chapter analysis

Chapters fan out so long videos don't bottleneck on one story at a time.
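The fan-out can be illustrated with a thread pool standing in for the worker pool; in Celery the analogous primitive would be a `group` of per-chapter tasks. `analyze_chapter` is a hypothetical placeholder for the real per-chapter passes.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_chapter(chapter: dict) -> dict:
    """Placeholder for the per-chapter passes (hook detection, scoring)."""
    return {
        "title": chapter["title"],
        "duration": chapter["end"] - chapter["start"],
    }


def analyze_all(chapters: list[dict], max_workers: int = 4) -> list[dict]:
    """Analyze every chapter concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_chapter, chapters))
```

Since each chapter is analyzed in isolation, one slow or failing chapter can be retried without touching the others.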

04

Artifact reuse

Completed outputs are stored as discrete pieces and reused across previews and renders.

05

JSON parsing fallbacks

Imperfect LLM responses degrade gracefully: unparseable output falls back to default structures, and score thresholds and structural filters drop weak results.
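A tolerant parser in that spirit might try the raw response first, then the first brace-delimited span, and finally fall back to a default structure that the score threshold then filters out. The default shape, the threshold value, and the function names are assumptions made for this sketch.

```python
import json
import re

DEFAULT = {"hook": "", "score": 0.0}   # hypothetical default structure
MIN_SCORE = 0.5                        # hypothetical threshold


def parse_llm_json(raw: str) -> dict:
    """Best-effort parse: whole string, then a {...} span, then the default."""
    candidates = [raw] + re.findall(r"\{.*\}", raw, flags=re.DOTALL)
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            # Fill any missing keys from the default structure.
            return {**DEFAULT, **data}
    return dict(DEFAULT)


def usable(result: dict) -> bool:
    """Structural filter plus score threshold."""
    try:
        return bool(result["hook"]) and float(result["score"]) >= MIN_SCORE
    except (TypeError, ValueError):
        return False
```

A response wrapped in chatty text still parses, while pure garbage becomes the default structure and fails `usable`, so the candidate is dropped instead of crashing the job.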

06

Length safeguards

Very long candidates get capped. Very short ones get skipped before render time.
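As a sketch, the guard is a filter plus a clamp. The 8 second minimum is a made-up value, and the 40 second cap is an assumption borrowed from the 20 to 40 second range used for standalone clips.

```python
MIN_SECONDS = 8.0    # hypothetical lower bound
MAX_SECONDS = 40.0   # assumed cap, taken from the standalone clip range


def apply_length_guards(candidates):
    """Skip candidates that are too short and cap ones that run too long.

    `candidates` is a list of (start, end) pairs in seconds.
    """
    kept = []
    for start, end in candidates:
        if end - start < MIN_SECONDS:
            continue                          # too short to stand alone
        kept.append((start, min(end, start + MAX_SECONDS)))
    return kept
```

A capped end point would presumably need to be snapped to a word boundary again before rendering, since a hard cap can land mid word.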

07

Cleanup of intermediates

Extracted audio, transcoded intermediates, and render props are cleaned up after processing.

11 Privacy & security
Sanitized for portfolio

No proprietary prompts, infrastructure, or scoring logic.

Internal repository paths, customer identifiers, API keys, service credentials, deployment topology, proprietary scoring prompts, and company private naming are intentionally omitted. Source media and generated artifacts are referenced through controlled storage in production. This page focuses on the architecture and engineering patterns rather than any production specific configuration.

12 Outcome

A scalable foundation for AI assisted news clipping.

The pipeline turns long form news into short form assets without removing the editor from the loop. It coordinates transcription, LLM reasoning, computer vision, and rendering as separate stages, each one observable and replaceable. The same primitives also drive recap, political, and analysis compilations off the same source video.

Faster: Manual scrubbing replaced with stage level automation.
Observable: Each pipeline stage reports its own status and errors.
Modular: Stages and providers can be improved independently.
Editorial: Producer still owns metadata, style, and final render.
Case study · 03 of 04 · clip-engine