2024 · Gen AI · Video pipeline

Clip Engine

Turning a 47 minute briefing into seven 30 second clips.

An AI assisted media pipeline that transforms long form news video into short, vertical, social ready clips. Transcript intelligence finds the story. Computer vision keeps the right subject in a 9:16 frame. Producers stay in the loop where it matters.

Role: Backend pipeline + orchestration
Team: Repurposing pipeline
Stack: FastAPI, Celery, FFmpeg, Whisper, Remotion
Surface: Producer dashboard, async APIs
Status: Shipping
Sanitized: Yes


01 Overview

One long broadcast becomes a set of standalone clips with their own thumbnails, captions, and metadata.

The pipeline normalizes the source, transcribes with word level timing, segments the transcript into news chapters, finds strong hooks, scores candidates for editorial usefulness, snaps cuts to real word boundaries, and reframes the result for mobile. Each stage is a separate job, observable on its own, retryable on its own, and easy to evolve without rewriting the whole flow.

Pipeline capabilities (9)

Long form to short form
Transcript-aware selection
Chapter segmentation
Hook detection
Word-boundary cuts
Vertical reframing
Thumbnail generation
Caption styling
Compilation formats

02 Product problem

News editing repeats the same slow, manual loop for every broadcast.

A producer scrubs the full recording, identifies strong moments, crops to vertical, generates captions, picks a thumbnail, and exports several deliverables. Then they do it again for the next broadcast. The pipeline needed to assist with all of that while keeping human review possible where it mattered.

Before

One editor, one clip, the same loop on every broadcast.

· Scrubbing through 45 minute recordings for hooks.
· Manual cropping for vertical.
· Captions and thumbnails done by hand.
· No easy way to package recaps or analysis.

After

One editor per story, with the pipeline doing the heavy lifting.

· Transcript driven hook detection per chapter.
· Word boundary aware cut points.
· Scene aware vertical reframing.
· Multiple compilation formats from the same source.

03 My role

01 · Multi stage pipeline design

Modeled ingestion, transcription, analysis, vision, and rendering as separate jobs with explicit handoffs.

02 · Asynchronous orchestration

Coordinated long running media jobs through Celery so the upload API stays responsive and the worker fans out parallel work.

03 · Transcript intelligence

Multi pass chaptering, hook detection, completeness checks, and scoring on top of word level transcripts.

04 · Vision driven reframing

Face, person, motion, and scene signals merged into a smoothed crop track for vertical news output.

05 · Editorial output formats

Standalone clips plus recap, political, and analysis compilation modes off the same pipeline primitives.

06 · Producer review surface

Upload, progress, clip preview, metadata regeneration, subtitle styles, render, and download in the dashboard.

04 System design

Responsive uploads, with the expensive work handed off to background workers.

The dashboard uploads source media to a FastAPI surface that returns a task id immediately. Celery workers run transcoding, pause removal, audio extraction, transcription, transcript analysis, vision, and render preparation in the background. The frontend polls for stage level progress while the worker fans out parallel chapter analysis and per clip rendering.

Entry surface: FastAPI upload API

Orchestrator: Celery + Redis

Render layer: Remotion compositions

Request flow: async by default

Surface (Next.js): Producer dashboard

Source upload
Stage progress polling
Clip preview
Metadata regen + render
Subtitle style picker

Backend (FastAPI + Celery): Clip Engine service

Upload API (task id)
Worker pool
Stage-level status
Artifact storage
Render manifests

Upload + dispatch: Source video, configuration, task id returned immediately.

Stage progress: transcode · pauses · audio · transcribe · analyze · render.

Inside the worker (tools + models)

FFmpeg
Whisper
Gemini
MediaPipe
Remotion

05 Ingestion flow

Every later stage trusts the ingest contract.

Normalizing the codec, removing dead air, and producing word level timing up front means transcript analysis, vision, and rendering can all rely on consistent timing and a known media format. Each stage publishes human readable progress so the dashboard knows where the job is.

01 Normalize: Transcode to H.264 / AAC for predictable downstream handling.
02 Trim pauses: Strip long silences before transcription to tighten timing.
03 Extract audio: Pull a clean audio track for the speech model.
04 Transcribe: Local Whisper with word level timing.
05 Hand off: Hand the timestamped transcript to the analysis layer.

User facing progress: The dashboard sees transcoding, pause removal, audio extraction, transcription, transcript analysis, and clip rendering as discrete states. No long opaque wait.
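As a rough illustration of the normalize and extract audio steps, the sketch below builds FFmpeg argument lists. The function names and file names are hypothetical, and the mono 16 kHz WAV target is an assumption (a common input format for speech models), not a setting confirmed by this page. Pause trimming is left out.

```python
from pathlib import Path


def normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build an FFmpeg command that transcodes to H.264 video and AAC audio."""
    return [
        "ffmpeg", "-y",          # overwrite the output if it already exists
        "-i", str(src),
        "-c:v", "libx264",       # H.264 video encoder
        "-c:a", "aac",           # AAC audio encoder
        str(dst),
    ]


def extract_audio_cmd(src: Path, dst: Path) -> list[str]:
    """Build an FFmpeg command that drops video and writes mono 16 kHz PCM."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                   # no video stream in the output
        "-ac", "1",              # one audio channel (assumed)
        "-ar", "16000",          # 16 kHz sample rate (assumed)
        "-c:a", "pcm_s16le",     # 16-bit PCM, the usual WAV payload
        str(dst),
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)`, so a non-zero exit from FFmpeg surfaces as a failed stage rather than a silent bad artifact.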

06 Transcript intelligence

Not every exciting sentence is a standalone story.

Instead of one giant prompt, the pipeline runs several smaller passes against the timestamped transcript. Each pass has a narrow job, a structured output, and an explicit failure mode. Weak or incomplete clips get rejected before any expensive media work runs.

Five passes, narrow outputs:

chapter split
completeness check
hook detection
boundary fitting
candidate scoring

Multi pass analysis (LLM + deterministic snap)

01 Chapter split: Long transcript becomes a list of story chapters with start and end timestamps.
47:22 transcript → 9 chapters
02 Completeness: Setup-only and ending-only chapters get dropped before further work.
9 chapters → 6 eligible
03 Hook detection: The strongest opening line of each eligible chapter is identified.
6 eligible → 6 hooks
04 Boundary fitting: Start and end are snapped to nearby word boundaries with safe padding.
6 hooks → 6 candidates
05 Scoring: Hook strength, context, arc, ending, and publishability rolled into a rank.
6 candidates → top 4
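The completeness and scoring passes behave like a funnel: filter first, then rank what survives. The sketch below shows that shape only. The `Chapter` fields, the equal weighting of the five criteria, and the function names are assumptions for illustration; the real prompts and scoring logic are not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str
    start: float
    end: float
    has_setup: bool
    has_ending: bool
    # Per-criterion scores in [0, 1]; in practice a model pass would fill these.
    scores: dict[str, float] = field(default_factory=dict)


# Criteria named in the scoring step; weighting them equally is an assumption.
CRITERIA = ("hook", "context", "arc", "ending", "publishability")


def is_complete(ch: Chapter) -> bool:
    """Completeness pass: drop setup-only and ending-only chapters."""
    return ch.has_setup and ch.has_ending


def rank_score(ch: Chapter) -> float:
    """Scoring pass: average the criteria, counting a missing one as 0."""
    return sum(ch.scores.get(name, 0.0) for name in CRITERIA) / len(CRITERIA)


def select_clips(chapters: list[Chapter], top_n: int = 4) -> list[Chapter]:
    """Filter out incomplete chapters, then keep the top_n by rank score."""
    eligible = [ch for ch in chapters if is_complete(ch)]
    return sorted(eligible, key=rank_score, reverse=True)[:top_n]
```

Because rejection happens before ranking, a setup-only chapter with a strong hook never reaches the expensive media stages.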

Word boundary snap (lossless padding)

and the Q3 numbers exceeded all our targets by a wide margin

LLM picked · 0:14.21 → 0:18.40
snapped · 0:14.07 → 0:18.62
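One way to implement the deterministic snap is to move the picked start to the nearest word start, move the picked end to the nearest word end, and then pad outward. The sketch below does that. The word timings and the 0.05 second pad are hypothetical values chosen so the result matches the example above; they are not taken from a real transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def snap_to_words(words: list[Word], picked_start: float, picked_end: float,
                  pad: float = 0.05) -> tuple[float, float]:
    """Snap a picked range to the nearest word start and end, then pad outward."""
    first = min(words, key=lambda w: abs(w.start - picked_start))
    last = min(words, key=lambda w: abs(w.end - picked_end))
    if last.end <= first.start:
        raise ValueError("snapped range is empty")
    start = max(0.0, first.start - pad)   # never pad before the start of the media
    end = last.end + pad
    return round(start, 2), round(end, 2)


# Hypothetical word timings for the sentence in the example above.
WORDS = [
    Word("and", 14.12, 14.28), Word("the", 14.32, 14.45),
    Word("Q3", 14.50, 14.95), Word("numbers", 14.98, 15.40),
    Word("exceeded", 15.45, 16.00), Word("all", 16.05, 16.25),
    Word("our", 16.28, 16.45), Word("targets", 16.50, 17.00),
    Word("by", 17.05, 17.20), Word("a", 17.22, 17.28),
    Word("wide", 17.30, 17.90), Word("margin", 17.95, 18.57),
]

print(snap_to_words(WORDS, 14.21, 18.40))  # (14.07, 18.62)
```

The picked start of 14.21 falls inside "and", so it snaps back to that word's start. The picked end of 18.40 falls inside "margin", so it snaps forward to that word's end.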

07 Vertical reframing

A wide news frame, kept steady on the subject.

News footage cuts between anchors, b-roll, guests, split screens, graphics, and wipe transitions. The vertical layer reads several signals at once and produces a smoothed crop track that stays on the right subject without darting around during transitions.

Face and person detection (MediaPipe)
Focal point tracking across frames
Motion pre-filter on static frames
Rolling average smoothing
Scene classification (anchor · b-roll · split)
Wipe transition handling
Last known fallback on miss
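Two of the signals listed above, rolling average smoothing and last known fallback on a miss, can be sketched in a few lines. This is an illustrative version under stated assumptions: it tracks only the horizontal focal point, uses a trailing window of 5 frames, and assumes a 1920×1080 source. Detection, scene classification, and wipe handling are out of scope here, and `crop_track` is a hypothetical name.

```python
from collections import deque
from typing import Optional


def crop_track(focal_x: list[Optional[float]], src_w: int = 1920,
               src_h: int = 1080, window: int = 5) -> list[int]:
    """Return the left edge of a full-height 9:16 crop for each frame.

    focal_x holds the detected subject centre per frame, or None on a miss.
    """
    crop_w = src_h * 9 // 16             # width of a full-height 9:16 window
    max_left = src_w - crop_w
    recent: deque = deque(maxlen=window)
    last_known = src_w / 2               # stay centred until something is detected
    track = []
    for x in focal_x:
        if x is not None:
            last_known = x
        recent.append(last_known)        # on a miss, reuse the last known position
        centre = sum(recent) / len(recent)           # trailing rolling average
        left = int(round(centre - crop_w / 2))
        track.append(min(max(left, 0), max_left))    # keep the crop inside the frame
    return track
```

A sudden jump in the detected position moves the crop only part of the way on the next frame, which is what keeps the frame from darting around on noisy detections.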

[Figure: 16:9 source → 9:16 output, live track. A 9:16 crop window follows a detected face (0.94) over a b-roll frame captioned "Q3 numbers exceeded targets", above a timeline of scene cuts labeled anchor, b-roll, split, and wipe.]

08 Editorial formats

One pipeline, several editorial lenses.

Standalone clipping is the default. The same primitives also drive recap, political, and analysis modes by changing the question the pipeline asks: Which clips belong to the same developing story? What sequence gives the viewer enough context? What is a factual update versus reaction versus analysis?

[Figure: four example clips running 0:28, 0:30, 0:32, and 0:34.]

Standalone clips (default)

Strongest 20 to 40 second moments from each story chapter, ranked by hook quality, context, and ending.

[Figure: story timeline. Mar 02: Initial report · Mar 04: Government response · Mar 07: Field update · Today: Latest development.]

Recap (what happened so far)

Chronological summary of an ongoing story across multiple broadcasts. Catches the viewer up before the latest update.

Political (topic-based)

Theme based compilation around a political topic. Pulls statements, reactions, and related developments into one package.

[Figure: explanatory weight per candidate clip: 0.35, 0.65, 0.92, 0.55, 0.78. Top 5 picked.]

Analysis (explanatory)

Interpretive clips with explanatory value. Cause and effect framing, expert commentary, and broader context.

09 Technical highlights

POST /jobs → 200 task_8c1
GET /jobs/8c1 → transcribing 64%
GET /jobs/8c1 → analyzing 3 / 6
GET /jobs/8c1 → done · 4 clips

Async upload, async worker

The API returns a task id immediately. The Celery worker runs heavy media work in the background while the dashboard polls for stage level state.

chapter › complete › hook › snap › score

Multi pass AI filtering

Chaptering, hook detection, completeness, scoring, and metadata all happen as separate passes. Easier to debug, easier to throttle.

LLM boundary → snapped to word

Word boundary aware cuts

LLM picked timestamps get snapped to nearby word starts and ends so clips never open or close mid syllable.

[Figure: raw focus track versus smoothed track across a sequence of anchor, wipe, b-roll, split, anchor, and b-roll scenes.]

Scene aware vertical track

Faces, persons, motion, and scene class fuse into a smoothed crop track that handles anchors, b-roll, split, and wipes.

render.props.json (Remotion)

source: https://…/clip_3.mp4
duration: 30.42
subtitles: [{w, t0, t1}…]
tracking: {x, y, scale}
style: ticker.dark

Render ready manifests

The worker writes structured Remotion props per clip: source url, duration, captions, tracking, frame features, subtitle style.
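A render manifest of this shape could be produced from typed records and serialized as JSON. The sketch below is one possible layout, not the actual schema: the field names follow the sample above, but the class names, the single tracking record (a real track would likely hold one entry per frame), and the omission of frame features are assumptions. The source value is a placeholder file name.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Caption:
    w: str      # word text
    t0: float   # start, seconds
    t1: float   # end, seconds


@dataclass
class Tracking:
    x: float
    y: float
    scale: float


@dataclass
class RenderProps:
    source: str
    duration: float
    subtitles: list[Caption]
    tracking: Tracking
    style: str


def to_manifest(props: RenderProps) -> str:
    """Serialize the props to the JSON text a render job would read."""
    return json.dumps(asdict(props), indent=2)


props = RenderProps(
    source="clip_3.mp4",  # placeholder; the real value is a storage URL
    duration=30.42,
    subtitles=[Caption("and", 0.05, 0.21), Caption("the", 0.25, 0.38)],
    tracking=Tracking(x=0.5, y=0.4, scale=1.0),
    style="ticker.dark",
)
manifest = to_manifest(props)
```

Keeping the manifest as plain JSON means the worker and the render layer share only a data contract, so either side can change without touching the other.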

clip.mp4

vert.mp4

words.json

track.json

thumb.jpg

meta.json

Reusable artifacts

Pre-cut clips, vertical variants, word timings, tracking data, thumbnails, and metadata are stored as discrete pieces.

10 Reliability patterns

Stage level progress

Every major step reports a human readable state so stalled jobs are easy to identify.

Retryable units of work

Pipeline stages are scoped so failures don't poison the whole run.


Parallel chapter analysis

Chapters fan out so long videos don't bottleneck on one story at a time.

Artifact reuse

Completed outputs are stored as discrete pieces and reused across previews and renders.

JSON parsing fallbacks

Imperfect LLM responses degrade gracefully through default structures, score thresholds, and structural filters.

Length safeguards

Very long candidates get capped. Very short ones get skipped before render time.

Cleanup of intermediates

Extracted audio, transcoded intermediates, and render props are cleaned up after processing.
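Two of the patterns above, JSON parsing fallbacks and length safeguards, can be sketched as small pure functions. The default structure, the 0.5 score threshold, and the function names are hypothetical. The 20 and 40 second bounds are an assumption borrowed from the standalone clip target described earlier.

```python
import json
from typing import Optional

DEFAULT_RESULT = {"hook": "", "score": 0.0}   # hypothetical default structure
MIN_SCORE = 0.5                               # hypothetical score threshold
MIN_LEN, MAX_LEN = 20.0, 40.0                 # seconds; assumed bounds


def parse_llm_json(raw: str) -> dict:
    """Parse a model response, falling back to defaults instead of raising."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(DEFAULT_RESULT)
    if not isinstance(data, dict):            # structural filter: must be an object
        return dict(DEFAULT_RESULT)
    return {**DEFAULT_RESULT, **data}         # fill any missing keys with defaults


def passes_threshold(result: dict) -> bool:
    """Score threshold: reject missing, non-numeric, or low scores."""
    score = result.get("score")
    return isinstance(score, (int, float)) and score >= MIN_SCORE


def guard_length(start: float, end: float) -> Optional[tuple[float, float]]:
    """Cap very long candidates; return None for ones too short to render."""
    if end - start < MIN_LEN:
        return None
    return start, min(end, start + MAX_LEN)
```

A malformed response ends up as a default result with a score of 0.0, so it fails the threshold and is dropped quietly instead of crashing the stage.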

11 Privacy & security

Sanitized for portfolio

No proprietary prompts, infrastructure, or scoring logic.

Internal repository paths, customer identifiers, API keys, service credentials, deployment topology, proprietary scoring prompts, and company private naming are intentionally omitted. Source media and generated artifacts are referenced through controlled storage in production. This page focuses on the architecture and engineering patterns rather than any production specific configuration.

12 Outcome

A scalable foundation for AI assisted news clipping.

The pipeline turns long form news into short form assets without removing the editor from the loop. It coordinates transcription, LLM reasoning, computer vision, and rendering as separate stages, each one observable and replaceable. The same primitives also drive recap, political, and analysis compilations off the same source video.

Faster: Manual scrubbing replaced with stage level automation.

Observable: Each pipeline stage reports its own status and errors.

Modular: Stages and providers can be improved independently.

Editorial: Producer still owns metadata, style, and final render.

