A multi-stage avatar pipeline, behind a dedicated service.
Inputs in, reusable avatar identity and finished videos out. The hard part was not calling any single model. It was coordinating a long-running, multi-provider media pipeline where each stage has different latency, cost, failure behavior, and review semantics.
- Role: Engineer · pipeline + service boundary
- Team: Gen AI service
- Stack: FastAPI, Python, Postgres, RabbitMQ
- Surface: Internal job APIs, callbacks
- Status: Shipping
- Sanitized: Yes
A small set of inputs becomes a reusable avatar identity and generated video output.
The workflow validates input media, builds a consistent character identity, generates speech, plans visuals, renders scene assets, aligns audio and video, and produces a stitched final clip. I designed it around explicit contracts, durable stage state, resumable execution, and provider adapters, so the product could support creative iteration without coupling the core backend to AI execution.
- Avatar creation
- Voice cloning
- Script driven generation
- Storyboard planning
- Background generation
- Pose & b-roll branches
- Lip sync
- Stitching
- Final enhancement
Avatar video combines several expensive, failure-prone steps.
The steps are validating media, building character identity, synthesizing speech, planning visuals, rendering scene assets, aligning audio and video, and stitching the final output. Originally, much of this lived inside the product backend.
AI execution lived inside the product backend.
- Pipeline iteration coupled to product release.
- Provider swaps touched product models.
- Retries were one-off background tasks.
- Failure recovery meant rerunning the whole pipeline.
A dedicated Gen AI service owns execution.
- Product backend submits validated snapshots.
- Service runs the media pipeline.
- No imports of product models, no direct DB writes.
- Failure localized to a single stage.
Versionable job contracts
For avatar creation, output generation, retries, approvals, and voice cloning.
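A minimal sketch of what a versioned job contract could look like, using only the standard library. The field names, `CONTRACT_VERSION`, and `parse_output_job` are hypothetical and chosen for illustration; the actual contracts are not shown here.

```python
from dataclasses import dataclass
from typing import Optional

CONTRACT_VERSION = "1"  # hypothetical version tag carried by every job


@dataclass(frozen=True)
class OutputJobRequest:
    """Hypothetical contract for an output generation job."""
    idempotency_key: str
    avatar_id: str
    script: str
    contract_version: str = CONTRACT_VERSION
    resume_from_stage: Optional[str] = None
    approved_asset_urls: tuple = ()


def parse_output_job(raw: dict) -> OutputJobRequest:
    """Validate a raw payload and reject unknown contract versions."""
    version = raw.get("contract_version", CONTRACT_VERSION)
    if version != CONTRACT_VERSION:
        raise ValueError(f"unsupported contract version: {version}")
    required = ("idempotency_key", "avatar_id", "script")
    missing = [name for name in required if not raw.get(name)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return OutputJobRequest(
        idempotency_key=raw["idempotency_key"],
        avatar_id=raw["avatar_id"],
        script=raw["script"],
        contract_version=version,
        resume_from_stage=raw.get("resume_from_stage"),
        approved_asset_urls=tuple(raw.get("approved_asset_urls", ())),
    )
```

Freezing the dataclass keeps the parsed request immutable, which matches the idea of the service operating on a captured snapshot rather than live product state.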
Pipeline as persisted state
Stages modeled as state transitions instead of one-off background tasks.
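One way to picture stages as state transitions is a small table of allowed moves applied to a stage record. The status names and the `ALLOWED` table below are assumptions for illustration, not the service's real state machine.

```python
from enum import Enum


class StageStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    COMPLETED = "completed"
    FAILED = "failed"


# Hypothetical transition table: which statuses may follow which.
ALLOWED = {
    StageStatus.PENDING: {StageStatus.RUNNING},
    StageStatus.RUNNING: {
        StageStatus.AWAITING_APPROVAL,
        StageStatus.COMPLETED,
        StageStatus.FAILED,
    },
    StageStatus.AWAITING_APPROVAL: {StageStatus.COMPLETED, StageStatus.PENDING},
    StageStatus.FAILED: {StageStatus.PENDING},     # a retry re-queues the stage
    StageStatus.COMPLETED: {StageStatus.PENDING},  # a regeneration re-queues it
}


def transition(record: dict, new_status: StageStatus) -> dict:
    """Return an updated copy of a stage record, or raise on an illegal move."""
    current = StageStatus(record["status"])
    if new_status not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new_status.value}")
    updated = dict(record, status=new_status.value)
    if new_status is StageStatus.RUNNING:
        # Count each time the stage actually starts executing.
        updated["attempts"] = record.get("attempts", 0) + 1
    return updated
```

In a real service the returned record would be written back to the stage table in the same transaction that validated the move.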
Provider adapters
Image, video, audio, storyboard, background, b-roll, lip sync, and enhancement behind capability interfaces.
Lifecycle events
Structured accepted, started, progress, completed, and failed events back to the product backend.
Resumable execution
Targeted resume from a named failed or user-approved stage, without rerunning upstream work.
Contract & pipeline tests
Around routing, idempotency, persistence, callback delivery, and output shape.
A standalone service, durable state, and queue-backed workers.
Requests enter through internal job APIs, normalize into payload snapshots, persist with idempotency guarantees, and dispatch to background workers. The product backend stays the source of truth for users, workspaces, and permissions. The Gen AI service operates on immutable snapshots.
Product backend
- Users & workspaces
- Permissions
- Public API
- Job records
- Response shapes
Gen AI service
- Job APIs (idempotent)
- Persistence + snapshots
- Worker queues
- Pipeline modules
- Provider adapters
The identity is an asset graph, not a single file.
Later outputs reference a stable neutral image, a character sheet, and an optional voice profile. Each stage records its inputs and outputs, making retries deterministic for operations while still allowing creative regeneration where appropriate.
- 01 Validate: User-provided input images checked.
- 02 Neutral image: Consistent avatar baseline generated.
- 03 Character sheet: Visual consistency reference built.
- 04 Persist: Assets written through storage adapter.
- 05 Emit: Progress and completion to product backend.
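The asset graph idea can be sketched as a small immutable record that later jobs reference instead of a single file. `AvatarIdentity`, its field names, and the example URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AvatarIdentity:
    """Hypothetical view of the identity as a set of linked assets."""
    avatar_id: str
    neutral_image_url: str
    character_sheet_url: str
    voice_profile_id: Optional[str] = None  # the voice profile is optional

    def references(self) -> dict:
        """Assets a downstream output job would pin in its snapshot."""
        refs = {
            "neutral_image": self.neutral_image_url,
            "character_sheet": self.character_sheet_url,
        }
        if self.voice_profile_id is not None:
            refs["voice_profile"] = self.voice_profile_id
        return refs


identity = AvatarIdentity(
    avatar_id="avatar-1",
    neutral_image_url="https://assets.example/neutral.png",
    character_sheet_url="https://assets.example/sheet.png",
)
```

Here `identity.references()` returns only the two image entries, because no voice profile was attached.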
A branching DAG with shared stage primitives.
Single-scene, multi-scene, b-roll, pose-driven, and approval-driven paths share common stages but route differently based on payload flags and existing approved assets. Each stage exposes useful state to the frontend, like generating audio, building storyboard, or stitching video, and gives support a precise stage to retry from.
Five routes share these primitives:
- single-scene
- multi-scene
- b-roll
- pose-based
- approval-gated
Contract-first execution
Jobs carry strict payload snapshots: avatar identity, script, output settings, approved assets, storyboard context, retry context.
Stage-level observability
Each stage persists input, output, status, attempts, errors. Audit trail without exposing provider internals.
Branch-aware orchestration
The same output API routes to script generation, multi-scene storyboarding, background regeneration, or approval flows.
Provider portability
Orchestration calls capability-oriented adapters. Model and provider changes happen behind stable interfaces.
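A capability interface can be expressed with `typing.Protocol`, so orchestration code depends on a capability rather than a vendor. `LipSyncProvider`, `FakeLipSync`, and `run_lip_sync_stage` are hypothetical names, and the fake provider stands in for a real integration.

```python
from typing import Protocol


class LipSyncProvider(Protocol):
    """Capability interface: anything that can align audio to video."""

    def sync(self, video_url: str, audio_url: str) -> str:
        ...


class FakeLipSync:
    """Stand-in provider, useful for pipeline tests without network calls."""

    def sync(self, video_url: str, audio_url: str) -> str:
        return f"{video_url}?synced_with={audio_url}"


def run_lip_sync_stage(provider: LipSyncProvider, stage_input: dict) -> dict:
    """Orchestration sees only the capability, never provider request details."""
    synced = provider.sync(stage_input["video_url"], stage_input["audio_url"])
    return {"synced_video_url": synced}
```

Swapping providers then means passing a different object to `run_lip_sync_stage`; the stage code and its persisted input and output shapes stay the same.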
Media timeline handling
Storyboard scenes carry timing, transcript, pose, and visual metadata so downstream assembly can reason about continuity.
Snapshot payloads
Jobs run from captured product state. The service does not need access to the product DB to execute generation.
Long media pipelines need human review and selective regeneration.
Each retry carries context describing where to resume and whether the system should rerun the downstream pipeline or only regenerate a single step. This avoids repeating expensive work and preserves approved assets.
- 01 Full output generation
- 02 Retry from stage
- 03 Script approval
- 04 Background regeneration
- 05 Background approval
- 06 Storyboard approval
- 07 Video asset approval
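The retry context described above can be reduced to two fields: where to resume, and whether to rerun everything downstream. The stage names and their order in `STAGES` are assumptions for this sketch, as is the `RetryContext` shape.

```python
from dataclasses import dataclass

# Hypothetical linear ordering of stages for one route.
STAGES = [
    "script", "audio", "storyboard", "background",
    "video_assets", "lip_sync", "stitch", "enhance",
]


@dataclass(frozen=True)
class RetryContext:
    resume_from: str
    rerun_downstream: bool = True


def stages_to_run(ctx: RetryContext, stages=STAGES) -> list:
    """Return the stages a retry should execute, leaving upstream work untouched."""
    if ctx.resume_from not in stages:
        raise ValueError(f"unknown stage: {ctx.resume_from}")
    if not ctx.rerun_downstream:
        # Regenerate a single step and keep every other asset as approved.
        return [ctx.resume_from]
    start = stages.index(ctx.resume_from)
    return list(stages[start:])
```

With this ordering, `stages_to_run(RetryContext("lip_sync"))` yields `["lip_sync", "stitch", "enhance"]`, while `RetryContext("background", rerun_downstream=False)` yields only `["background"]`.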
Idempotent job creation
Repeated submissions with the same idempotency key map to the same job.
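A minimal sketch of idempotent creation using a unique constraint on the key. It uses in-memory SQLite so it runs standalone; the service itself uses Postgres, where `INSERT ... ON CONFLICT DO NOTHING` plays the role of `INSERT OR IGNORE`. The table and function names are hypothetical.

```python
import json
import sqlite3
import uuid


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id TEXT PRIMARY KEY,
               idempotency_key TEXT NOT NULL UNIQUE,
               snapshot TEXT NOT NULL
           )"""
    )


def create_job(conn: sqlite3.Connection, idempotency_key: str, snapshot: dict) -> str:
    """Insert a job once per key and return the id of whichever row won."""
    candidate_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO jobs (id, idempotency_key, snapshot) VALUES (?, ?, ?)",
            (candidate_id, idempotency_key, json.dumps(snapshot)),
        )
    row = conn.execute(
        "SELECT id FROM jobs WHERE idempotency_key = ?", (idempotency_key,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
init_db(conn)
first = create_job(conn, "key-1", {"script": "hello"})
second = create_job(conn, "key-1", {"script": "hello"})
```

Both calls return the same job id, because the second insert is ignored and the lookup finds the original row.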
Durable step records
Every major stage has persisted input, output, status, and error state.
Structured outbound events
Callbacks stored before delivery, allowing delivery retry and audit.
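Storing callbacks before delivery is essentially an outbox. The sketch below uses in-memory SQLite for self-containment, and the `outbox` table, `enqueue_event`, and `deliver_pending` are hypothetical names; `send` stands in for whatever HTTP client posts to the product backend.

```python
import json
import sqlite3


def init_outbox(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS outbox (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               delivered INTEGER NOT NULL DEFAULT 0,
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )


def enqueue_event(conn: sqlite3.Connection, payload: dict) -> int:
    """Persist the event first; delivery happens separately."""
    with conn:
        cur = conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(payload),)
        )
    return cur.lastrowid


def deliver_pending(conn: sqlite3.Connection, send) -> int:
    """Try every undelivered event once; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for event_id, payload in rows:
        try:
            send(json.loads(payload))
            ok = True
        except Exception:
            ok = False  # the row stays pending for the next pass
        with conn:
            conn.execute(
                "UPDATE outbox SET attempts = attempts + 1, delivered = ? WHERE id = ?",
                (1 if ok else 0, event_id),
            )
        if ok:
            delivered += 1
    return delivered
```

Because the row exists before any network call, a failed delivery leaves an auditable record with an attempt count, and a later pass can retry it.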
Dedicated queues
Avatar jobs separated from marketing, callback, and maintenance work.
Provider abstraction
Provider request details isolated from orchestration code.
Snapshot payloads
Jobs run from captured product state, reducing coupling to product DB.
Failure localization
Failed jobs retain the failed stage and structured error payload.
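Failure localization can be illustrated with a runner that stops at the first failing stage and records where and why it failed. The job dictionary shape and the `run_stages` helper are assumptions for this sketch.

```python
def run_stages(job: dict, stages: list) -> dict:
    """Run (name, callable) stages in order, keeping outputs of finished stages."""
    job.setdefault("outputs", {})
    for name, stage_fn in stages:
        try:
            job["outputs"][name] = stage_fn(job)
        except Exception as exc:
            # Retain the failed stage and a structured error payload.
            job["status"] = "failed"
            job["failed_stage"] = name
            job["error"] = {"type": type(exc).__name__, "message": str(exc)}
            return job
    job["status"] = "completed"
    return job


def failing_lip_sync(job: dict) -> str:
    raise TimeoutError("provider timed out")


result = run_stages(
    {"id": "job-1"},
    [
        ("audio", lambda job: "audio.wav"),
        ("lip_sync", failing_lip_sync),
        ("stitch", lambda job: "final.mp4"),
    ],
)
```

Here `result["failed_stage"]` is `"lip_sync"`, the audio output is preserved, and the stitch stage never runs, so a later retry can resume from the failed stage.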
Private internal APIs and tokenized service calls.
Secrets, provider credentials, storage locations, and account-specific configuration come from environment configuration and are not represented here. User media and generated outputs are referenced through controlled asset URLs in job snapshots and event payloads. The service does not need direct access to product user credentials or the product database to execute the pipeline.
A cleaner architecture for long-running avatar media generation.
AI execution moved behind a dedicated service boundary. Product logic stayed in the core backend, while the Gen AI service gained ownership over orchestration, provider integrations, retries, and pipeline observability. The result is resumable, observable, provider-portable, and suited to human-in-the-loop creative review.