ॐ  ·  IMI Library Project

A Living Archive of
Timeless Teachings

Complete project roadmap — workflow design, tooling decisions, phased implementation, and critical gaps for building a permanent, searchable, public-ready library.

45+ Years of Teachings
~3k Edited Videos (Priority)
12,155 FMP Records (Importable)
6 Pipeline Steps
18 mo Full Timeline

The Core Pipeline

Six sequential steps from recordings to a searchable library entry. Immediate priority: ingest the ~3,000 recently created edited videos into the system. The ~12,000 master recordings in FileMaker/Dropbox are a longer-term effort involving transcoding and renaming at scale.

Single Recording Journey
MP4
Source file
Extract
FFmpeg → MP3
Transcribe
AssemblyAI
Enrich
Claude API
Database
Auto-push
Review
Human QA
Multi-File Processing Timeline
Timeline diagram (0–180 s) with lanes for Extraction, Transcription, Enrichment, and Database. Files 1–3 are extracted to MP3 in parallel, then flow through transcription (AssemblyAI, producing .json) and enrichment (Claude) in overlapping waves, with each file pushed to the database as it completes.
Each step runs independently. Async queues handle speed differences between steps. Arrows show data handoffs between stages.
System Architecture Diagram
Architecture diagram. Cloud services: AssemblyAI (transcription API: diarization, chapters), Anthropic Claude Sonnet 4.6 (enrichment, speaker ID), Supabase (PostgreSQL + RLS: events, media, review status), and Cloudflare Pages (roadmap site, installer download). On the user's machine (localhost:8000), the browser GUI (HTML/CSS/JS: progress, badges, reports, folder select, controls) talks to a FastAPI server (Python/Uvicorn: SSE events, job management, async queue consumers) over REST and SSE. Dropbox is the team sync layer holding source MP4 recordings. Pipeline modules (FFmpeg MP4 → MP3, Transcriber via HTTP + polling, Enricher via Claude + Pydantic, DB Client via Supabase upsert) are linked by queues, and each calls its own cloud service directly. Output files: .mp3, .transcript.json, .enrichment.json. Reference data: glossary, names, topics. A reviewer reads review status from Supabase.
Dashed outlines show cloud vs local boundaries. Dashed lines show cloud API calls. Each pipeline module connects straight up to its cloud service with no crossing arrows.
1
Transcode to MP4 NOT STARTED
Convert all incoming formats (DAT, DVD, AVI, MOV, etc.) to a standardised H.264 MP4 container. Apply canonical file naming at this stage: YYYY-MM-DD_[type]_[location]_[seq].mp4
FFmpeg (batch scripts) HandBrake (GUI fallback)
⚠ Rename every file to the canonical convention at this step — retroactively renaming thousands of records later is painful.
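A minimal sketch of this step, assuming typical archival settings; only the filename pattern comes from the convention above, while the H.264 parameters (CRF 20, medium preset, 192 kbps AAC) and the slug rules are illustrative choices, not decisions:

```python
import re
import subprocess
from pathlib import Path

def canonical_name(date: str, kind: str, location: str, seq: int) -> str:
    """Build the canonical filename: YYYY-MM-DD_[type]_[location]_[seq].mp4"""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        raise ValueError(f"bad date: {date}")
    def slug(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    return f"{date}_{slug(kind)}_{slug(location)}_{seq:02d}.mp4"

def transcode_cmd(src: Path, dst: Path) -> list[str]:
    """FFmpeeg command for a standardised H.264 MP4 (illustrative settings)."""
    return [
        "ffmpeg", "-i", str(src),
        "-c:v", "libx264", "-crf", "20", "-preset", "medium",
        "-c:a", "aac", "-b:a", "192k",
        "-movflags", "+faststart",  # moov atom up front for web playback
        str(dst),
    ]

name = canonical_name("1989-01-10", "Satsang", "Kullu", 1)
print(name)  # → 1989-01-10_satsang_kullu_01.mp4
# subprocess.run(transcode_cmd(Path("raw_capture.avi"), Path(name)), check=True)
```

Renaming at transcode time, as the warning above urges, means every downstream artifact (.mp3, .transcript.json, .enrichment.json) inherits the canonical name for free.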
2
Extract Audio ✓ BUILT
Strip audio track from MP4 to high-quality MP3 for transcription. Browser-based GUI with real-time progress, batch processing, and completion reports. Runs entirely on the user's machine — no cloud uploads.
FFmpeg Python + FastAPI Browser GUI (localhost)
💡 On poor-quality recordings, optionally apply light noise reduction before transcription (for example an RNNoise-based plugin, Audacity's Noise Reduction effect, or iZotope RX); it measurably improves accuracy.
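The per-file logic of this step can be sketched as below: skip-if-exists plus a 320 kbps MP3 command. The exact flags and messages in the built tool may differ; this is a hedged outline of the behavior described above:

```python
import subprocess
from pathlib import Path

def extract_cmd(mp4: Path, mp3: Path) -> list[str]:
    """FFmpeg: drop video (-vn), encode the audio track to 320 kbps MP3."""
    return ["ffmpeg", "-i", str(mp4), "-vn",
            "-c:a", "libmp3lame", "-b:a", "320k", str(mp3)]

def extract_audio(mp4: Path) -> str:
    """One file, idempotently: same name with .mp3 extension, skip if present."""
    mp3 = mp4.with_suffix(".mp3")
    if mp3.exists():
        return f"skipped (output exists): {mp3.name}"
    result = subprocess.run(extract_cmd(mp4, mp3), capture_output=True)
    if result.returncode != 0:
        return f"error (corrupted or no audio track): {mp4.name}"
    return f"extracted: {mp3.name}"

print(extract_cmd(Path("talk.mp4"), Path("talk.mp3"))[0])  # → ffmpeg
```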
System Design & Data Flow
System Architecture

Everything runs locally on the user's machine. The only external touchpoints are the launch page (shyamgyaan.com) and GitHub (source code). No audio or video data leaves the machine.

Cloud (minimal)
shyamgyaan.com
Launch page & OS detection
 + 
GitHub
Source code & CI
↓ downloads launcher & latest code ↓
User's Machine
Launcher
Setup & start
Python Server
localhost
Browser GUI
HTML/CSS/JS
FFmpeg
Audio extraction
MP3 Files
Local disk
Data Flow (Single File)
MP4
Validate
FFmpeg
Extract audio
MP3
Save to target
Report
Log result
Invalid files (corrupted, no audio, output exists) are logged and skipped. Batch continues.
User Journey
Visit Site
OS detected
Download
Launcher
Auto Setup
~1 min first time
Select Folder
Configure
Process
Watch progress
Report
Download .txt
Returning users: run the launcher (~5 sec); it always pulls the latest version.
Requirements (6 issues)
#4 — Audio Extraction GUI
High
Browser-based GUI with folder picker for source directory. Display MP4 file count. Option to output to same or different folder. Transcription checkbox. Start button. Controls disabled during processing.
#5 — FFmpeg MP4 to MP3 Engine
High
Extract audio from each MP4 using FFmpeg. Save as MP3 (320kbps, best quality). Same filename with .mp3 extension. Skip corrupted files, files with no audio, and existing outputs — log each with descriptive message.
#6 — Progress Display (Batch + File)
High
Two progress bars: batch level ("3 of 47") and file level (0% → 100% based on video duration). Show current filename, elapsed time, per-file status. Real-time updates.
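The duration-based file-level percentage can be derived from FFmpeg's `-progress pipe:1` output, which emits key=value lines including `out_time=HH:MM:SS.micro` (the total duration would be read beforehand, e.g. via ffprobe). A minimal parsing sketch:

```python
def hms_to_seconds(hms: str) -> float:
    """Parse FFmpeg's out_time value (HH:MM:SS.micro) into seconds."""
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def file_percent(progress_line: str, duration_s: float) -> float:
    """File-level progress (0-100%) from a single `-progress` key=value line."""
    key, _, value = progress_line.strip().partition("=")
    if key != "out_time":
        raise ValueError("not an out_time line")
    return min(100.0, 100.0 * hms_to_seconds(value) / duration_s)

# A 1-hour talk, 15 minutes into extraction:
print(file_percent("out_time=00:15:00.000000", 3600.0))  # → 25.0
```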
#7 — Completion Report
High
Summary report in GUI: total files, successes, skips, errors. Per-file status with messages. "Download Report" button saves as .txt. Includes timestamp, folder paths, and all details.
#8 — Auto-Transcription Trigger (Stub)
Medium
When transcription checkbox is enabled, trigger job after each successful extraction. Runs in parallel with continued extraction. Stubbed for now — logs "Transcription triggered for [filename]". Status in progress display and report.
#9 — Web Launcher
High
Launch page on shyamgyaan.com detects OS and browser. Downloads platform-specific launcher script. Installs Python + FFmpeg if missing. Pulls latest code. Starts local server and opens browser. Blocks mobile users.
Requirements (6 issues)
#10 — Performance
High
  • 1-hour MP4 extracted in under 60 seconds
  • Max 2GB RAM usage, with monitoring
  • CPU-aware parallel processing (never 100% cores)
  • Disk space check before starting, monitor during
#11 — Reliability
High
  • Crash recovery via checkpoint file — resume from where it left off
  • Idempotent — skip files that already have MP3 output
  • Works fully offline (extraction only)
  • Single file failure never crashes the batch
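The checkpoint-file recovery above could be as simple as a JSON list of completed names, rewritten after each file; the checkpoint filename here is an assumption:

```python
import json
import tempfile
from pathlib import Path

def load_done(checkpoint: Path) -> set:
    """Names completed in a previous (possibly crashed) run."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text())["done"])
    return set()

def mark_done(checkpoint: Path, done: set, name: str) -> None:
    """Rewrite the checkpoint after every file, so a crash loses at most one."""
    done.add(name)
    checkpoint.write_text(json.dumps({"done": sorted(done)}))

with tempfile.TemporaryDirectory() as tmp:
    cp = Path(tmp) / "extraction_checkpoint.json"
    done = load_done(cp)
    for name in ["a.mp4", "b.mp4", "a.mp4"]:  # second "a.mp4" is skipped
        if name in done:
            continue
        mark_done(cp, done, name)
    resumed = load_done(cp)  # what a restarted run would pick up

print(sorted(resumed))  # → ['a.mp4', 'b.mp4']
```

Together with the skip-if-MP3-exists rule, this gives both idempotency and resume-after-crash with no database or lock files.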
#12 — Usability
High
  • Plain English error messages — no stack traces
  • First-time setup under 2 minutes
  • Keyboard navigable, screen reader labels, WCAG AA contrast
#13 — Security & Privacy
High
  • No data leaves the machine during extraction
  • No telemetry without explicit consent
  • Launcher script is human-readable and auditable
  • App only accesses user-selected folders
#14 — Compatibility
High
  • macOS 12+, Windows 10/11
  • Chrome, Firefox, Safari (latest 2 versions)
  • Files up to 10GB, Unicode filenames, up to 100 files per folder
  • Mobile explicitly blocked at launch page
#15 — Logging & Diagnostics
Medium
  • Timestamped log file per run in target folder
  • GUI shows plain English; log captures full technical detail
  • "Share Log" button for easy troubleshooting
3
Transcribe with Diarization ✓ BUILT
Submit audio to transcription API. Must produce: timestamped transcript, speaker identification (diarization), auto-chapters, topic detection. Submit custom Sanskrit/spiritual vocabulary glossary to every service used.
AssemblyAI ✦ Recommended Whisper + Pyannote (local / private) Deepgram
⚠ Sanskrit terms, Hindi-English mixing, and yogic vocabulary will degrade accuracy on all tools. A custom vocabulary glossary + mandatory human review pass is non-negotiable.
System Design & Data Flow
System Architecture

Audio files are uploaded to AssemblyAI's cloud API for transcription. Direct HTTP calls handle upload, polling, and result retrieval (SDK bypassed for reliability). Supports parallel processing. Output JSON is saved locally alongside the source MP3.

Cloud (AssemblyAI)
AssemblyAI API
Transcription + diarization
 + 
Language Detection
Auto Hindi/English/Sanskrit
↑ upload MP3    ↓ return transcript ↓
User's Machine
CLI / GUI
Select files
Glossary
Sanskrit terms
HTTP Client
Upload + poll
.transcript.json
Local disk
Data Flow (Single File)
MP3
Validate
Upload
AssemblyAI
Transcribe
Poll until done
.tmp
Atomic write
.json
Rename
Atomic writes prevent corrupt output. Already-transcribed files are skipped (idempotent). Fatal API errors halt the batch.
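The atomic-write pattern described above, sketched with stdlib calls (os.replace is atomic when source and target sit on the same filesystem, which holds for a .tmp sibling):

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(target: Path, data: dict) -> None:
    """Write to a .tmp sibling, then rename, so a crash can never leave a
    half-written .transcript.json behind."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(data, ensure_ascii=False, indent=2))
    os.replace(tmp, target)

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "talk.transcript.json"
    if not out.exists():  # idempotency: already-transcribed files are skipped
        atomic_write_json(out, {"meta": {"id": "demo"}, "utterances": []})
    ok = out.exists() and not (Path(d) / "talk.transcript.json.tmp").exists()

print(ok)  # → True
```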
Module Structure
models.py
Status, Result
client.py
HTTP client
transcriber.py
Orchestrator
 + 
vocabulary.py
Glossary loader
Follows the extract_audio module pattern: models → wrapper → orchestrator.
Output JSON Structure

Each .transcript.json contains these sections (~3-5 MB for a 2-hour recording):

meta
ID, duration, confidence
utterances
Speaker A/B/C
chapters
Summaries + timestamps
topics
IAB categories
entities
People, places
Requirements (19 items, 3 issues)
#36 — Core Module: Models, Client & Vocabulary
High
  • Sanskrit glossary loaded from text file and submitted as word_boost with boost_param="high" on every API call
  • Glossary toggleable via --no-glossary CLI flag (default: enabled)
  • Missing glossary when enabled = clear error, refuse to start
  • Glossary validated: < 10 MB, < 1000 terms, each term < 100 chars, valid text
  • Automatic language detection for mixed Hindi/English/Sanskrit
  • API key validated at call time with clear error on missing
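The glossary limits listed above can be enforced by a small pure function before the terms are ever submitted as word_boost; the error wording here is illustrative:

```python
MAX_TERMS = 1000
MAX_TERM_CHARS = 100
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB

def validate_glossary(raw: str) -> list:
    """Validate the Sanskrit glossary text; return the cleaned term list
    or raise with a clear message (refuse to start, per #36)."""
    if len(raw.encode("utf-8")) >= MAX_FILE_BYTES:
        raise ValueError("glossary file must be under 10 MB")
    terms = [t.strip() for t in raw.splitlines() if t.strip()]
    if not terms:
        raise ValueError("glossary enabled but no terms found; refusing to start")
    if len(terms) >= MAX_TERMS:
        raise ValueError(f"too many terms ({len(terms)}); limit is {MAX_TERMS}")
    for t in terms:
        if len(t) >= MAX_TERM_CHARS:
            raise ValueError(f"term too long: {t[:30]}")
    return terms

print(validate_glossary("satsang\nguru\nmukti\n"))  # → ['satsang', 'guru', 'mukti']
```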
#37 — Transcriber Orchestrator + CLI
High
  • Single MP3 transcribed via CLI, producing .transcript.json alongside source
  • Speaker diarization: output contains speaker-labeled utterances (A/B/C)
  • Auto chapters: timestamped chapter summaries
  • Topic detection: IAB category taxonomy results
  • Entity detection: people, places, concepts
  • Batch mode: process all MP3s in a directory with per-file progress
  • Idempotent: skip already-transcribed files, overridable with --force
  • --dry-run flag: show what would be transcribed without calling API
  • Batch halts on fatal errors (auth, billing), continues on per-file transient errors
  • Atomic writes: JSON written to .tmp, renamed on completion
#38 — Queue-Based Auto-Transcribe
High
  • Extraction queues completed MP3s for transcription (non-blocking producer-consumer)
  • Events emitted: transcription_queued, transcription_completed, transcription_failed
  • Queue consumer uses run_in_executor to avoid blocking the event loop
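The non-blocking producer-consumer handoff with run_in_executor might look like this self-contained sketch, with a stub standing in for the blocking upload-and-poll call:

```python
import asyncio

def transcribe_blocking(path: str) -> str:
    """Stub for the blocking HTTP upload-and-poll call to AssemblyAI."""
    return path.removesuffix(".mp3") + ".transcript.json"

async def consumer(queue: asyncio.Queue, results: list) -> None:
    loop = asyncio.get_running_loop()
    while True:
        path = await queue.get()
        if path is None:  # sentinel: extraction finished, stop consuming
            break
        # run_in_executor keeps the blocking call off the event loop
        results.append(await loop.run_in_executor(None, transcribe_blocking, path))

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(consumer(queue, results))
    for mp3 in ["a.mp3", "b.mp3"]:  # producer: each completed extraction
        await queue.put(mp3)
    await queue.put(None)
    await task
    return results

print(asyncio.run(main()))  # → ['a.transcript.json', 'b.transcript.json']
```

Extraction (the producer) never waits on transcription; the event loop stays free to stream SSE progress events while the executor thread does the slow network work.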
Requirements (8 items)
Code Quality
High
  • 80%+ test coverage on new code (48 test cases specified)
  • All existing tests pass, BDD feature files updated
  • black + ruff clean (formatting and linting)
Security & Privacy
High
  • No secrets logged or exposed
  • API key validated at call time, not import time
  • File paths validated (resolved, under allowed root, .mp3 only)
Performance
Medium
  • Queue consumer uses run_in_executor: must not block event loop
  • JSON output ~3-5 MB for 2-hour recordings: downstream consumers aware
  • Parallel batch processing with configurable concurrency (default: 3 files)
4
AI Post-Processing & Enrichment ✓ BUILT
Pass transcript text through an LLM to generate structured metadata: segment summaries, controlled-vocabulary topic tags, key quotes, estimated quality score, and speaker identification confirmation.
Claude API GPT-4o (fallback)
💡 Use a structured JSON prompt so the LLM returns consistently parseable output that flows directly into the database ingest step.
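For illustration, here is a stdlib version of the validate-the-structured-reply step. The real pipeline uses Pydantic; the field names and the three-topic vocabulary below are assumptions standing in for the actual schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Enrichment:
    topic_tags: list          # must come from the controlled vocabulary
    quality_score: int        # 1-5 rating
    key_quotes: list = field(default_factory=list)

CONTROLLED_TOPICS = {"meditation", "devotion", "service"}  # illustrative subset

def parse_enrichment(llm_text: str) -> Enrichment:
    """Parse the model's JSON reply and enforce the schema constraints
    that the built pipeline delegates to Pydantic."""
    data = json.loads(llm_text)
    e = Enrichment(**data)
    if not 1 <= e.quality_score <= 5:
        raise ValueError("quality_score out of range")
    bad = [t for t in e.topic_tags if t not in CONTROLLED_TOPICS]
    if bad:
        raise ValueError(f"tags outside controlled vocabulary: {bad}")
    return e

reply = '{"topic_tags": ["meditation"], "quality_score": 4, "key_quotes": []}'
print(parse_enrichment(reply).quality_score)  # → 4
```

Validating at the parse boundary means a malformed or off-vocabulary reply fails loudly here, instead of leaking bad tags into the database ingest step.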
System Design & Data Flow
System Architecture

Reads .transcript.json from step 03, sends transcript content to Claude API (Sonnet 4.6) with structured JSON prompts, and saves .enrichment.json alongside the transcript. Uses Pydantic for guaranteed schema compliance.

Cloud (Anthropic API)
Claude Sonnet 4.6
Structured JSON output
 + 
Prompt Caching
90% input savings
↑ transcript + names roster    ↓ enrichment JSON ↓
User's Machine
CLI / GUI
Select files
Classify
Talk vs music
Claude API
Enrich + identify
.enrichment.json
Local disk
Data Flow (Single File)
.transcript.json
Read input
Classify
Talk / music / empty
Claude API
Pydantic validated
.tmp
Atomic write
.enrichment.json
Rename
Music/chanting recordings auto-classified and skip the Claude API call. Already-enriched files are skipped (idempotent).
Output Structure (.enrichment.json)
Refined Chapters
Improved summaries
Topic Tags
Controlled vocabulary
Key Quotes
With timestamps
Speaker Map
Names + confidence
Quality Score
1-5 rating
Cost Estimate

~$0.08/recording (Sonnet 4.6: $3/$15 per 1M tokens); ~$240 for 3,000 recordings. Combined with transcription (~$810), steps 03 + 04 total ~$1,050. Prompt caching cuts the enrichment cost to ~$150, and the deferred Batch API would cut it to ~$75.
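The per-recording arithmetic is easy to sanity-check; the token counts below (20k input, 1.3k output for one 2-hour transcript sent as utterances + chapters) are assumed values chosen to reproduce the ~$0.08 figure, not measured numbers:

```python
IN_PRICE, OUT_PRICE = 3.00, 15.00  # USD per 1M tokens (Sonnet pricing above)

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * IN_PRICE + output_tokens / 1e6 * OUT_PRICE

# Assumed token counts for one 2-hour transcript:
per_recording = cost_usd(20_000, 1_300)
print(f"${per_recording:.2f} per recording, ${per_recording * 3000:,.0f} for 3,000")
```

Note that input tokens dominate the cost, which is why prompt caching and trimming the words array give the largest savings.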

Requirements (12 items, 2 issues)
#51 — Core Module: Models, Client, Names, Topics
High
  • Claude API wrapper (Sonnet 4.6) with Pydantic-validated structured JSON output
  • Ashram names roster loader (1,066 names from All_Names.xlsx)
  • Controlled topic vocabulary loader (~30 predefined topics)
  • Prompt templates loaded from files (system.txt, user.txt)
  • JSON schema for .enrichment.json output
  • Auto-classify recordings: talk, music/chanting, ceremony, mixed, empty
  • ANTHROPIC_API_KEY validated at call time
#52 — Enricher Orchestrator + CLI
High
  • Read .transcript.json, produce .enrichment.json alongside it
  • Refined chapter summaries (headline + summary rewritten, timestamps preserved)
  • Topic tags from controlled vocabulary only
  • 3-5 key quotes with speaker and matched utterance timestamps
  • Quality score 1-5 with notes
  • Speaker name mapping: A/B/C → real names with confidence (high/medium/low/none) and evidence
  • Music/chanting recordings get minimal enrichment, skip Claude API
  • Batch + single file with 3-attempt retry
  • CLI: --file, --dir, --force, --dry-run
  • Idempotent: skip existing unless --force
  • Empty transcript and Claude refusal handling
  • Cost tracking (tokens used, cost estimate) per file
Requirements (6 items)
Code Quality
High
  • 80%+ test coverage on new code (39 test cases + 3 prompt evals)
  • All existing tests pass
  • black + ruff clean (formatting and linting)
Security & Privacy
High
  • No secrets logged or exposed
  • ANTHROPIC_API_KEY validated at call time, not import time
  • Transcript content sent to Anthropic cloud — team aware of data flow
Performance & Cost
Medium
  • Prompt caching enabled for system prompt (90% input savings)
  • Send utterances + chapters to Claude, NOT the words array (5-10x token savings)
  • Cost tracking per file (tokens_used, cost_estimate_usd in meta)
  • Sequential batch for v1; Batch API (50% off) deferred
5
Human Review Queue ✓ BUILT
A team member reviews AI-generated transcript, summaries, and tags before the record is marked "published." Corrects Sanskrit errors, adds contextual metadata (setting, location, series, tags), and assigns speaker names via autocomplete from the ashram names roster.
Custom Review UI (Cloudflare Pages) Supabase (database)
Shared web app at library.shyamgyaan.com/review — password-protected, no user accounts needed. Static HTML/CSS/JS talks directly to Supabase.
System Design & Data Flow
System Architecture

Static HTML/CSS/JS hosted on Cloudflare Pages. No server — the browser talks directly to Supabase using the JavaScript client loaded via CDN. Password gate is server-side via Cloudflare Pages Functions middleware (SHA-256 hashed, 30-day session cookies).

Cloud
Cloudflare Pages
Static hosting + CDN
 + 
Supabase
PostgreSQL + RLS
↓ serves review UI    ↑↓ reads/writes recordings ↓
Reviewer's Browser
Password Gate
Shared password
Review Form
Edit all fields
Approve & Save
draft → published
Review Workflow
Pipeline
Pushes draft records
Review Queue
Forward/back nav
Published
Library tab
Requirements
Data Model — Setting + Tags
Core
Setting (single select): Satsang, Tea, Trip, Interview, Film. Tags (multi-select checkboxes): Music, Message, Revelation, Esoteric, Celebration. Celebration triggers sub-dropdown. Interview/Film disables Series, Esoteric, Revelation, Message.
Location Autocomplete
Core
223 deduplicated locations (306 raw FMP values normalized via location-mapping.csv). Matches anywhere in string (not just start). Mapping file tracks raw → canonical for FMP import.
Speaker Name Autocomplete
Core
1,066 names from ashram roster. Speaker map shows label → name with confidence. Names editable in both speaker section and utterance table rows.
AI Title Suggestion
Enhancement
Title auto-populated from filename (date prefix stripped). If filename is generic (e.g., "Gyaan 1"), AI-generated headline shown as suggestion with Accept/Revert toggle.
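The strip-date-prefix and generic-title checks might be sketched as below; the "generic" patterns are assumptions extrapolated from the one example ("Gyaan 1") given above:

```python
import re

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}[_ -]*")
GENERIC = re.compile(r"^(gyaan|talk|recording)\s*\d*$", re.IGNORECASE)  # assumed

def title_from_filename(stem: str) -> tuple:
    """Strip the canonical date prefix; flag generic titles that should
    fall back to the AI-generated headline suggestion."""
    title = DATE_PREFIX.sub("", stem).replace("_", " ").strip()
    return title, bool(GENERIC.fullmatch(title))

print(title_from_filename("1989-01-10_Gyaan 1"))     # → ('Gyaan 1', True)
print(title_from_filename("1989-01-10_On Silence"))  # → ('On Silence', False)
```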
Quality Assessment
Enhancement
Audio Quality and Video Quality (Good/Fair/Poor) — human judgment. AI Score (1-5) and AI Quality Notes — system generated. All in one section.
Three-Tab Interface (Queue, Review, Library)
Core
Queue (filterable list of draft recordings with search and pagination), Review (view/edit one recording with forward/back navigation), Library (browse published recordings with advanced search, sort, and filter).
Pipeline Protection
Core
Re-running the pipeline on published recordings does NOT overwrite human edits. push_recording() checks status before upsert — skips if published.
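A sketch of that status check, with a plain dict standing in for the Supabase table and the "1989-01-10 af" event key borrowed from the schema section:

```python
def push_recording(db: dict, key: str, record: dict) -> str:
    """Upsert a pipeline record, but never overwrite human-reviewed rows.
    `db` stands in for the Supabase table (assumed shape: event_key → row)."""
    existing = db.get(key)
    if existing and existing.get("status") == "published":
        return "skipped (published; human edits preserved)"
    db[key] = {**record, "status": record.get("status", "draft")}
    return "upserted (draft)"

db = {"1989-01-10 af": {"title": "Edited title", "status": "published"}}
print(push_recording(db, "1989-01-10 af", {"title": "Pipeline title"}))
print(push_recording(db, "1990-02-02 mo", {"title": "New talk"}))
```

The guard lives in the push function itself, so every caller (fresh runs, re-runs, backfills) gets the protection without remembering to check.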
Requirements
Usability
Non-technical team members can start reviewing without training. Target: under 3 minutes per recording for light review + metadata entry. Desktop-first, tablet-friendly.
Security
Password-protected via Cloudflare Pages Functions middleware (server-side SHA-256, 30-day session cookies). Supabase anon key with RLS policies. No user accounts needed for a small trusted team. Data is non-sensitive recording metadata.
Deployment
Static files on Cloudflare Pages. Auto-deploy from git push to main. No server to maintain. Cache-busting version strings on CSS/JS.
6
Library Database IN PROGRESS
Approved records live in Supabase PostgreSQL with full metadata across two tables: events (immutable facts — date, location, setting, description) and media (recording metadata — tape counts, duration, quality). An event_media_view joins them for the Review UI. Includes FMP legacy data migration (12,155 records) with field mapping, location dedup, and tag normalization.
Supabase PostgreSQL FMP Import Script Typesense (search, future)
Schema: events + media tables (split from original single recordings table). Pipeline re-runs protected — published records never overwritten. FMP import script handles 26-column CSV with location mapping (306→223 values) and tag normalization (489→465).
Database Design
Data Flow — Two Ingest Paths
PATH 1: New recordings (pipeline)
Pipeline
Extract → Transcribe → Enrich
Supabase
events + media (draft)
Review UI
Human QA
Published
Library tab
PATH 2: Legacy data (FMP import)
FMP CSV
12,155 records
import_fmp.py
Map • Normalize
Supabase
events + media
Review UI
Browse • Search
Both paths write to the same events + media tables. source field tracks origin. Future: Typesense search index.
Events + Media Tables (split schema)
Events Table (immutable facts)
id UUID (PK)
event_key TEXT UNIQUE
event_date DATE
time_of_day TEXT
sequence INTEGER
setting TEXT
location TEXT
description TEXT
series TEXT
topic_tags TEXT[]
notes TEXT
source TEXT
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
Media Table (recording metadata)
id UUID (PK)
event_id UUID (FK → events)
tape_count INTEGER
tape_numbers TEXT[]
duration_seconds FLOAT
quality TEXT
recorded_by TEXT
cataloged_by TEXT
cataloged_at TEXT
related_tape_count INTEGER
related_tape_numbers TEXT[]
notes TEXT
source TEXT
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
event_media_view (joined view)
Joins events + media with COALESCE for overlapping fields. Used as a drop-in replacement for legacy recordings queries in the Review UI. Provides a flat view of all event + media data for browsing, filtering, and export.
Controlled Vocabularies
Setting: Satsang, Tea, Trip, Interview, Film
Tags: Music, Message, Revelation, Esoteric, Celebration
Celebrations: Christmas, Holi, New Years, Shiv Ratri, Krishn Janmashtami, Guru Poornima, Swami's Birthday, Valentine's Day
Series: DKS — Divine Knowledge Series (expandable)
Locations: 223 canonical values after FMP dedup (Kullu, On the Land, Mukt Vyom, Naggar, Manali, Montreal, ...)
Speakers: 1,066 names from ashram roster
Indexes
status, recording_type, created_at, category — B-tree
topic_tags, enrichment, speaker_map — GIN (JSONB/array search)
FileMaker Pro Legacy Import
Migration Pipeline
FMP CSV
12,155 rows × 26 cols
Parse
Field mapping
Normalize
Locations • Tags
Supabase
events + media
Field Mapping (26 FMP columns)
Imported → Events
Recording type → setting
Description → description
Tags/topics → topic_tags
Location → location
Notes → notes
Recording date → event_date
Time of day → time_of_day
Sequence → sequence
Date+time composite → event_key
Series/event name → series
Imported → Media
Tape count → tape_count
Tape number(s) → tape_numbers
Duration (min) → duration_seconds (×60)
Quality → quality
Recorded by → recorded_by
Reviewed date/by → cataloged_by/at
Related tape refs → related_tape_numbers
Skipped (6 cols)
Tag count (derived), Entry date (system default), Flag (11 rows), 3 nearly-empty unknowns
Data Normalization
Locations
306 raw values deduplicated to 223 canonical locations via location-mapping.csv. Matches anywhere in string.
Tags
489 raw tag values normalized to 465 via tag-mapping.csv. Semicolon-separated multi-value parsing.
Event Keys
Natural key from date + time of day + sequence (e.g. 1989-01-10 af). Unique per event.
Import Script
scripts/import_fmp.py — CLI with --dry-run, --limit N, and --wipe flags. Batch upserts (500 rows/batch) into events + media tables. 33 unit tests covering event keys, tags, locations, date parsing, and tape number cleaning.
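The 500-rows-per-batch upsert reduces to a simple slicer; for 12,155 records it yields 25 batches, the last holding 155 rows:

```python
from typing import Iterator

def batches(rows: list, size: int = 500) -> Iterator[list]:
    """Yield upsert payloads of at most `size` rows (500/batch in import_fmp.py)."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = list(range(12_155))              # one placeholder per FMP record
sizes = [len(b) for b in batches(rows)]
print(len(sizes), sizes[0], sizes[-1])  # → 25 500 155
```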
Requirements
Capacity
12,155 FMP records + ~3,000 pipeline-processed recordings on Supabase. Events + media split keeps table sizes manageable. Upgrade to Pro ($25/mo) if storage exceeds free tier limit.
Data Integrity
RLS policies on events and media tables. Event key is unique natural key (upsert on conflict). updated_at auto-trigger on every row change. Pipeline and FMP imports both write with source field to track data origin.
Migration Status
Import script built and tested (33 tests). Field mapping, location mapping (306→223), and tag mapping (489→465) files ready. Schema supports both pipeline-processed and FMP-imported records via the same events + media tables.

Phased Roadmap

Three phases over 18 months — from foundational infrastructure to internal tools at scale. A potential public access phase is outlined below but is subject to a future team decision and is not assumed.

Phase 1 — Foundation
Build the Base
0 – 3 months
  • Master metadata schema defined (events + media tables, controlled vocabularies, 223 locations, 465 tags, 1,066 speakers)
  • Canonical file naming established: YYYY-MM-DD_[type]_[location]_[seq].mp4
  • Priority: Ingest ~3,000 recently created edited videos into the system first
  • FFmpeg audio extraction with browser GUI, batch processing, and completion reports
  • AssemblyAI integration with Sanskrit/spiritual vocabulary glossary, parallel processing, direct HTTP client
  • Pipeline built: extraction → transcription (AssemblyAI) → enrichment (Claude API) → Supabase
  • Custom review UI deployed at library.shyamgyaan.com/review (Cloudflare Pages + Supabase)
  • Quality tiers defined (Good/Fair/Poor) in Review UI; bulk triage before processing still needed
  • FMP legacy import: 12,155 records mapped, normalized, and importable via scripts/import_fmp.py (events + media schema, 33 tests)
  • Later: Transcode and rename 12k master recordings at scale; link media files to imported FMP events
Phase 2 — Scale
Process the Backlog
3 – 9 months
  • Run full backlog of recordings through the automated pipeline
  • Migrate master files from Dropbox to object storage (Backblaze B2 or AWS S3)
  • Add Typesense for full-text transcript search across all records
  • Build internal search and browse interface for the team
  • Implement master → edit parent-child schema; begin cataloguing edits
  • Add public/private/restricted access flags to all records
  • Ongoing human review staffing — target 200+ records/month reviewed
Phase 3 — Public Access To Be Determined
Open the Doors?
9 – 18 months
⚠ This phase is exploratory only. Whether the archive becomes publicly accessible is a future decision that requires explicit approval from project stakeholders. Nothing below is assumed or committed — it is included here only to illustrate what a public phase could look like if the team decides to pursue it.
  • ? Launch public library on Vimeo OTT or Uscreen (fastest path to market)
  • ? Implement user accounts, access tiers (free / subscription / pay-per-view)
  • ? Sync metadata from internal DB to public platform
  • ? Expose public search across transcripts and topics
  • ? Review all records for public/private designation before launch
  • ? Evaluate custom platform build: Supabase + Mux + Stripe (based on usage/revenue)
  • ? Migrate to custom platform if/when needed for full control

Key Decisions

The major architectural and tooling choices — with recommendations and trade-offs clearly laid out.

🗄️
Primary Database Where records live; team collaboration; API access for automation
FileMaker Pro (current)
Expensive licensing, poor API ecosystem, limited web integration. Not suitable going forward.
Airtable Ruled Out
Skipped — went straight to Supabase. No point building an integration we'd replace.
Supabase (PostgreSQL) Selected ✦
In production. Events + media tables (split schema), RLS policies, auto-push from pipeline, FMP import script, custom review UI talks directly to it. Free tier handles 15,000+ records.
Omeka
Purpose-built for digital archives. Open source. Less modern UI; worth evaluating if developer resources are limited long-term.
🎙️
Transcription Service Accuracy, diarization, topic detection, Sanskrit vocabulary support
AssemblyAI Recommended ✦
Transcription + speaker diarization + auto-chapters + topic detection in one API. Best for archival batch work.
Whisper + Pyannote
Open source, runs locally (privacy+cost advantage). Best accuracy, but requires more setup; no built-in summaries.
Deepgram
Very fast, good accuracy. Better suited for real-time use cases; AssemblyAI edges it for archival batch work.
🔍
Full-Text Search Layer Searching across thousands of transcripts by keyword, topic, quote
Typesense Recommended ✦
Open source, self-hostable, fast. Excellent for transcript search. Lower cost than Algolia at scale.
Algolia
Best-in-class search UX, generous free tier. Easier to get started, but gets expensive as the corpus grows.
Postgres Full-Text (Supabase)
No separate service needed once on Supabase. Good for moderate scale; doesn't match dedicated search services.
🎬
Public Streaming & Commerce User access, subscriptions, pay-per-view, video delivery
Vimeo OTT / Uscreen Launch ✦
Purpose-built for streaming libraries with commerce. Fastest path to public access. Use while learning user needs.
Supabase + Mux + Stripe
Full custom control. 3–6 month build. Migrate to this once revenue and user base justify the investment.
⚙️
Workflow Automation Orchestrator Connecting transcription → LLM → DB without heavy custom code
n8n Recommended ✦
Open source, self-hostable, visual workflow builder. Connects Dropbox, AssemblyAI, Claude, Supabase without heavy coding. Maintainable by non-developers. Note: Pipeline currently runs as a Python FastAPI app on localhost — n8n integration deferred.
Make (formerly Integromat)
Cloud-hosted visual automation. Easier setup than n8n, but recurring cost and less control.
Custom Python Scripts
Maximum flexibility. Best if you have a dedicated developer. Highest long-term maintenance burden.
☁️
Master File Storage Where the actual MP4, audio, and transcript files live at scale
Dropbox (current)
Not designed as a media archive at this scale. Expensive per GB, limited API for direct processing pipelines.
Backblaze B2 Recommended ✦
Very low cost per GB (~$6/TB/mo), S3-compatible API, reliable. Best price/performance for large video archives.
AWS S3
Industry standard, deep integrations. More expensive than B2 but integrates seamlessly with the broader AWS ecosystem.
Cloudflare R2
No egress fees — excellent if you're serving video directly. Worth considering for the public streaming phase.

Full Tech Stack

Every tool in the recommended stack, organized by function. Short-term choices prioritize speed; long-term choices prioritize scale and control.

File Processing
FFmpeg
Transcoding, audio extraction, batch scripts
HandBrake
GUI fallback for transcoding
iZotope RX / Audacity
Audio cleanup before transcription
Transcription & NLP
AssemblyAI
Primary: transcript + diarization + topics
Whisper (OpenAI)
Local/private processing option
Pyannote
Speaker diarization (if using Whisper)
AI Enrichment
Claude API
Summaries, tagging, structured metadata
Custom vocabulary glossary
Sanskrit / yogic terms submitted to all APIs
Automation
n8n
Visual workflow orchestration
Make
Cloud automation fallback
Dropbox API
Trigger pipeline on new file uploads
Database (Short-term)
Supabase (PostgreSQL)
Main database — events + media tables, event_media_view, RLS policies, auto-push from pipeline + FMP import
Cloudflare Pages
Static hosting for review UI + strategy site — password-protected, CDN edge delivery
Database (Long-term)
Supabase
PostgreSQL + auth + real-time + REST API
Typesense
Full-text transcript search layer
File Storage
Backblaze B2
Master file storage (video, audio, transcripts)
Cloudflare R2
Public delivery (no egress fees)
Dropbox
Team sync layer (not primary archive)
Public Platform
Vimeo OTT / Uscreen
Launch: streaming + commerce + user accounts
Mux
Custom build: adaptive video streaming
Stripe
Custom build: subscriptions + payments

Critical Gaps

Things not explicitly covered in the original workflow that must be addressed before scaling. These will cause expensive problems if ignored.

No Master Metadata Schema Defined
Resolved. Events + media schema defined with controlled vocabularies: Setting (5 values), Tags (5 checkboxes), Celebrations (8 values), Locations (223 deduplicated from FMP), Series (expandable), Speakers (1,066 names). Import script and Review UI both enforce these vocabularies.
File Naming Convention Not Established
Resolved. Canonical format established: YYYY-MM-DD_[type]_[location]_[seq].mp4. Enforced in pipeline code — files that don't match are rejected.
3
Master → Edit Relationship Not Modelled
The database schema must explicitly represent the parent-child relationship between a master recording and its edited versions, with edit metadata (who edited, when, what changed, why). Edits are a "living" part of the system and need a dedicated schema from day one.
4
No Quality Triage Step
Some recordings are degraded — poor audio, background noise, partially inaudible. There is no step in the current workflow to assess recording quality before committing transcription resources. A triage pass (even a simple 3-tier quality flag) should happen early.
5
Public vs. Private Access Not Addressed
Some recordings may contain personal conversations or content not intended for public release. A public/private/restricted access field must be added to the schema now — retroactively reviewing thousands of records before a public launch will be a major bottleneck if not done incrementally.
Sanskrit & Mixed-Language Vocabulary Gap
Resolved. Custom Sanskrit/spiritual vocabulary glossary built and submitted as word_boost on every AssemblyAI API call. Stored in pipeline/transcribe/vocabulary/. Automatic language detection handles Hindi-English mixing.
7
No Staffing Plan for Human Review
At ~10 minutes of review per recording, processing 2,000 recordings means 333 hours of human review work. This is not a one-time task — new edits will always be added. Human review must be treated as a staffed, ongoing role with a defined capacity and throughput target.

Pitfalls to Avoid

Hard-earned lessons from similar archival projects. These are the most common ways this kind of work goes wrong.

Process Risk
Trying to automate before understanding the problem
Run 50 representative recordings through the full workflow manually first. You'll discover the Sanskrit vocabulary problem, audio quality edge cases, and metadata inconsistencies when they're cheap to fix — not after building an automated system around incorrect assumptions.
Team Risk
One-person bottleneck on database entry
If a single person controls adding or approving records, the entire pipeline stalls when they're unavailable. Build multi-user workflows from day one with clear roles and access levels.
Technical Risk
Dropbox as the permanent master archive
Dropbox is a sync tool, not an archive. It's expensive per GB at video scale, has no meaningful API for batch processing pipelines, and is a single point of failure. Migrate masters to object storage (B2 or S3) before scaling.
Quality Risk
Skipping human review to move faster
AI summaries of a spiritual teacher's talks will contain meaningful errors — misheard Sanskrit, wrong speaker attribution, summaries that miss the point. Publishing without review risks putting incorrect or misleading content in front of students and seekers.
Architecture Risk
Building a fully custom platform from day one
A custom database, custom UI, custom auth, and custom commerce layer is a 6+ month project with ongoing maintenance. Use best-of-breed services (Supabase, AssemblyAI, Cloudflare Pages) and connect them. Build custom only when you've outgrown the off-the-shelf tools. Update: the custom review UI was built in one session using static HTML/JS + Supabase — no server, no framework, minimal maintenance.
Data Risk
Inconsistent metadata across the backlog
Without a defined schema and controlled vocabularies, team members will use different tags, location names, and descriptions for the same concepts. This makes search and filtering unreliable. Update: Schema defined with controlled vocabularies — Setting (5 values), Tags (5 checkboxes), Celebrations (8 values), Locations (223 deduplicated from FMP), Series (expandable dropdown), Quality tiers (Good/Fair/Poor). Speaker names autocomplete from 1,066-name roster.
Legal / Privacy Risk
No access control framework before going public
Not all recordings are appropriate for public release. Personal conversations, informal talks, or sensitive content mixed into the archive could cause harm if published without review. Implement a per-record access flag early and review before any public launch.
Cost Risk
Underestimating transcription costs at scale
Several thousand hours of audio through a paid transcription API adds up quickly. Run a cost calculation against your full archive before committing to a service. Consider Whisper (local/free) for the bulk backlog and paid APIs for ongoing new recordings only.