ॐ  ·  IMI Library Project

A Living Archive of
Timeless Teachings

Complete project roadmap — workflow design, tooling decisions, phased implementation, and critical gaps for building a permanent, searchable, public-ready library.

45+ Years of Teachings
~3k Edited Videos (Priority)
12,155 FMP Records (Importable)
6 Pipeline Steps
18 mo Full Timeline

The Core Pipeline

Six sequential steps from recordings to a searchable library entry. Immediate priority: ingest the ~3,000 recently created edited videos into the system. The ~12,000 master recordings in FileMaker/Dropbox are a longer-term effort involving transcoding and renaming at scale.

Single Recording Journey
MP4
Source file
Extract
FFmpeg → MP3
Transcribe
AssemblyAI
Enrich
Claude API
Database
Auto-push
Review
Human QA
Multi-File Processing Timeline
Timeline diagram (0–180 s) with lanes for Extraction, Transcription, Enrichment, and Database. Files 1–3 are extracted to MP3 in parallel, then flow through transcription (AssemblyAI, producing .json) and enrichment (Claude) in overlapping waves, with each file pushed to the database as it completes.
Each step runs independently. Async queues handle speed differences between steps. Arrows show data handoffs between stages.
System Architecture Diagram
Architecture diagram. Cloud services: AssemblyAI (transcription API: diarization, chapters), Anthropic Claude Sonnet 4.6 (enrichment, speaker ID), Supabase (PostgreSQL + RLS: events, media, review status), and Cloudflare Pages (roadmap site, installer download). On the user's machine (localhost:8000), the browser GUI (HTML/CSS/JS: progress, badges, reports, folder select, controls) talks to a FastAPI server (Python/Uvicorn: SSE events, job management, async queue consumers) over REST and SSE. Dropbox is the team sync layer holding source MP4 recordings. Pipeline modules (FFmpeg MP4 → MP3, Transcriber via HTTP + polling, Enricher via Claude + Pydantic, DB Client via Supabase upsert) are linked by queues, and each calls its own cloud service directly. Output files: .mp3, .transcript.json, .enrichment.json. Reference data: glossary, names, topics. A reviewer reads review status from Supabase.
Dashed outlines show cloud vs local boundaries. Dashed lines show cloud API calls. Each pipeline module connects straight up to its cloud service with no crossing arrows.
1
Transcode to MP4 NOT STARTED
Convert all incoming formats (DAT, DVD, AVI, MOV, etc.) to a standardised H.264 MP4 container. Apply canonical file naming at this stage: YYYY-MM-DD_[type]_[location]_[seq].mp4
FFmpeg (batch scripts) HandBrake (GUI fallback)
⚠ Rename every file to the canonical convention at this step — retroactively renaming thousands of records later is painful.
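A minimal sketch of this step, assuming typical archival settings; only the filename pattern comes from the convention above, while the H.264 parameters (CRF 20, medium preset, 192 kbps AAC) and the slug rules are illustrative choices, not decisions:

```python
import re
import subprocess
from pathlib import Path

def canonical_name(date: str, kind: str, location: str, seq: int) -> str:
    """Build the canonical filename: YYYY-MM-DD_[type]_[location]_[seq].mp4"""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        raise ValueError(f"bad date: {date}")
    def slug(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    return f"{date}_{slug(kind)}_{slug(location)}_{seq:02d}.mp4"

def transcode_cmd(src: Path, dst: Path) -> list[str]:
    """FFmpeeg command for a standardised H.264 MP4 (illustrative settings)."""
    return [
        "ffmpeg", "-i", str(src),
        "-c:v", "libx264", "-crf", "20", "-preset", "medium",
        "-c:a", "aac", "-b:a", "192k",
        "-movflags", "+faststart",  # moov atom up front for web playback
        str(dst),
    ]

name = canonical_name("1989-01-10", "Satsang", "Kullu", 1)
print(name)  # → 1989-01-10_satsang_kullu_01.mp4
# subprocess.run(transcode_cmd(Path("raw_capture.avi"), Path(name)), check=True)
```

Renaming at transcode time, as the warning above urges, means every downstream artifact (.mp3, .transcript.json, .enrichment.json) inherits the canonical name for free.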
2
Extract Audio ✓ BUILT
Strip audio track from MP4 to high-quality MP3 for transcription. Browser-based GUI with real-time progress, batch processing, and completion reports. Runs entirely on the user's machine — no cloud uploads.
FFmpeg Python + FastAPI Browser GUI (localhost)
💡 On poor-quality recordings, optionally apply light noise reduction before transcription (for example an RNNoise-based plugin, Audacity's Noise Reduction effect, or iZotope RX); it measurably improves accuracy.
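The per-file logic of this step can be sketched as below: skip-if-exists plus a 320 kbps MP3 command. The exact flags and messages in the built tool may differ; this is a hedged outline of the behavior described above:

```python
import subprocess
from pathlib import Path

def extract_cmd(mp4: Path, mp3: Path) -> list[str]:
    """FFmpeg: drop video (-vn), encode the audio track to 320 kbps MP3."""
    return ["ffmpeg", "-i", str(mp4), "-vn",
            "-c:a", "libmp3lame", "-b:a", "320k", str(mp3)]

def extract_audio(mp4: Path) -> str:
    """One file, idempotently: same name with .mp3 extension, skip if present."""
    mp3 = mp4.with_suffix(".mp3")
    if mp3.exists():
        return f"skipped (output exists): {mp3.name}"
    result = subprocess.run(extract_cmd(mp4, mp3), capture_output=True)
    if result.returncode != 0:
        return f"error (corrupted or no audio track): {mp4.name}"
    return f"extracted: {mp3.name}"

print(extract_cmd(Path("talk.mp4"), Path("talk.mp3"))[0])  # → ffmpeg
```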
System Design & Data Flow
System Architecture

Everything runs locally on the user's machine. The only external touchpoints are the launch page (shyamgyaan.com) and GitHub (source code). No audio or video data leaves the machine.

Cloud (minimal)
shyamgyaan.com
Launch page & OS detection
 + 
GitHub
Source code & CI
↓ downloads launcher & latest code ↓
User's Machine
Launcher
Setup & start
Python Server
localhost
Browser GUI
HTML/CSS/JS
FFmpeg
Audio extraction
MP3 Files
Local disk
Data Flow (Single File)
MP4
Validate
FFmpeg
Extract audio
MP3
Save to target
Report
Log result
Invalid files (corrupted, no audio, output exists) are logged and skipped. Batch continues.
User Journey
Visit Site
OS detected
Download
Launcher
Auto Setup
~1 min first time
Select Folder
Configure
Process
Watch progress
Report
Download .txt
Returning users: run the launcher (~5 sec); it always pulls the latest version.
Requirements (6 issues)
#4 — Audio Extraction GUI
High
Browser-based GUI with folder picker for source directory. Display MP4 file count. Option to output to same or different folder. Transcription checkbox. Start button. Controls disabled during processing.
#5 — FFmpeg MP4 to MP3 Engine
High
Extract audio from each MP4 using FFmpeg. Save as MP3 (320kbps, best quality). Same filename with .mp3 extension. Skip corrupted files, files with no audio, and existing outputs — log each with descriptive message.
#6 — Progress Display (Batch + File)
High
Two progress bars: batch level ("3 of 47") and file level (0% → 100% based on video duration). Show current filename, elapsed time, per-file status. Real-time updates.
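The duration-based file-level percentage can be derived from FFmpeg's `-progress pipe:1` output, which emits key=value lines including `out_time=HH:MM:SS.micro` (the total duration would be read beforehand, e.g. via ffprobe). A minimal parsing sketch:

```python
def hms_to_seconds(hms: str) -> float:
    """Parse FFmpeg's out_time value (HH:MM:SS.micro) into seconds."""
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def file_percent(progress_line: str, duration_s: float) -> float:
    """File-level progress (0-100%) from a single `-progress` key=value line."""
    key, _, value = progress_line.strip().partition("=")
    if key != "out_time":
        raise ValueError("not an out_time line")
    return min(100.0, 100.0 * hms_to_seconds(value) / duration_s)

# A 1-hour talk, 15 minutes into extraction:
print(file_percent("out_time=00:15:00.000000", 3600.0))  # → 25.0
```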
#7 — Completion Report
High
Summary report in GUI: total files, successes, skips, errors. Per-file status with messages. "Download Report" button saves as .txt. Includes timestamp, folder paths, and all details.
#8 — Auto-Transcription Trigger (Stub)
Medium
When transcription checkbox is enabled, trigger job after each successful extraction. Runs in parallel with continued extraction. Stubbed for now — logs "Transcription triggered for [filename]". Status in progress display and report.
#9 — Web Launcher
High
Launch page on shyamgyaan.com detects OS and browser. Downloads platform-specific launcher script. Installs Python + FFmpeg if missing. Pulls latest code. Starts local server and opens browser. Blocks mobile users.
Requirements (6 issues)
#10 — Performance
High
  • 1-hour MP4 extracted in under 60 seconds
  • Max 2GB RAM usage, with monitoring
  • CPU-aware parallel processing (never 100% cores)
  • Disk space check before starting, monitor during
#11 — Reliability
High
  • Crash recovery via checkpoint file — resume from where it left off
  • Idempotent — skip files that already have MP3 output
  • Works fully offline (extraction only)
  • Single file failure never crashes the batch
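The checkpoint-file recovery above could be as simple as a JSON list of completed names, rewritten after each file; the checkpoint filename here is an assumption:

```python
import json
import tempfile
from pathlib import Path

def load_done(checkpoint: Path) -> set:
    """Names completed in a previous (possibly crashed) run."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text())["done"])
    return set()

def mark_done(checkpoint: Path, done: set, name: str) -> None:
    """Rewrite the checkpoint after every file, so a crash loses at most one."""
    done.add(name)
    checkpoint.write_text(json.dumps({"done": sorted(done)}))

with tempfile.TemporaryDirectory() as tmp:
    cp = Path(tmp) / "extraction_checkpoint.json"
    done = load_done(cp)
    for name in ["a.mp4", "b.mp4", "a.mp4"]:  # second "a.mp4" is skipped
        if name in done:
            continue
        mark_done(cp, done, name)
    resumed = load_done(cp)  # what a restarted run would pick up

print(sorted(resumed))  # → ['a.mp4', 'b.mp4']
```

Together with the skip-if-MP3-exists rule, this gives both idempotency and resume-after-crash with no database or lock files.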
#12 — Usability
High
  • Plain English error messages — no stack traces
  • First-time setup under 2 minutes
  • Keyboard navigable, screen reader labels, WCAG AA contrast
#13 — Security & Privacy
High
  • No data leaves the machine during extraction
  • No telemetry without explicit consent
  • Launcher script is human-readable and auditable
  • App only accesses user-selected folders
#14 — Compatibility
High
  • macOS 12+, Windows 10/11
  • Chrome, Firefox, Safari (latest 2 versions)
  • Files up to 10GB, Unicode filenames, up to 100 files per folder
  • Mobile explicitly blocked at launch page
#15 — Logging & Diagnostics
Medium
  • Timestamped log file per run in target folder
  • GUI shows plain English; log captures full technical detail
  • "Share Log" button for easy troubleshooting
3
Transcribe with Diarization ✓ BUILT
Submit audio to transcription API. Must produce: timestamped transcript, speaker identification (diarization), auto-chapters, topic detection. Submit custom Sanskrit/spiritual vocabulary glossary to every service used.
AssemblyAI ✦ Recommended Whisper + Pyannote (local / private) Deepgram
⚠ Sanskrit terms, Hindi-English mixing, and yogic vocabulary will degrade accuracy on all tools. A custom vocabulary glossary + mandatory human review pass is non-negotiable.
System Design & Data Flow
System Architecture

Audio files are uploaded to AssemblyAI's cloud API for transcription. Direct HTTP calls handle upload, polling, and result retrieval (SDK bypassed for reliability). Supports parallel processing. Output JSON is saved locally alongside the source MP3.

Cloud (AssemblyAI)
AssemblyAI API
Transcription + diarization
 + 
Language Detection
Auto Hindi/English/Sanskrit
↑ upload MP3    ↓ return transcript ↓
User's Machine
CLI / GUI
Select files
Glossary
Sanskrit terms
HTTP Client
Upload + poll
.transcript.json
Local disk
Data Flow (Single File)
MP3
Validate
Upload
AssemblyAI
Transcribe
Poll until done
.tmp
Atomic write
.json
Rename
Atomic writes prevent corrupt output. Already-transcribed files are skipped (idempotent). Fatal API errors halt the batch.
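The atomic-write pattern described above, sketched with stdlib calls (os.replace is atomic when source and target sit on the same filesystem, which holds for a .tmp sibling):

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(target: Path, data: dict) -> None:
    """Write to a .tmp sibling, then rename, so a crash can never leave a
    half-written .transcript.json behind."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(data, ensure_ascii=False, indent=2))
    os.replace(tmp, target)

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "talk.transcript.json"
    if not out.exists():  # idempotency: already-transcribed files are skipped
        atomic_write_json(out, {"meta": {"id": "demo"}, "utterances": []})
    ok = out.exists() and not (Path(d) / "talk.transcript.json.tmp").exists()

print(ok)  # → True
```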
Module Structure
models.py
Status, Result
client.py
HTTP client
transcriber.py
Orchestrator
 + 
vocabulary.py
Glossary loader
Follows the extract_audio module pattern: models → wrapper → orchestrator.
Output JSON Structure

Each .transcript.json contains these sections (~3-5 MB for a 2-hour recording):

meta
ID, duration, confidence
utterances
Speaker A/B/C
chapters
Summaries + timestamps
topics
IAB categories
entities
People, places
Requirements (19 items, 3 issues)
#36 — Core Module: Models, Client & Vocabulary
High
  • Sanskrit glossary loaded from text file and submitted as word_boost with boost_param="high" on every API call
  • Glossary toggleable via --no-glossary CLI flag (default: enabled)
  • Missing glossary when enabled = clear error, refuse to start
  • Glossary validated: < 10 MB, < 1000 terms, each term < 100 chars, valid text
  • Automatic language detection for mixed Hindi/English/Sanskrit
  • API key validated at call time with clear error on missing
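The glossary limits listed above can be enforced by a small pure function before the terms are ever submitted as word_boost; the error wording here is illustrative:

```python
MAX_TERMS = 1000
MAX_TERM_CHARS = 100
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB

def validate_glossary(raw: str) -> list:
    """Validate the Sanskrit glossary text; return the cleaned term list
    or raise with a clear message (refuse to start, per #36)."""
    if len(raw.encode("utf-8")) >= MAX_FILE_BYTES:
        raise ValueError("glossary file must be under 10 MB")
    terms = [t.strip() for t in raw.splitlines() if t.strip()]
    if not terms:
        raise ValueError("glossary enabled but no terms found; refusing to start")
    if len(terms) >= MAX_TERMS:
        raise ValueError(f"too many terms ({len(terms)}); limit is {MAX_TERMS}")
    for t in terms:
        if len(t) >= MAX_TERM_CHARS:
            raise ValueError(f"term too long: {t[:30]}")
    return terms

print(validate_glossary("satsang\nguru\nmukti\n"))  # → ['satsang', 'guru', 'mukti']
```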
#37 — Transcriber Orchestrator + CLI
High
  • Single MP3 transcribed via CLI, producing .transcript.json alongside source
  • Speaker diarization: output contains speaker-labeled utterances (A/B/C)
  • Auto chapters: timestamped chapter summaries
  • Topic detection: IAB category taxonomy results
  • Entity detection: people, places, concepts
  • Batch mode: process all MP3s in a directory with per-file progress
  • Idempotent: skip already-transcribed files, overridable with --force
  • --dry-run flag: show what would be transcribed without calling API
  • Batch halts on fatal errors (auth, billing), continues on per-file transient errors
  • Atomic writes: JSON written to .tmp, renamed on completion
#38 — Queue-Based Auto-Transcribe
High
  • Extraction queues completed MP3s for transcription (non-blocking producer-consumer)
  • Events emitted: transcription_queued, transcription_completed, transcription_failed
  • Queue consumer uses run_in_executor to avoid blocking the event loop
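The non-blocking producer-consumer handoff with run_in_executor might look like this self-contained sketch, with a stub standing in for the blocking upload-and-poll call:

```python
import asyncio

def transcribe_blocking(path: str) -> str:
    """Stub for the blocking HTTP upload-and-poll call to AssemblyAI."""
    return path.removesuffix(".mp3") + ".transcript.json"

async def consumer(queue: asyncio.Queue, results: list) -> None:
    loop = asyncio.get_running_loop()
    while True:
        path = await queue.get()
        if path is None:  # sentinel: extraction finished, stop consuming
            break
        # run_in_executor keeps the blocking call off the event loop
        results.append(await loop.run_in_executor(None, transcribe_blocking, path))

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(consumer(queue, results))
    for mp3 in ["a.mp3", "b.mp3"]:  # producer: each completed extraction
        await queue.put(mp3)
    await queue.put(None)
    await task
    return results

print(asyncio.run(main()))  # → ['a.transcript.json', 'b.transcript.json']
```

Extraction (the producer) never waits on transcription; the event loop stays free to stream SSE progress events while the executor thread does the slow network work.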
Requirements (8 items)
Code Quality
High
  • 80%+ test coverage on new code (48 test cases specified)
  • All existing tests pass, BDD feature files updated
  • black + ruff clean (formatting and linting)
Security & Privacy
High
  • No secrets logged or exposed
  • API key validated at call time, not import time
  • File paths validated (resolved, under allowed root, .mp3 only)
Performance
Medium
  • Queue consumer uses run_in_executor: must not block event loop
  • JSON output ~3-5 MB for 2-hour recordings: downstream consumers aware
  • Parallel batch processing with configurable concurrency (default: 3 files)
4
AI Post-Processing & Enrichment ✓ BUILT
Pass transcript text through an LLM to generate structured metadata: segment summaries, controlled-vocabulary topic tags, key quotes, estimated quality score, and speaker identification confirmation.
Claude API GPT-4o (fallback)
💡 Use a structured JSON prompt so the LLM returns consistently parseable output that flows directly into the database ingest step.
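For illustration, here is a stdlib version of the validate-the-structured-reply step. The real pipeline uses Pydantic; the field names and the three-topic vocabulary below are assumptions standing in for the actual schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Enrichment:
    topic_tags: list          # must come from the controlled vocabulary
    quality_score: int        # 1-5 rating
    key_quotes: list = field(default_factory=list)

CONTROLLED_TOPICS = {"meditation", "devotion", "service"}  # illustrative subset

def parse_enrichment(llm_text: str) -> Enrichment:
    """Parse the model's JSON reply and enforce the schema constraints
    that the built pipeline delegates to Pydantic."""
    data = json.loads(llm_text)
    e = Enrichment(**data)
    if not 1 <= e.quality_score <= 5:
        raise ValueError("quality_score out of range")
    bad = [t for t in e.topic_tags if t not in CONTROLLED_TOPICS]
    if bad:
        raise ValueError(f"tags outside controlled vocabulary: {bad}")
    return e

reply = '{"topic_tags": ["meditation"], "quality_score": 4, "key_quotes": []}'
print(parse_enrichment(reply).quality_score)  # → 4
```

Validating at the parse boundary means a malformed or off-vocabulary reply fails loudly here, instead of leaking bad tags into the database ingest step.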
System Design & Data Flow
System Architecture

Reads .transcript.json from step 03, sends transcript content to Claude API (Sonnet 4.6) with structured JSON prompts, and saves .enrichment.json alongside the transcript. Uses Pydantic for guaranteed schema compliance.

Cloud (Anthropic API)
Claude Sonnet 4.6
Structured JSON output
 + 
Prompt Caching
90% input savings
↑ transcript + names roster    ↓ enrichment JSON ↓
User's Machine
CLI / GUI
Select files
Classify
Talk vs music
Claude API
Enrich + identify
.enrichment.json
Local disk
Data Flow (Single File)
.transcript.json
Read input
Classify
Talk / music / empty
Claude API
Pydantic validated
.tmp
Atomic write
.enrichment.json
Rename
Music/chanting recordings auto-classified and skip the Claude API call. Already-enriched files are skipped (idempotent).
Output Structure (.enrichment.json)
Refined Chapters
Improved summaries
Topic Tags
Controlled vocabulary
Key Quotes
With timestamps
Speaker Map
Names + confidence
Quality Score
1-5 rating
Cost Estimate

~$0.08/recording (Sonnet 4.6: $3/$15 per 1M tokens); ~$240 for 3,000 recordings. Combined with transcription (~$810), steps 03 + 04 total ~$1,050. Prompt caching cuts the enrichment cost to ~$150, and the deferred Batch API would cut it to ~$75.
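The per-recording arithmetic is easy to sanity-check; the token counts below (20k input, 1.3k output for one 2-hour transcript sent as utterances + chapters) are assumed values chosen to reproduce the ~$0.08 figure, not measured numbers:

```python
IN_PRICE, OUT_PRICE = 3.00, 15.00  # USD per 1M tokens (Sonnet pricing above)

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * IN_PRICE + output_tokens / 1e6 * OUT_PRICE

# Assumed token counts for one 2-hour transcript:
per_recording = cost_usd(20_000, 1_300)
print(f"${per_recording:.2f} per recording, ${per_recording * 3000:,.0f} for 3,000")
```

Note that input tokens dominate the cost, which is why prompt caching and trimming the words array give the largest savings.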

Requirements (12 items, 2 issues)
#51 — Core Module: Models, Client, Names, Topics
High
  • Claude API wrapper (Sonnet 4.6) with Pydantic-validated structured JSON output
  • Ashram names roster loader (1,066 names from All_Names.xlsx)
  • Controlled topic vocabulary loader (~30 predefined topics)
  • Prompt templates loaded from files (system.txt, user.txt)
  • JSON schema for .enrichment.json output
  • Auto-classify recordings: talk, music/chanting, ceremony, mixed, empty
  • ANTHROPIC_API_KEY validated at call time
#52 — Enricher Orchestrator + CLI
High
  • Read .transcript.json, produce .enrichment.json alongside it
  • Refined chapter summaries (headline + summary rewritten, timestamps preserved)
  • Topic tags from controlled vocabulary only
  • 3-5 key quotes with speaker and matched utterance timestamps
  • Quality score 1-5 with notes
  • Speaker name mapping: A/B/C → real names with confidence (high/medium/low/none) and evidence
  • Music/chanting recordings get minimal enrichment, skip Claude API
  • Batch + single file with 3-attempt retry
  • CLI: --file, --dir, --force, --dry-run
  • Idempotent: skip existing unless --force
  • Empty transcript and Claude refusal handling
  • Cost tracking (tokens used, cost estimate) per file
Requirements (6 items)
Code Quality
High
  • 80%+ test coverage on new code (39 test cases + 3 prompt evals)
  • All existing tests pass
  • black + ruff clean (formatting and linting)
Security & Privacy
High
  • No secrets logged or exposed
  • ANTHROPIC_API_KEY validated at call time, not import time
  • Transcript content sent to Anthropic cloud — team aware of data flow
Performance & Cost
Medium
  • Prompt caching enabled for system prompt (90% input savings)
  • Send utterances + chapters to Claude, NOT the words array (5-10x token savings)
  • Cost tracking per file (tokens_used, cost_estimate_usd in meta)
  • Sequential batch for v1; Batch API (50% off) deferred
5
Human Review Queue ✓ BUILT
A team member reviews AI-generated transcript, summaries, and tags before the record is marked "published." Corrects Sanskrit errors, adds contextual metadata (setting, location, series, tags), and assigns speaker names via autocomplete from the ashram names roster.
Custom Review UI (Cloudflare Pages) Supabase (database)
Shared web app at library.shyamgyaan.com/review — password-protected, no user accounts needed. Static HTML/CSS/JS talks directly to Supabase.
System Design & Data Flow
System Architecture

Static HTML/CSS/JS hosted on Cloudflare Pages. No server — the browser talks directly to Supabase using the JavaScript client loaded via CDN. Password gate is server-side via Cloudflare Pages Functions middleware (SHA-256 hashed, 30-day session cookies).

Cloud
Cloudflare Pages
Static hosting + CDN
 + 
Supabase
PostgreSQL + RLS
↓ serves review UI    ↑↓ reads/writes recordings ↓
Reviewer's Browser
Password Gate
Shared password
Review Form
Edit all fields
Approve & Save
draft → published
Review Workflow
Pipeline
Pushes draft records
Review Queue
Forward/back nav
Published
Library tab
Requirements
Data Model — Setting + Tags
Core
Setting (single select): Satsang, Tea, Trip, Interview, Film. Tags (multi-select checkboxes): Music, Message, Revelation, Esoteric, Celebration. Celebration triggers sub-dropdown. Interview/Film disables Series, Esoteric, Revelation, Message.
Location Autocomplete
Core
223 deduplicated locations (306 raw FMP values normalized via location-mapping.csv). Matches anywhere in string (not just start). Mapping file tracks raw → canonical for FMP import.
Speaker Name Autocomplete
Core
1,066 names from ashram roster. Speaker map shows label → name with confidence. Names editable in both speaker section and utterance table rows.
AI Title Suggestion
Enhancement
Title auto-populated from filename (date prefix stripped). If filename is generic (e.g., "Gyaan 1"), AI-generated headline shown as suggestion with Accept/Revert toggle.
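The strip-date-prefix and generic-title checks might be sketched as below; the "generic" patterns are assumptions extrapolated from the one example ("Gyaan 1") given above:

```python
import re

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}[_ -]*")
GENERIC = re.compile(r"^(gyaan|talk|recording)\s*\d*$", re.IGNORECASE)  # assumed

def title_from_filename(stem: str) -> tuple:
    """Strip the canonical date prefix; flag generic titles that should
    fall back to the AI-generated headline suggestion."""
    title = DATE_PREFIX.sub("", stem).replace("_", " ").strip()
    return title, bool(GENERIC.fullmatch(title))

print(title_from_filename("1989-01-10_Gyaan 1"))     # → ('Gyaan 1', True)
print(title_from_filename("1989-01-10_On Silence"))  # → ('On Silence', False)
```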
Quality Assessment
Enhancement
Audio Quality and Video Quality (Good/Fair/Poor) — human judgment. AI Score (1-5) and AI Quality Notes — system generated. All in one section.
Three-Tab Interface (Queue, Review, Library)
Core
Queue (filterable list of draft recordings with search and pagination), Review (view/edit one recording with forward/back navigation), Library (browse published recordings with advanced search, sort, and filter).
Pipeline Protection
Core
Re-running the pipeline on published recordings does NOT overwrite human edits. push_recording() checks status before upsert — skips if published.
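A sketch of that status check, with a plain dict standing in for the Supabase table and the "1989-01-10 af" event key borrowed from the schema section:

```python
def push_recording(db: dict, key: str, record: dict) -> str:
    """Upsert a pipeline record, but never overwrite human-reviewed rows.
    `db` stands in for the Supabase table (assumed shape: event_key → row)."""
    existing = db.get(key)
    if existing and existing.get("status") == "published":
        return "skipped (published; human edits preserved)"
    db[key] = {**record, "status": record.get("status", "draft")}
    return "upserted (draft)"

db = {"1989-01-10 af": {"title": "Edited title", "status": "published"}}
print(push_recording(db, "1989-01-10 af", {"title": "Pipeline title"}))
print(push_recording(db, "1990-02-02 mo", {"title": "New talk"}))
```

The guard lives in the push function itself, so every caller (fresh runs, re-runs, backfills) gets the protection without remembering to check.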
Requirements
Usability
Non-technical team members can start reviewing without training. Target: under 3 minutes per recording for light review + metadata entry. Desktop-first, tablet-friendly.
Security
Password-protected via Cloudflare Pages Functions middleware (server-side SHA-256, 30-day session cookies). Supabase anon key with RLS policies. No user accounts needed for a small trusted team. Data is non-sensitive recording metadata.
Deployment
Static files on Cloudflare Pages. Auto-deploy from git push to main. No server to maintain. Cache-busting version strings on CSS/JS.
6
Library Database IN PROGRESS
Approved records live in Supabase PostgreSQL with full metadata across two tables: events (immutable facts — date, location, setting, description) and media (recording metadata — tape counts, duration, quality). An event_media_view joins them for the Review UI. Includes FMP legacy data migration (12,155 records) with field mapping, location dedup, and tag normalization.
Supabase PostgreSQL FMP Import Script Typesense (search, future)
Schema: events + media tables (split from original single recordings table). Pipeline re-runs protected — published records never overwritten. FMP import script handles 26-column CSV with location mapping (306→223 values) and tag normalization (489→465).
Database Design
Data Flow — Two Ingest Paths
PATH 1: New recordings (pipeline)
Pipeline
Extract → Transcribe → Enrich
Supabase
events + media (draft)
Review UI
Human QA
Published
Library tab
PATH 2: Legacy data (FMP import)
FMP CSV
12,155 records
import_fmp.py
Map • Normalize
Supabase
events + media
Review UI
Browse • Search
Both paths write to the same events + media tables. source field tracks origin. Future: Typesense search index.
Events + Media Tables (split schema)
Events Table (immutable facts)
id UUID (PK)
event_key TEXT UNIQUE
event_date DATE
time_of_day TEXT
sequence INTEGER
setting TEXT
location TEXT
description TEXT
series TEXT
topic_tags TEXT[]
notes TEXT
source TEXT
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
Media Table (recording metadata)
id UUID (PK)
event_id UUID (FK → events)
tape_count INTEGER
tape_numbers TEXT[]
duration_seconds FLOAT
quality TEXT
recorded_by TEXT
cataloged_by TEXT
cataloged_at TEXT
related_tape_count INTEGER
related_tape_numbers TEXT[]
notes TEXT
source TEXT
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
event_media_view (joined view)
Joins events + media with COALESCE for overlapping fields. Used as a drop-in replacement for legacy recordings queries in the Review UI. Provides a flat view of all event + media data for browsing, filtering, and export.
Controlled Vocabularies
Setting: Satsang, Tea, Trip, Interview, Film
Tags: Music, Message, Revelation, Esoteric, Celebration
Celebrations: Christmas, Holi, New Years, Shiv Ratri, Krishn Janmashtami, Guru Poornima, Swami's Birthday, Valentine's Day
Series: DKS — Divine Knowledge Series (expandable)
Locations: 223 canonical values after FMP dedup (Kullu, On the Land, Mukt Vyom, Naggar, Manali, Montreal, ...)
Speakers: 1,066 names from ashram roster
Indexes
status, recording_type, created_at, category — B-tree
topic_tags, enrichment, speaker_map — GIN (JSONB/array search)
FileMaker Pro Legacy Import
Migration Pipeline
FMP CSV
12,155 rows × 26 cols
Parse
Field mapping
Normalize
Locations • Tags
Supabase
events + media
Field Mapping (26 FMP columns)
Imported → Events
Recording type → setting
Description → description
Tags/topics → topic_tags
Location → location
Notes → notes
Recording date → event_date
Time of day → time_of_day
Sequence → sequence
Date+time composite → event_key
Series/event name → series
Imported → Media
Tape count → tape_count
Tape number(s) → tape_numbers
Duration (min) → duration_seconds (×60)
Quality → quality
Recorded by → recorded_by
Reviewed date/by → cataloged_by/at
Related tape refs → related_tape_numbers
Skipped (6 cols)
Tag count (derived), Entry date (system default), Flag (11 rows), 3 nearly-empty unknowns
Data Normalization
Locations
306 raw values deduplicated to 223 canonical locations via location-mapping.csv. Matches anywhere in string.
Tags
489 raw tag values normalized to 465 via tag-mapping.csv. Semicolon-separated multi-value parsing.
Event Keys
Natural key from date + time of day + sequence (e.g. 1989-01-10 af). Unique per event.
Import Script
scripts/import_fmp.py — CLI with --dry-run, --limit N, and --wipe flags. Batch upserts (500 rows/batch) into events + media tables. 33 unit tests covering event keys, tags, locations, date parsing, and tape number cleaning.
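The 500-rows-per-batch upsert reduces to a simple slicer; for 12,155 records it yields 25 batches, the last holding 155 rows:

```python
from typing import Iterator

def batches(rows: list, size: int = 500) -> Iterator[list]:
    """Yield upsert payloads of at most `size` rows (500/batch in import_fmp.py)."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = list(range(12_155))              # one placeholder per FMP record
sizes = [len(b) for b in batches(rows)]
print(len(sizes), sizes[0], sizes[-1])  # → 25 500 155
```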
Requirements
Capacity
12,155 FMP records + ~3,000 pipeline-processed recordings on Supabase. Events + media split keeps table sizes manageable. Upgrade to Pro ($25/mo) if storage exceeds free tier limit.
Data Integrity
RLS policies on events and media tables. Event key is unique natural key (upsert on conflict). updated_at auto-trigger on every row change. Pipeline and FMP imports both write with source field to track data origin.
Migration Status
Import script built and tested (33 tests). Field mapping, location mapping (306→223), and tag mapping (489→465) files ready. Schema supports both pipeline-processed and FMP-imported records via the same events + media tables.

Phased Roadmap

Three phases over 18 months — from foundational infrastructure to internal tools at scale. A potential public access phase is outlined below but is subject to a future team decision and is not assumed.

Phase 1 — Foundation
Build the Base
0 – 3 months
  • Master metadata schema defined (events + media tables, controlled vocabularies, 223 locations, 465 tags, 1,066 speakers)
  • Canonical file naming established: YYYY-MM-DD_[type]_[location]_[seq].mp4
  • Priority: Ingest ~3,000 recently created edited videos into the system first
  • FFmpeg audio extraction with browser GUI, batch processing, and completion reports
  • AssemblyAI integration with Sanskrit/spiritual vocabulary glossary, parallel processing, direct HTTP client
  • Pipeline built: extraction → transcription (AssemblyAI) → enrichment (Claude API) → Supabase
  • Custom review UI deployed at library.shyamgyaan.com/review (Cloudflare Pages + Supabase)
  • Quality tiers defined (Good/Fair/Poor) in Review UI; bulk triage before processing still needed
  • FMP legacy import: 12,155 records mapped, normalized, and importable via scripts/import_fmp.py (events + media schema, 33 tests)
  • Later: Transcode and rename 12k master recordings at scale; link media files to imported FMP events
Phase 2 — Scale
Process the Backlog
3 – 9 months
  • Run full backlog of recordings through the automated pipeline
  • Migrate master files from Dropbox to object storage (Backblaze B2 or AWS S3)
  • Add Typesense for full-text transcript search across all records
  • Build internal search and browse interface for the team
  • Implement master → edit parent-child schema; begin cataloguing edits
  • Add public/private/restricted access flags to all records
  • Ongoing human review staffing — target 200+ records/month reviewed
Phase 3 — Public Access To Be Determined
Open the Doors?
9 – 18 months
⚠ This phase is exploratory only. Whether the archive becomes publicly accessible is a future decision that requires explicit approval from project stakeholders. Nothing below is assumed or committed — it is included here only to illustrate what a public phase could look like if the team decides to pursue it.
  • ? Launch public library on Vimeo OTT or Uscreen (fastest path to market)
  • ? Implement user accounts, access tiers (free / subscription / pay-per-view)
  • ? Sync metadata from internal DB to public platform
  • ? Expose public search across transcripts and topics
  • ? Review all records for public/private designation before launch
  • ? Evaluate custom platform build: Supabase + Mux + Stripe (based on usage/revenue)
  • ? Migrate to custom platform if/when needed for full control

Key Decisions

The major architectural and tooling choices — with recommendations and trade-offs clearly laid out.

🗄️
Primary Database Where records live; team collaboration; API access for automation
FileMaker Pro (current)
Expensive licensing, poor API ecosystem, limited web integration. Not suitable going forward.
Airtable Ruled Out
Skipped — went straight to Supabase. No point building an integration we'd replace.
Supabase (PostgreSQL) Selected ✦
In production. Events + media tables (split schema), RLS policies, auto-push from pipeline, FMP import script, custom review UI talks directly to it. Free tier handles 15,000+ records.
Omeka
Purpose-built for digital archives. Open source. Less modern UI; worth evaluating if developer resources are limited long-term.
🎙️
Transcription Service Accuracy, diarization, topic detection, Sanskrit vocabulary support
AssemblyAI Recommended ✦
Transcription + speaker diarization + auto-chapters + topic detection in one API. Best for archival batch work.
Whisper + Pyannote
Open source, runs locally (privacy+cost advantage). Best accuracy, but requires more setup; no built-in summaries.
Deepgram
Very fast, good accuracy. Better suited for real-time use cases; AssemblyAI edges it for archival batch work.
🔍
Full-Text Search Layer Searching across thousands of transcripts by keyword, topic, quote
Typesense Recommended ✦
Open source, self-hostable, fast. Excellent for transcript search. Lower cost than Algolia at scale.
Algolia
Best-in-class search UX, generous free tier. Easier to get started, but gets expensive as the corpus grows.
Postgres Full-Text (Supabase)
No separate service needed once on Supabase. Good for moderate scale; doesn't match dedicated search services.
🎬
Public Streaming & Commerce User access, subscriptions, pay-per-view, video delivery
Vimeo OTT / Uscreen Launch ✦
Purpose-built for streaming libraries with commerce. Fastest path to public access. Use while learning user needs.
Supabase + Mux + Stripe
Full custom control. 3–6 month build. Migrate to this once revenue and user base justify the investment.
⚙️
Workflow Automation Orchestrator Connecting transcription → LLM → DB without heavy custom code
n8n Recommended ✦
Open source, self-hostable, visual workflow builder. Connects Dropbox, AssemblyAI, Claude, Supabase without heavy coding. Maintainable by non-developers. Note: Pipeline currently runs as a Python FastAPI app on localhost — n8n integration deferred.
Make (formerly Integromat)
Cloud-hosted visual automation. Easier setup than n8n, but recurring cost and less control.
Custom Python Scripts
Maximum flexibility. Best if you have a dedicated developer. Highest long-term maintenance burden.
☁️
Master File Storage Where the actual MP4, audio, and transcript files live at scale
Dropbox (current)
Not designed as a media archive at this scale. Expensive per GB, limited API for direct processing pipelines.
Backblaze B2 Recommended ✦
Very low cost per GB (~$6/TB/mo), S3-compatible API, reliable. Best price/performance for large video archives.
AWS S3
Industry standard, deep integrations. More expensive than B2 but integrates seamlessly with the broader AWS ecosystem.
Cloudflare R2
No egress fees — excellent if you're serving video directly. Worth considering for the public streaming phase.

Full Tech Stack

Every tool in the recommended stack, organized by function. Short-term choices prioritize speed; long-term choices prioritize scale and control.

File Processing
FFmpeg
Transcoding, audio extraction, batch scripts
HandBrake
GUI fallback for transcoding
iZotope RX / Audacity
Audio cleanup before transcription
Transcription & NLP
AssemblyAI
Primary: transcript + diarization + topics
Whisper (OpenAI)
Local/private processing option
Pyannote
Speaker diarization (if using Whisper)
AI Enrichment
Claude API
Summaries, tagging, structured metadata
Custom vocabulary glossary
Sanskrit / yogic terms submitted to all APIs
Automation
n8n
Visual workflow orchestration
Make
Cloud automation fallback
Dropbox API
Trigger pipeline on new file uploads
Database (Short-term)
Supabase (PostgreSQL)
Main database — events + media tables, event_media_view, RLS policies, auto-push from pipeline + FMP import
Cloudflare Pages
Static hosting for review UI + strategy site — password-protected, CDN edge delivery
Database (Long-term)
Supabase
PostgreSQL + auth + real-time + REST API
Typesense
Full-text transcript search layer
File Storage
Backblaze B2
Master file storage (video, audio, transcripts)
Cloudflare R2
Public delivery (no egress fees)
Dropbox
Team sync layer (not primary archive)
Public Platform
Vimeo OTT / Uscreen
Launch: streaming + commerce + user accounts
Mux
Custom build: adaptive video streaming
Stripe
Custom build: subscriptions + payments

Critical Gaps

Things not explicitly covered in the original workflow that must be addressed before scaling. These will cause expensive problems if ignored.

No Master Metadata Schema Defined
Resolved. Events + media schema defined with controlled vocabularies: Setting (5 values), Tags (5 checkboxes), Celebrations (8 values), Locations (223 deduplicated from FMP), Series (expandable), Speakers (1,066 names). Import script and Review UI both enforce these vocabularies.
File Naming Convention Not Established
Resolved. Canonical format established: YYYY-MM-DD_[type]_[location]_[seq].mp4. Enforced in pipeline code — files that don't match are rejected.
3
Master → Edit Relationship Not Modelled
The database schema must explicitly represent the parent-child relationship between a master recording and its edited versions, with edit metadata (who edited, when, what changed, why). Edits are a "living" part of the system and need a dedicated schema from day one.
4
No Quality Triage Step
Some recordings are degraded — poor audio, background noise, partially inaudible. There is no step in the current workflow to assess recording quality before committing transcription resources. A triage pass (even a simple 3-tier quality flag) should happen early.
5
Public vs. Private Access Not Addressed
Some recordings may contain personal conversations or content not intended for public release. A public/private/restricted access field must be added to the schema now — retroactively reviewing thousands of records before a public launch will be a major bottleneck if not done incrementally.
Sanskrit & Mixed-Language Vocabulary Gap
Resolved. Custom Sanskrit/spiritual vocabulary glossary built and submitted as word_boost on every AssemblyAI API call. Stored in pipeline/transcribe/vocabulary/. Automatic language detection handles Hindi-English mixing.
7
No Staffing Plan for Human Review
At ~10 minutes of review per recording, processing 2,000 recordings means 333 hours of human review work. This is not a one-time task — new edits will always be added. Human review must be treated as a staffed, ongoing role with a defined capacity and throughput target.

Pitfalls to Avoid

Hard-earned lessons from similar archival projects. These are the most common ways this kind of work goes wrong.

Process Risk
Trying to automate before understanding the problem
Run 50 representative recordings through the full workflow manually first. You'll discover the Sanskrit vocabulary problem, audio quality edge cases, and metadata inconsistencies when they're cheap to fix — not after building an automated system around incorrect assumptions.
Team Risk
One-person bottleneck on database entry
If a single person controls adding or approving records, the entire pipeline stalls when they're unavailable. Build multi-user workflows from day one with clear roles and access levels.
Technical Risk
Dropbox as the permanent master archive
Dropbox is a sync tool, not an archive. It's expensive per GB at video scale, has no meaningful API for batch processing pipelines, and is a single point of failure. Migrate masters to object storage (B2 or S3) before scaling.
Quality Risk
Skipping human review to move faster
AI summaries of a spiritual teacher's talks will contain meaningful errors — misheard Sanskrit, wrong speaker attribution, summaries that miss the point. Publishing without review risks putting incorrect or misleading content in front of students and seekers.
Architecture Risk
Building a fully custom platform from day one
A custom database, custom UI, custom auth, and custom commerce layer is a 6+ month project with ongoing maintenance. Use best-of-breed services (Supabase, AssemblyAI, Cloudflare Pages) and connect them. Build custom only when you've outgrown the off-the-shelf tools. Update: the custom review UI was built in one session using static HTML/JS + Supabase — no server, no framework, minimal maintenance.
Data Risk
Inconsistent metadata across the backlog
Without a defined schema and controlled vocabularies, team members will use different tags, location names, and descriptions for the same concepts. This makes search and filtering unreliable. Update: Schema defined with controlled vocabularies — Setting (5 values), Tags (5 checkboxes), Celebrations (8 values), Locations (223 deduplicated from FMP), Series (expandable dropdown), Quality tiers (Good/Fair/Poor). Speaker names autocomplete from 1,066-name roster.
Legal / Privacy Risk
No access control framework before going public
Not all recordings are appropriate for public release. Personal conversations, informal talks, or sensitive content mixed into the archive could cause harm if published without review. Implement a per-record access flag early and review before any public launch.
Cost Risk
Underestimating transcription costs at scale
Several thousand hours of audio through a paid transcription API adds up quickly. Run a cost calculation against your full archive before committing to a service. Consider Whisper (local/free) for the bulk backlog and paid APIs for ongoing new recordings only.