# Audio Classifier - Technical Implementation TODO

## Phase 1: Project Structure & Dependencies

### 1.1 Root structure
- [ ] Create root `.gitignore`
- [ ] Create root `README.md` with setup instructions
- [ ] Create `docker-compose.yml` (PostgreSQL + pgvector)
- [ ] Create `.env.example`

### 1.2 Backend structure (Python/FastAPI)
- [ ] Create `backend/` directory
- [ ] Create `backend/requirements.txt`:
  - fastapi==0.109.0
  - uvicorn[standard]==0.27.0
  - sqlalchemy==2.0.25
  - psycopg2-binary==2.9.9
  - pgvector==0.2.4
  - librosa==0.10.1
  - essentia-tensorflow==2.1b6.dev1110
  - pydantic==2.5.3
  - pydantic-settings==2.1.0
  - python-multipart==0.0.6
  - mutagen==1.47.0
  - numpy==1.24.3
  - scipy==1.11.4
- [ ] Create `backend/pyproject.toml` (optional, for poetry users)
- [ ] Create `backend/.env.example`
- [ ] Create `backend/Dockerfile`
- [ ] Create `backend/src/__init__.py`

### 1.3 Backend core modules structure
- [ ] `backend/src/core/__init__.py`
- [ ] `backend/src/core/audio_processor.py` - librosa feature extraction
- [ ] `backend/src/core/essentia_classifier.py` - Essentia models (genre/mood/instruments)
- [ ] `backend/src/core/analyzer.py` - Main orchestrator
- [ ] `backend/src/core/file_scanner.py` - Recursive folder scanning
- [ ] `backend/src/core/waveform_generator.py` - Peaks extraction for visualization

### 1.4 Backend database modules
- [ ] `backend/src/models/__init__.py`
- [ ] `backend/src/models/database.py` - SQLAlchemy engine + session
- [ ] `backend/src/models/schema.py` - SQLAlchemy models (AudioTrack)
- [ ] `backend/src/models/crud.py` - CRUD operations
- [ ] `backend/src/alembic/` - Migration setup
- [ ] `backend/src/alembic/versions/001_initial_schema.py` - CREATE TABLE + pgvector extension

### 1.5 Backend API structure
- [ ] `backend/src/api/__init__.py`
- [ ] `backend/src/api/main.py` - FastAPI app + CORS + startup/shutdown events
- [ ] `backend/src/api/routes/__init__.py`
- [ ] `backend/src/api/routes/tracks.py` - GET /tracks, GET /tracks/{id}, DELETE /tracks/{id}
- [ ] `backend/src/api/routes/search.py` - GET /search?q=...&genre=...&mood=...
- [ ] `backend/src/api/routes/analyze.py` - POST /analyze/folder, GET /analyze/status/{job_id}
- [ ] `backend/src/api/routes/audio.py` - GET /audio/stream/{id}, GET /audio/download/{id}, GET /audio/waveform/{id}
- [ ] `backend/src/api/routes/similar.py` - GET /tracks/{id}/similar
- [ ] `backend/src/api/routes/stats.py` - GET /stats (total tracks, genres distribution)
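Each file under `routes/` can expose its own FastAPI `APIRouter`, which `main.py` then includes. A minimal wiring sketch under that assumption (the two files are collapsed into one block for brevity, and the endpoint body stays a placeholder until the CRUD layer from Phase 4 exists):

```python
# Hypothetical wiring sketch for the layout above; real route logic lands in Phase 5.
from fastapi import APIRouter, FastAPI
from fastapi.middleware.cors import CORSMiddleware

# backend/src/api/routes/tracks.py
router = APIRouter(prefix="/api/tracks", tags=["tracks"])


@router.get("")
async def list_tracks(skip: int = 0, limit: int = 50):
    # Placeholder response until the CRUD layer (Phase 4) is wired in.
    return {"tracks": [], "total": 0, "skip": skip, "limit": limit}


# backend/src/api/main.py
app = FastAPI(title="Audio Classifier API", version="0.1.0")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # the Next.js dev server
    allow_methods=["*"],
    allow_headers=["*"],
)
app.include_router(router)  # one include_router() call per module in routes/
```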
### 1.6 Backend utils
- [ ] `backend/src/utils/__init__.py`
- [ ] `backend/src/utils/config.py` - Pydantic Settings for env vars
- [ ] `backend/src/utils/logging.py` - Logging setup
- [ ] `backend/src/utils/validators.py` - Audio file validation

### 1.7 Frontend structure (Next.js 14)
- [ ] `npx create-next-app@latest frontend --typescript --tailwind --app --no-src-dir`
- [ ] `cd frontend && npm install`
- [ ] Install deps: `shadcn-ui`, `@tanstack/react-query`, `zustand`, `axios`, `lucide-react`, `recharts`
- [ ] `npx shadcn-ui@latest init`
- [ ] Add shadcn components: button, input, slider, select, card, dialog, progress, toast

### 1.8 Frontend structure details
- [ ] `frontend/app/layout.tsx` - Root layout with QueryClientProvider
- [ ] `frontend/app/page.tsx` - Main library view
- [ ] `frontend/app/tracks/[id]/page.tsx` - Track detail page
- [ ] `frontend/components/SearchBar.tsx`
- [ ] `frontend/components/FilterPanel.tsx`
- [ ] `frontend/components/TrackCard.tsx`
- [ ] `frontend/components/TrackDetails.tsx`
- [ ] `frontend/components/AudioPlayer.tsx`
- [ ] `frontend/components/WaveformDisplay.tsx`
- [ ] `frontend/components/BatchScanner.tsx`
- [ ] `frontend/components/SimilarTracks.tsx`
- [ ] `frontend/lib/api.ts` - Axios client with base URL
- [ ] `frontend/lib/types.ts` - TypeScript interfaces
- [ ] `frontend/hooks/useSearch.ts`
- [ ] `frontend/hooks/useTracks.ts`
- [ ] `frontend/hooks/useAudioPlayer.ts`
- [ ] `frontend/.env.local.example`

---

## Phase 2: Database Schema & Migrations

### 2.1 PostgreSQL setup
- [ ] `docker-compose.yml`: service postgres with pgvector image `pgvector/pgvector:pg16`
- [ ] Expose port 5432
- [ ] Volume for persistence: `postgres_data:/var/lib/postgresql/data`
- [ ] Init script: `backend/init-db.sql` with CREATE EXTENSION vector

### 2.2 SQLAlchemy models
- [ ] Define `AudioTrack` model in `schema.py`:
  - id: UUID (PK)
  - filepath: String (unique, indexed)
  - filename: String
  - duration_seconds: Float
  - file_size_bytes: Integer
  - format: String (mp3/wav)
  - analyzed_at: DateTime
  - tempo_bpm: Float
  - key: String
  - time_signature: String
  - energy: Float
  - danceability: Float
  - valence: Float
  - loudness_lufs: Float
  - spectral_centroid: Float
  - zero_crossing_rate: Float
  - genre_primary: String (indexed)
  - genre_secondary: ARRAY[String]
  - genre_confidence: Float
  - mood_primary: String (indexed)
  - mood_secondary: ARRAY[String]
  - mood_arousal: Float
  - mood_valence: Float
  - instruments: ARRAY[String]
  - has_vocals: Boolean
  - vocal_gender: String (nullable)
  - embedding: Vector(512) (nullable, for future CLAP)
  - embedding_model: String (nullable)
  - metadata: JSON (note: `metadata` is a reserved attribute name on SQLAlchemy declarative models, so expose this column through a differently named attribute)
- [ ] Create indexes: filepath, genre_primary, mood_primary, tempo_bpm
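A minimal sketch of that model, assuming SQLAlchemy 2.0 typed declarative mappings and the `pgvector` Python package; only a representative subset of the columns is shown, and the remaining fields follow the same pattern.

```python
# Partial sketch of backend/src/models/schema.py under the assumptions stated above.
import uuid
from datetime import datetime

from pgvector.sqlalchemy import Vector
from sqlalchemy import BigInteger, Boolean, DateTime, Float, String
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class AudioTrack(Base):
    __tablename__ = "audio_tracks"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    filepath: Mapped[str] = mapped_column(String, unique=True, index=True)
    filename: Mapped[str] = mapped_column(String)
    duration_seconds: Mapped[float] = mapped_column(Float)
    file_size_bytes: Mapped[int] = mapped_column(BigInteger)  # BigInteger avoids overflow on very large files
    analyzed_at: Mapped[datetime] = mapped_column(DateTime)
    tempo_bpm: Mapped[float] = mapped_column(Float, index=True)
    genre_primary: Mapped[str] = mapped_column(String, index=True)
    genre_secondary: Mapped[list[str]] = mapped_column(ARRAY(String), default=list)
    mood_primary: Mapped[str] = mapped_column(String, index=True)
    instruments: Mapped[list[str]] = mapped_column(ARRAY(String), default=list)
    has_vocals: Mapped[bool] = mapped_column(Boolean, default=False)
    embedding = mapped_column(Vector(512), nullable=True)  # reserved for future CLAP embeddings
    # "metadata" is reserved on declarative classes, so map the column under another attribute name
    track_metadata = mapped_column("metadata", JSONB, nullable=True)
```

The `Vector(512)` column is also what backs the similar-tracks query in Phase 4: pgvector's SQLAlchemy integration adds comparators such as `AudioTrack.embedding.cosine_distance(...)` that can be used in an `ORDER BY`.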
### 2.3 Alembic migrations
- [ ] `alembic init backend/src/alembic`
- [ ] Configure `alembic.ini` with DB URL
- [ ] Create initial migration with schema above
- [ ] Add pgvector extension in migration

---

## Phase 3: Core Audio Processing

### 3.1 audio_processor.py - Librosa feature extraction
- [ ] Function `load_audio(filepath: str) -> Tuple[np.ndarray, int]`
- [ ] Function `extract_tempo(y, sr) -> float` - librosa.beat.tempo (deprecated since librosa 0.10; `librosa.feature.rhythm.tempo` or `librosa.beat.beat_track` are the current entry points)
- [ ] Function `extract_key(y, sr) -> str` - librosa.feature.chroma_cqt + key detection
- [ ] Function `extract_spectral_features(y, sr) -> dict`:
  - spectral_centroid
  - zero_crossing_rate
  - spectral_rolloff
  - spectral_bandwidth
- [ ] Function `extract_mfcc(y, sr) -> np.ndarray`
- [ ] Function `extract_chroma(y, sr) -> np.ndarray`
- [ ] Function `extract_energy(y, sr) -> float` - RMS energy
- [ ] Function `extract_all_features(filepath: str) -> dict` - orchestrator
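A minimal sketch of these functions against librosa 0.10. The key detection is a simple Krumhansl-style chroma-template correlation (major keys only), one lightweight heuristic among several; `extract_mfcc`, `extract_chroma`, and the `extract_all_features` orchestrator follow the same pattern and are omitted for brevity.

```python
# Partial sketch of backend/src/core/audio_processor.py against librosa 0.10.
from typing import Tuple

import librosa
import numpy as np

KEY_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Krumhansl-Schmuckler major-key profile
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])


def load_audio(filepath: str, sr: int = 22050) -> Tuple[np.ndarray, int]:
    y, sr = librosa.load(filepath, sr=sr, mono=True)
    return y, sr


def extract_tempo(y: np.ndarray, sr: int) -> float:
    # librosa.beat.tempo is deprecated in 0.10; beat_track also returns the tempo estimate
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    return float(np.atleast_1d(tempo)[0])


def extract_key(y: np.ndarray, sr: int) -> str:
    # Correlate averaged chroma with the major profile rotated to each possible tonic
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)
    scores = [np.corrcoef(np.roll(MAJOR_PROFILE, shift), chroma)[0, 1] for shift in range(12)]
    return f"{KEY_NAMES[int(np.argmax(scores))]} major"


def extract_spectral_features(y: np.ndarray, sr: int) -> dict:
    return {
        "spectral_centroid": float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
        "zero_crossing_rate": float(librosa.feature.zero_crossing_rate(y).mean()),
        "spectral_rolloff": float(librosa.feature.spectral_rolloff(y=y, sr=sr).mean()),
        "spectral_bandwidth": float(librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()),
    }


def extract_energy(y: np.ndarray, sr: int) -> float:
    # Mean RMS energy over all frames
    return float(librosa.feature.rms(y=y).mean())
```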
### 3.2 essentia_classifier.py - Essentia TensorFlow models
- [ ] Download Essentia models (mtg-jamendo):
  - genre: https://essentia.upf.edu/models/classification-heads/mtg_jamendo_genre/mtg_jamendo_genre-discogs-effnet-1.pb
  - mood: https://essentia.upf.edu/models/classification-heads/mtg_jamendo_moodtheme/mtg_jamendo_moodtheme-discogs-effnet-1.pb
  - instrument: https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb
- [ ] Store models in `backend/models/` directory
- [ ] Class `EssentiaClassifier` (a hedged usage sketch follows at the end of this phase):
  - `__init__()`: load models
  - `predict_genre(audio_path: str) -> dict`: returns {primary, secondary[], confidence}
  - `predict_mood(audio_path: str) -> dict`: returns {primary, secondary[], arousal, valence}
  - `predict_instruments(audio_path: str) -> List[dict]`: returns [{name, confidence}, ...]
- [ ] Add model metadata files (class labels) in JSON

### 3.3 waveform_generator.py
- [ ] Function `generate_peaks(filepath: str, num_peaks: int = 800) -> List[float]`
  - Load audio with librosa
  - Downsample to num_peaks points
  - Return normalized amplitude values
- [ ] Cache peaks in JSON file next to audio (optional)

### 3.4 file_scanner.py
- [ ] Function `scan_folder(path: str, recursive: bool = True) -> List[str]`
  - Walk directory tree
  - Filter by extensions: .mp3, .wav, .flac, .m4a, .ogg
  - Return list of absolute paths
- [ ] Function `get_file_metadata(filepath: str) -> dict`
  - Use mutagen for ID3 tags
  - Return: filename, size, format

### 3.5 analyzer.py - Main orchestrator
- [ ] Class `AudioAnalyzer`:
  - `__init__()`
  - `analyze_file(filepath: str) -> AudioAnalysis`:
    1. Validate file exists and is audio
    2. Extract features (audio_processor)
    3. Classify genre/mood/instruments (essentia_classifier)
    4. Get file metadata (file_scanner)
    5. Return structured AudioAnalysis object
  - `analyze_folder(path: str, recursive: bool, progress_callback) -> List[AudioAnalysis]`:
    - Scan folder
    - Parallel processing with ThreadPoolExecutor (num_workers=4)
    - Progress updates
- [ ] Pydantic model `AudioAnalysis` matching JSON schema from architecture
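For the `EssentiaClassifier` in 3.2, a hedged sketch of how the MTG-Jamendo classification heads are typically driven with essentia-tensorflow, following the usage pattern published alongside the models. Two assumptions to verify: the heads consume embeddings from the Discogs-EffNet backbone, so `discogs-effnet-bs64-1.pb` would need to be downloaded in addition to the three heads listed above, and the exact output node names and label lists should be taken from each model's accompanying JSON metadata. Only the genre path is shown; mood and instruments follow the same shape.

```python
# Hedged sketch of backend/src/core/essentia_classifier.py; verify graph/output names and
# the metadata JSON layout against the files published at essentia.upf.edu/models.
import json

import numpy as np
from essentia.standard import MonoLoader, TensorflowPredict2D, TensorflowPredictEffnetDiscogs


class EssentiaClassifier:
    def __init__(self, models_dir: str = "backend/models"):
        # Backbone producing embeddings that the MTG-Jamendo heads consume (assumption noted above)
        self.embedding_model = TensorflowPredictEffnetDiscogs(
            graphFilename=f"{models_dir}/discogs-effnet-bs64-1.pb", output="PartitionedCall:1"
        )
        self.genre_model = TensorflowPredict2D(
            graphFilename=f"{models_dir}/mtg_jamendo_genre-discogs-effnet-1.pb"
        )
        with open(f"{models_dir}/mtg_jamendo_genre-discogs-effnet-1.json") as fh:
            self.genre_labels = json.load(fh)["classes"]

    def predict_genre(self, audio_path: str) -> dict:
        audio = MonoLoader(filename=audio_path, sampleRate=16000, resampleQuality=4)()
        embeddings = self.embedding_model(audio)
        activations = self.genre_model(embeddings).mean(axis=0)  # average over analysis patches
        order = np.argsort(activations)[::-1]
        return {
            "primary": self.genre_labels[order[0]],
            "secondary": [self.genre_labels[i] for i in order[1:4]],
            "confidence": float(activations[order[0]]),
        }
```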
---

## Phase 4: Database CRUD Operations

### 4.1 crud.py - CRUD functions
- [ ] `create_track(session, analysis: AudioAnalysis) -> AudioTrack`
- [ ] `get_track_by_id(session, track_id: UUID) -> Optional[AudioTrack]`
- [ ] `get_track_by_filepath(session, filepath: str) -> Optional[AudioTrack]`
- [ ] `get_tracks(session, skip: int, limit: int, filters: dict) -> List[AudioTrack]`
  - Support filters: genre, mood, bpm_min, bpm_max, energy_min, energy_max, has_vocals
- [ ] `search_tracks(session, query: str, filters: dict, limit: int) -> List[AudioTrack]`
  - Full-text search on: genre_primary, mood_primary, instruments, filename
  - Combined with filters
- [ ] `get_similar_tracks(session, track_id: UUID, limit: int) -> List[AudioTrack]`
  - If embeddings exist: vector similarity with pgvector
  - Fallback: similar genre + mood + BPM range
- [ ] `delete_track(session, track_id: UUID) -> bool`
- [ ] `get_stats(session) -> dict`
  - Total tracks
  - Genres distribution
  - Moods distribution
  - Average BPM
  - Total duration

---

## Phase 5: FastAPI Backend Implementation

### 5.1 config.py - Settings
- [ ] `class Settings(BaseSettings)`:
  - DATABASE_URL: str
  - CORS_ORIGINS: List[str]
  - ANALYSIS_USE_CLAP: bool = False
  - ANALYSIS_NUM_WORKERS: int = 4
  - ESSENTIA_MODELS_PATH: str
  - AUDIO_LIBRARY_PATH: str (optional default scan path)
- [ ] Load from `.env`

### 5.2 main.py - FastAPI app
- [ ] Create FastAPI app with metadata (title, version, description)
- [ ] Add CORS middleware (allow frontend origin)
- [ ] Add startup event: init DB engine, load Essentia models
- [ ] Add shutdown event: cleanup
- [ ] Include routers from routes/
- [ ] Health check endpoint: GET /health

### 5.3 routes/tracks.py
- [ ] `GET /api/tracks`:
  - Query params: skip, limit, genre, mood, bpm_min, bpm_max, energy_min, energy_max, has_vocals, sort_by
  - Return paginated list of tracks
  - Include total count
- [ ] `GET /api/tracks/{track_id}`:
  - Return full track details
  - 404 if not found
- [ ] `DELETE /api/tracks/{track_id}`:
  - Soft delete or hard delete (remove from DB only, keep file)
  - Return success

### 5.4 routes/search.py
- [ ] `GET /api/search`:
  - Query params: q (search query), genre, mood, bpm_min, bpm_max, limit
  - Full-text search + filters
  - Return matching tracks

### 5.5 routes/audio.py
- [ ] `GET /api/audio/stream/{track_id}`:
  - Get track from DB
  - Return FileResponse with media_type audio/mpeg
  - Support Range requests for seeking (Accept-Ranges: bytes)
  - Headers: Content-Disposition: inline
- [ ] `GET /api/audio/download/{track_id}`:
  - Same as stream but Content-Disposition: attachment
- [ ] `GET /api/audio/waveform/{track_id}`:
  - Get track from DB
  - Generate or load cached peaks (waveform_generator)
  - Return JSON: {peaks: [], duration: float}

### 5.6 routes/analyze.py
- [ ] `POST /api/analyze/folder`:
  - Body: {path: str, recursive: bool}
  - Validate path exists
  - Start background job (asyncio Task or Celery)
  - Return job_id
- [ ] `GET /api/analyze/status/{job_id}`:
  - Return job status: {status: "pending|running|completed|failed", progress: int, total: int, errors: []}
- [ ] Background worker implementation:
  - Scan folder
  - For each file: analyze, save to DB (skip if already exists by filepath)
  - Update job status
  - Store job state in an in-memory dict or Redis
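A minimal sketch of the analyze routes and background worker above, using FastAPI's `BackgroundTasks` with an in-process job dict (the Redis variant would replace that dict when running multiple workers). The `src.core` imports and the `analyze_and_store` helper are assumptions standing in for the Phase 3/4 pieces, not a defined API.

```python
# Sketch of backend/src/api/routes/analyze.py under the assumptions stated above.
import uuid
from pathlib import Path

from fastapi import APIRouter, BackgroundTasks, HTTPException
from pydantic import BaseModel

from src.core.file_scanner import scan_folder      # assumed import path (Phase 3.4)
from src.core.analyzer import analyze_and_store    # hypothetical helper built on Phases 3.5/4.1

router = APIRouter(prefix="/api/analyze", tags=["analyze"])

# job_id -> {"status", "progress", "total", "errors"}; swap for Redis with multiple workers
JOBS: dict[str, dict] = {}


class AnalyzeFolderRequest(BaseModel):
    path: str
    recursive: bool = True


def run_analysis_job(job_id: str, path: str, recursive: bool) -> None:
    job = JOBS[job_id]
    job["status"] = "running"
    try:
        files = scan_folder(path, recursive=recursive)
        job["total"] = len(files)
        for i, filepath in enumerate(files, start=1):
            try:
                analyze_and_store(filepath)          # analyze + upsert, skipping known filepaths
            except Exception as exc:                 # keep going on per-file failures
                job["errors"].append(f"{filepath}: {exc}")
            job["progress"] = i
        job["status"] = "completed"
    except Exception as exc:
        job["status"] = "failed"
        job["errors"].append(str(exc))


@router.post("/folder")
async def analyze_folder(body: AnalyzeFolderRequest, background_tasks: BackgroundTasks):
    if not Path(body.path).is_dir():
        raise HTTPException(status_code=400, detail="Path does not exist or is not a directory")
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending", "progress": 0, "total": 0, "errors": []}
    # Sync callables passed to BackgroundTasks run in a worker thread after the response is sent
    background_tasks.add_task(run_analysis_job, job_id, body.path, body.recursive)
    return {"job_id": job_id}


@router.get("/status/{job_id}")
async def analyze_status(job_id: str):
    if job_id not in JOBS:
        raise HTTPException(status_code=404, detail="Unknown job")
    return JOBS[job_id]
```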
### 5.7 routes/similar.py
- [ ] `GET /api/tracks/{track_id}/similar`:
  - Query params: limit (default 10)
  - Get similar tracks (CRUD function)
  - Return list of tracks

### 5.8 routes/stats.py
- [ ] `GET /api/stats`:
  - Get stats (CRUD function)
  - Return JSON with counts, distributions

---

## Phase 6: Frontend Implementation

### 6.1 API client (lib/api.ts)
- [ ] Create axios instance with baseURL from env var (NEXT_PUBLIC_API_URL)
- [ ] API functions:
  - `getTracks(params: FilterParams): Promise<{tracks: Track[], total: number}>`
  - `getTrack(id: string): Promise<Track>`
  - `deleteTrack(id: string): Promise<void>`
  - `searchTracks(query: string, filters: FilterParams): Promise<Track[]>`
  - `getSimilarTracks(id: string, limit: number): Promise<Track[]>`
  - `analyzeFolder(path: string, recursive: boolean): Promise<{jobId: string}>`
  - `getAnalyzeStatus(jobId: string): Promise<JobStatus>`
  - `getStats(): Promise<Stats>`

### 6.2 TypeScript types (lib/types.ts)
- [ ] `interface Track` matching AudioTrack model
- [ ] `interface FilterParams`
- [ ] `interface JobStatus`
- [ ] `interface Stats`

### 6.3 Hooks
- [ ] `hooks/useTracks.ts`:
  - useQuery for fetching tracks with filters
  - Pagination state
  - Mutation for delete
- [ ] `hooks/useSearch.ts`:
  - Debounced search query
  - Combined filters state
- [ ] `hooks/useAudioPlayer.ts`:
  - Current track state
  - Play/pause/seek controls
  - Volume control
  - Queue management (optional)

### 6.4 Components - UI primitives (shadcn)
- [ ] Install shadcn components: button, input, slider, select, card, dialog, badge, progress, toast, dropdown-menu, tabs

### 6.5 SearchBar.tsx
- [ ] Input with search icon
- [ ] Debounced onChange (300ms)
- [ ] Clear button
- [ ] Optional: suggestions dropdown

### 6.6 FilterPanel.tsx
- [ ] Genre multi-select (fetch available genres from API or hardcode)
- [ ] Mood multi-select
- [ ] BPM range slider (min/max)
- [ ] Energy range slider
- [ ] Has vocals checkbox
- [ ] Sort by dropdown (Latest, BPM, Duration, Name)
- [ ] Clear all filters button

### 6.7 TrackCard.tsx
- [ ] Props: track: Track, onPlay, onDelete
- [ ] Display: filename, duration, BPM, genre, mood, instruments (badges)
- [ ] Inline AudioPlayer component
- [ ] Buttons: Play, Download, Similar, Details
- [ ] Hover effects

### 6.8 AudioPlayer.tsx
- [ ] Props: trackId, filename, duration
- [ ] HTML5 audio element with ref
- [ ] WaveformDisplay child component
- [ ] Progress slider (seek support)
- [ ] Play/Pause button
- [ ] Volume slider with icon
- [ ] Time display (current / total)
- [ ] Download button (calls /api/audio/download/{id})

### 6.9 WaveformDisplay.tsx
- [ ] Props: trackId, currentTime, duration
- [ ] Fetch peaks from /api/audio/waveform/{id}
- [ ] Canvas rendering:
  - Draw bars for each peak
  - Color played portion differently (blue vs gray)
  - Click to seek
- [ ] Loading state while fetching peaks

### 6.10 TrackDetails.tsx (Modal/Dialog)
- [ ] Props: trackId, open, onClose
- [ ] Fetch full track details
- [ ] Display all metadata in organized sections:
  - Audio info: duration, format, file size
  - Musical features: tempo, key, time signature, energy, danceability, valence
  - Classification: genre (primary + secondary), mood (primary + secondary + arousal/valence), instruments
  - Spectral features: spectral centroid, zero crossing rate, loudness
- [ ] Similar tracks section (preview)
- [ ] Download button

### 6.11 SimilarTracks.tsx
- [ ] Props: trackId, limit
- [ ] Fetch similar tracks
- [ ] Display as list of mini TrackCards
- [ ] Click to navigate or play

### 6.12 BatchScanner.tsx
- [ ] Input for folder path
- [ ] Recursive checkbox
- [ ] Scan button
- [ ] Progress bar (poll /api/analyze/status/{jobId})
- [ ] Status messages (pending, running X/Y, completed, errors)
- [ ] Error list if any

### 6.13 Main page (app/page.tsx)
- [ ] SearchBar at top
- [ ] FilterPanel in sidebar or collapsible
- [ ] BatchScanner in header or dedicated section
- [ ] TrackCard grid/list
- [ ] Pagination controls (Load More or page numbers)
- [ ] Total tracks count
- [ ] Loading states
- [ ] Empty state if no tracks

### 6.14 Track detail page (app/tracks/[id]/page.tsx)
- [ ] Fetch track by ID
- [ ] Large AudioPlayer
- [ ] Full metadata display (similar to TrackDetails modal)
- [ ] SimilarTracks section
- [ ] Back to library button

### 6.15 Layout (app/layout.tsx)
- [ ] QueryClientProvider setup
- [ ] Toast provider (for notifications)
- [ ] Global styles
- [ ] Header with app title and nav

---

## Phase 7: Docker & Deployment

### 7.1 docker-compose.yml
- [ ] Service: postgres
  - image: pgvector/pgvector:pg16
  - environment: POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB
  - ports: 5432:5432
  - volumes: postgres_data, init-db.sql
- [ ] Service: backend
  - build: ./backend
  - depends_on: postgres
  - environment: DATABASE_URL
  - ports: 8000:8000
  - volumes: audio files mount (read-only)
- [ ] Service: frontend (optional, or dev mode only)
  - build: ./frontend
  - ports: 3000:3000
  - environment: NEXT_PUBLIC_API_URL=http://localhost:8000

### 7.2 Backend Dockerfile
- [ ] FROM python:3.11-slim
- [ ] Install system deps: ffmpeg, libsndfile1
- [ ] COPY requirements.txt
- [ ] RUN pip install -r requirements.txt
- [ ] COPY src/
- [ ] Download Essentia models during build or on startup
- [ ] CMD: uvicorn src.api.main:app --host 0.0.0.0 --port 8000

### 7.3 Frontend Dockerfile (production build)
- [ ] FROM node:20-alpine
- [ ] COPY package.json, package-lock.json
- [ ] RUN npm ci
- [ ] COPY app/, components/, lib/, hooks/, public/
- [ ] RUN npm run build
- [ ] CMD: npm start

---

## Phase 8: Documentation & Scripts

### 8.1 Root README.md
- [ ] Project description
- [ ] Features list
- [ ] Tech stack
- [ ] Prerequisites (Docker, Node, Python)
- [ ] Quick start:
  - Clone repo
  - Copy .env.example to .env
  - docker-compose up
  - Access frontend at localhost:3000
- [ ] Development setup
- [ ] API documentation link (FastAPI /docs)
- [ ] Architecture diagram (optional)

### 8.2 Backend README.md
- [ ] Setup instructions
- [ ] Environment variables documentation
- [ ] Essentia models download instructions
- [ ] API endpoints list
- [ ] Database schema
- [ ] Running migrations

### 8.3 Frontend README.md
- [ ] Setup instructions
- [ ] Environment variables
- [ ] Available scripts (dev, build, start)
- [ ] Component structure

### 8.4 Scripts
- [ ] `scripts/download-essentia-models.sh` - Download Essentia models
- [ ] `scripts/init-db.sh` - Run migrations
- [ ] `backend/src/cli.py` - CLI for manual analysis (optional)

---

## Phase 9: Testing & Validation

### 9.1 Backend tests (optional but recommended)
- [ ] Test audio_processor.extract_all_features with sample file
- [ ] Test essentia_classifier with sample file
- [ ] Test CRUD operations
- [ ] Test API endpoints with pytest + httpx
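A minimal sketch for the "pytest + httpx" item, assuming the app is importable as `src.api.main.app`. httpx's `ASGITransport` drives the FastAPI app in-process, so no server needs to run; DB-backed endpoints would additionally need a test database or FastAPI dependency overrides.

```python
# Sketch of backend/tests/test_api.py (hypothetical path); runs the app without uvicorn.
import httpx
import pytest

from src.api.main import app  # assumed import path from Phase 1.5


@pytest.fixture
def anyio_backend():
    # The anyio pytest plugin (installed alongside FastAPI/Starlette) needs a backend fixture
    return "asyncio"


@pytest.mark.anyio
async def test_health_endpoint():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as client:
        response = await client.get("/health")
    assert response.status_code == 200
```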
### 9.2 Frontend tests (optional)
- [ ] Test API client functions
- [ ] Test hooks
- [ ] Component tests with React Testing Library

### 9.3 Integration test
- [ ] Full flow: analyze folder -> save to DB -> search -> play -> download

---

## Phase 10: Optimizations & Polish

### 10.1 Performance
- [ ] Add database indexes
- [ ] Cache waveform peaks
- [ ] Optimize audio loading (lazy loading for large libraries)
- [ ] Add compression for API responses

### 10.2 UX improvements
- [ ] Loading skeletons
- [ ] Error boundaries
- [ ] Toast notifications for actions
- [ ] Keyboard shortcuts (space to play/pause, arrows to seek)
- [ ] Dark mode support

### 10.3 Backend improvements
- [ ] Rate limiting
- [ ] Request validation with Pydantic
- [ ] Logging (structured logs)
- [ ] Error handling middleware

---

## Implementation order priority

1. **Phase 2** (Database) - Foundation
2. **Phase 3** (Audio processing) - Core logic
3. **Phase 4** (CRUD) - Data layer
4. **Phase 5.1-5.2** (FastAPI setup) - API foundation
5. **Phase 5.3-5.8** (API routes) - Complete backend
6. **Phase 6.1-6.3** (Frontend setup + API client + hooks) - Frontend foundation
7. **Phase 6.4-6.12** (Components) - UI implementation
8. **Phase 6.13-6.15** (Pages) - Complete frontend
9. **Phase 7** (Docker) - Deployment
10. **Phase 8** (Documentation) - Final polish

---

## Notes for implementation

- Use type hints everywhere in Python
- Use TypeScript strict mode in the frontend
- Handle errors gracefully (try/catch, proper HTTP status codes)
- Add logging at key points (file analysis start/end, DB operations)
- Validate file paths (security: prevent path traversal)
- Consider file locking for concurrent analysis
- Add progress updates for long operations
- Use environment variables for all config
- Keep audio files outside Docker volumes for performance
- Consider caching Essentia predictions (they are expensive to compute)
- Add retry logic for failed analyses
- Support cancellation for long-running jobs

## Files to download/prepare before starting

1. Essentia models (3 files):
   - mtg_jamendo_genre-discogs-effnet-1.pb
   - mtg_jamendo_moodtheme-discogs-effnet-1.pb
   - mtg_jamendo_instrument-discogs-effnet-1.pb
2. Class labels JSON for each model
3. Sample audio files for testing

## External dependencies verification

- librosa: check version compatibility with numpy
- essentia-tensorflow: verify the CPU-only build works
- pgvector: verify PostgreSQL extension installation
- FFmpeg: required by librosa for audio decoding

## Security considerations

- Validate all file paths (no ../ traversal; a helper sketch follows at the end of this document)
- Sanitize user input in search queries
- Rate limit API endpoints
- CORS: whitelist the frontend origin only
- Don't expose full filesystem paths in API responses
- Consider adding authentication later (JWT)

## Future enhancements (not in current scope)

- CLAP embeddings for semantic search
- Batch export to CSV/JSON
- Playlist creation
- Audio trimming/preview segments
- Duplicate detection (audio fingerprinting)
- Tag editing (write back to files)
- Multi-user support with authentication
- WebSocket for real-time analysis progress
- Audio visualization (spectrogram, chromagram)
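Relating to the path-validation items under "Notes for implementation" and "Security considerations": a small helper sketch, assuming requests may only touch files beneath a configured library root (for example the AUDIO_LIBRARY_PATH setting from Phase 5.1). The function name and the single-root approach are assumptions, not a spec; `Path.is_relative_to` requires Python 3.9+, which the `python:3.11-slim` base image satisfies.

```python
# Sketch of a traversal-safe path check for the analyze/audio endpoints.
from pathlib import Path

from fastapi import HTTPException


def resolve_within_library(requested: str, library_root: str) -> Path:
    root = Path(library_root).resolve()
    candidate = (root / requested).resolve()  # resolve() collapses ".." segments and symlinks
    if not candidate.is_relative_to(root):
        raise HTTPException(status_code=403, detail="Path outside the audio library")
    return candidate
```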