Data Miner
A PostgreSQL-backed, Supervisor-managed video processing pipeline for generating large-scale computer vision datasets from YouTube videos.
Features
- 🔍 YouTube Search - Find videos by keywords and hashtags
- 📥 Smart Downloads - Rate-limited downloading with hashtag blocklists
- 🎬 Frame Extraction - Configurable sampling strategies (interval, time, keyframe)
- 🎯 ML Filtering - SigLIP2-based image-text similarity filtering
- 🔄 Deduplication - DINOv3/FAISS-based cross-video deduplication
- 🎯 Object Detection - Open-set detection (GroundingDINO, OWLv2)
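To make the deduplication feature above more concrete, here is a minimal sketch of greedy near-duplicate removal over frame embeddings. It is an illustration only, not the project's implementation: the function names `cosine` and `deduplicate`, the `threshold` value, and the use of plain Python lists in place of DINOv3 embeddings and a FAISS index are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat zero vectors as dissimilar to everything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(embeddings, threshold=0.9):
    """Return indices of embeddings to keep (hypothetical helper).

    A frame is kept only if its similarity to every previously kept
    frame is below `threshold` (an assumed, illustrative value).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings": the second frame is nearly identical to the first.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = deduplicate(frames)
```

This brute-force loop compares each frame against every kept frame, which is quadratic in the worst case. A vector index such as FAISS replaces that inner scan with a nearest-neighbour lookup, which is what makes cross-video deduplication practical at scale.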
Quick Links
| User Guide | Developer Docs |
|---|---|
| Installation | Architecture Overview |
| Configuration | Database Models |
| CLI Reference | Worker System |
| Quickstart | Contributing |
Architecture Overview
flowchart LR
subgraph Central["Central Pipeline"]
D[Download] --> E[Extract]
end
subgraph Project["Per-Project Pipeline"]
F[Filter] --> DU[Cross-Dedup] --> DT[Detect]
end
E --> F
The pipeline uses:
- PostgreSQL for state management with row-level locking
- Supervisor for worker process management
- Heartbeat-based locking for safe concurrent processing
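The heartbeat-based locking mentioned above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the project's code: the `Task` fields, the `claim` and `heartbeat` helpers, and the 60-second `HEARTBEAT_TIMEOUT` are all hypothetical. In a PostgreSQL-backed version, the claim step would typically run inside a transaction that locks the candidate row (for example with `SELECT ... FOR UPDATE SKIP LOCKED`) so that two workers cannot claim the same row at once; whether Data Miner does exactly this is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT = 60.0  # seconds; hypothetical value for illustration


@dataclass
class Task:
    """Stand-in for a database row (field names are assumptions)."""
    id: int
    locked_by: Optional[str] = None
    heartbeat_at: Optional[float] = None


def is_stale(task: Task, now: float, timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """A lock is stale if its owner has not sent a heartbeat within `timeout`."""
    return task.heartbeat_at is None or now - task.heartbeat_at > timeout


def claim(tasks: List[Task], worker: str, now: float) -> Optional[Task]:
    """Claim the first task that is unlocked or whose lock has gone stale."""
    for task in tasks:
        if task.locked_by is None or is_stale(task, now):
            task.locked_by = worker
            task.heartbeat_at = now
            return task
    return None


def heartbeat(task: Task, worker: str, now: float) -> bool:
    """Refresh the lock; fails if another worker has taken the task over."""
    if task.locked_by != worker:
        return False
    task.heartbeat_at = now
    return True


tasks = [Task(id=1)]
first = claim(tasks, "worker-1", now=0.0)        # worker-1 takes the task
blocked = claim(tasks, "worker-2", now=10.0)     # lock is fresh, nothing to claim
alive = heartbeat(first, "worker-1", now=30.0)   # worker-1 refreshes its lock
takeover = claim(tasks, "worker-2", now=100.0)   # 70 s without heartbeat: stale
lost = heartbeat(first, "worker-1", now=101.0)   # worker-1 no longer owns it
```

The point of the heartbeat is crash recovery: if a worker dies mid-task, its lock simply expires and another worker can pick the task up, without any manual cleanup.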
Getting Started
# Install
pip install -e .
# Initialize database
data-miner init-db
# Add videos and run pipeline
data-miner populate --config config.yaml
data-miner workers setup --config config.yaml
data-miner workers start
See Quickstart for the complete workflow.