Data Miner

Python 3.12+ | PostgreSQL | License: MIT

A PostgreSQL-backed, supervisor-managed video processing pipeline for generating large-scale computer vision datasets from YouTube videos.


Features

  • 🔍 YouTube Search - Find videos by keywords and hashtags
  • 📥 Smart Downloads - Rate-limited downloading with hashtag blocklists
  • 🎬 Frame Extraction - Configurable sampling strategies (interval, time, keyframe)
  • 🎯 ML Filtering - SigLIP2-based image-text similarity filtering
  • 🔄 Deduplication - DINOv3/FAISS-based cross-video deduplication
  • 🎯 Object Detection - Open-set detection (GroundingDINO, OWLv2)
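To make the Frame Extraction sampling strategies more concrete, here is a minimal sketch of how "interval" and "time" sampling could select frame indices. It assumes "interval" means every N-th frame and "time" means one frame every T seconds; the function names `interval_indices` and `time_indices` are hypothetical and are not taken from the Data Miner codebase. Keyframe sampling depends on the video codec and is not covered here.

```python
def interval_indices(total_frames: int, every_n: int) -> list[int]:
    """Pick every N-th frame index (assumed meaning of 'interval' sampling)."""
    if every_n < 1:
        raise ValueError("every_n must be >= 1")
    return list(range(0, total_frames, every_n))


def time_indices(total_frames: int, fps: float, every_seconds: float) -> list[int]:
    """Pick one frame index per time step (assumed meaning of 'time' sampling)."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    indices: list[int] = []
    step = 0
    while True:
        # Convert the target timestamp to the nearest frame index.
        idx = round(step * every_seconds * fps)
        if idx >= total_frames:
            break
        # Skip repeats when the time step is shorter than one frame.
        if not indices or idx != indices[-1]:
            indices.append(idx)
        step += 1
    return indices
```

For example, `interval_indices(10, 3)` selects frames 0, 3, 6, and 9, while `time_indices(100, 25.0, 1.0)` selects one frame per second of a 4-second clip at 25 fps.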

User Guide

  • Installation
  • Configuration
  • CLI Reference
  • Quickstart

Developer Docs

  • Architecture Overview
  • Database Models
  • Worker System
  • Contributing

Architecture Overview

flowchart LR
    subgraph Central["Central Pipeline"]
        D[Download] --> E[Extract]
    end

    subgraph Project["Per-Project Pipeline"]
        F[Filter] --> DU[Cross-Dedup] --> DT[Detect]
    end

    E --> F

The pipeline uses:

  • PostgreSQL for state management with row-level locking
  • Supervisor for worker process management
  • Heartbeat-based locking for concurrent safety
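The idea behind heartbeat-based locking can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not the project's implementation: the `VideoRow` fields, the 60-second timeout, and the function names are all hypothetical. In a real PostgreSQL deployment the check-and-claim step would need to run atomically inside a transaction, for instance with a row-level lock such as `SELECT ... FOR UPDATE SKIP LOCKED`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical timeout: a lock whose heartbeat is older than this counts as abandoned.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


@dataclass
class VideoRow:
    """In-memory stand-in for one database row (column names are assumptions)."""
    video_id: str
    locked_by: Optional[str] = None
    heartbeat_at: Optional[datetime] = None


def is_claimable(row: VideoRow, now: datetime) -> bool:
    """A row is free if nobody holds it or the holder's heartbeat has gone stale."""
    if row.locked_by is None or row.heartbeat_at is None:
        return True
    return now - row.heartbeat_at > HEARTBEAT_TIMEOUT


def try_claim(row: VideoRow, worker_id: str, now: datetime) -> bool:
    """Claim the row for worker_id if it is free; return whether the claim succeeded."""
    if not is_claimable(row, now):
        return False
    row.locked_by = worker_id
    row.heartbeat_at = now
    return True


def heartbeat(row: VideoRow, worker_id: str, now: datetime) -> bool:
    """Refresh the heartbeat; fails if another worker has taken over the row."""
    if row.locked_by != worker_id:
        return False
    row.heartbeat_at = now
    return True


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
    row = VideoRow("example-video")
    print(try_claim(row, "worker-1", t0))                          # first claim
    print(try_claim(row, "worker-2", t0 + timedelta(seconds=30)))  # lock still fresh
    print(try_claim(row, "worker-2", t0 + timedelta(seconds=90)))  # heartbeat stale
```

The design point is that a crashed worker never has to release its lock explicitly: once it stops sending heartbeats, the row becomes claimable again after the timeout.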

Getting Started

# Install
pip install -e .

# Initialize database
data-miner init-db

# Add videos and run pipeline
data-miner populate --config config.yaml
data-miner workers setup --config config.yaml
data-miner workers start

See Quickstart for the complete workflow.


Project Structure

data_miner/
├── cli.py              # CLI commands
├── config/             # Configuration system
├── db/                 # Database layer
├── workers/            # Supervisor-managed workers
├── modules/            # Core processing logic
├── models/             # ML model wrappers
└── utils/              # Utilities