Configuration¶
Data Miner uses YAML configuration files with OmegaConf for variable interpolation and Pydantic for validation.
Configuration Loading¶
The config system supports three modes:
- Default config - Built-in defaults from data_miner/config/default.yaml
- User config - Override with --config path/to/config.yaml
- Environment variable - Set DATA_MINER_CONFIG=/path/to/config.yaml
User configs are merged with defaults, so you only need to specify overrides.
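The merge can be pictured as a recursive dictionary merge. The sketch below uses plain Python rather than OmegaConf, and the `deep_merge` helper is hypothetical; it only illustrates why a user config needs to list its overrides and nothing else. It assumes lists are replaced wholesale rather than concatenated.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where values in `overrides` win over `defaults`.

    Nested dicts are merged key by key; any other value (including lists)
    is replaced wholesale. Hypothetical helper, not the project's own code.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative subset of the defaults (values taken from this page).
defaults = {
    "output_dir": "./output",
    "supervisor": {"download_workers": 3, "extract_workers": 2},
}
# A user config that overrides a single nested key.
user = {"supervisor": {"download_workers": 1}}

config = deep_merge(defaults, user)
print(config)
```

Only `download_workers` changes; `extract_workers` and `output_dir` keep their default values.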
Minimal Config Example¶
# config.yaml
project_name: "glass_doors"
output_dir: "./output"
input:
  search_queries:
    - "glass door installation"
    - "sliding glass door"
  max_results_per_query: 50
filter:
  positive_prompts:
    - "a glass door"
    - "a sliding door"
Full Configuration Reference¶
Project Settings¶
project_name: "my_project"
output_dir: "./output"
project_output_dir: "${output_dir}/projects/${project_name}"
device: "auto" # auto, cuda, cuda:0, cpu
Variable Interpolation: Use ${key} or ${section.key} to reference other config values.
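To make the expansion concrete, here is a minimal sketch of how a chain of references settles into a final path. OmegaConf performs the real resolution; the regex-based `resolve` and `lookup` helpers below are hypothetical stand-ins that assume every reference is an absolute, dot-separated key from the config root.

```python
import re

# Matches ${key} or ${section.key}; nested braces are not supported here.
_PATTERN = re.compile(r"\$\{([^${}]+)\}")


def lookup(config: dict, dotted_key: str):
    """Walk a nested dict using a dot-separated key."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(config: dict, value: str, max_depth: int = 10) -> str:
    """Expand references in `value` until nothing is left to expand."""
    for _ in range(max_depth):
        expanded = _PATTERN.sub(lambda m: str(lookup(config, m.group(1))), value)
        if expanded == value:
            return expanded
        value = expanded
    raise ValueError(f"interpolation did not settle: {value!r}")


config = {
    "project_name": "my_project",
    "output_dir": "./output",
    "project_output_dir": "${output_dir}/projects/${project_name}",
    "filter": {"output_dir": "${project_output_dir}/frames_filtered"},
}

resolved = resolve(config, config["filter"]["output_dir"])
print(resolved)  # ./output/projects/my_project/frames_filtered
```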
Input Sources¶
input:
  # YouTube search
  search_enabled: true
  search_queries:
    - "glass door installation"
  max_results_per_query: 50
  # Direct URLs
  urls:
    - "https://www.youtube.com/watch?v=abc123"
  # URL file (one URL per line)
  url_file: "urls.txt"
Database¶
Supervisor (Worker Counts)¶
supervisor:
  download_workers: 3  # Parallel downloaders
  extract_workers: 2   # Frame extractors
  filter_workers: 1    # ML filter workers (GPU-bound)
  dedup_workers: 1     # Deduplication workers
  detect_workers: 1    # Detection workers
Set any worker count to 0 to disable that stage.
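As a rough illustration of that rule, the sketch below derives the list of active stages from a `supervisor` mapping. The `enabled_stages` function is hypothetical and assumes every stage key ends in `_workers`, as in the block above.

```python
def enabled_stages(supervisor: dict) -> list:
    """Return stage names whose worker count is greater than zero."""
    suffix = "_workers"
    return [
        key[: -len(suffix)]
        for key, count in supervisor.items()
        if key.endswith(suffix) and count > 0
    ]


supervisor = {
    "download_workers": 3,
    "extract_workers": 2,
    "filter_workers": 1,
    "dedup_workers": 0,  # disabled for this run
    "detect_workers": 1,
}

stages = enabled_stages(supervisor)
print(stages)  # ['download', 'extract', 'filter', 'detect']
```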
Download Stage¶
download:
  output_dir: "${output_dir}/videos"
  format: "bestvideo[height<=1080]+bestaudio/best[height<=1080]"
  max_resolution: 1080
  timeout: 300
  # Rate limiting (avoid YouTube blocks)
  sleep_interval: 30      # Min seconds between downloads
  max_sleep_interval: 60  # Max seconds (randomized)
  sleep_requests: 10      # Seconds between API requests
  # Hashtag blocklist file
  blocked_hashtag_patterns: "blocked_hashtags.txt"
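One reading of the two sleep settings is that each download waits a random time between the minimum and the maximum. The sketch below shows that interpretation only; `next_delay` is a hypothetical helper and the uniform distribution is an assumption, since this page does not say how the delay is drawn.

```python
import random


def next_delay(sleep_interval: float, max_sleep_interval: float, rng=random) -> float:
    """Pick a delay in seconds between two downloads.

    Assumes a uniform draw in [sleep_interval, max_sleep_interval].
    """
    if max_sleep_interval < sleep_interval:
        raise ValueError("max_sleep_interval must be >= sleep_interval")
    return rng.uniform(sleep_interval, max_sleep_interval)


delay = next_delay(30, 60)
print(f"sleeping {delay:.1f}s before the next download")
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.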
Extract Stage¶
extract:
  output_dir: "${output_dir}/frames_raw"
  strategy: "interval"     # interval, time, keyframe
  interval_frames: 30      # Every N frames
  interval_seconds: 1.0    # Every N seconds (for time strategy)
  max_frames_per_video: 5000
  image_format: "jpg"      # jpg, png, webp
  quality: 95              # JPEG/WebP quality (1-100)
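The difference between the `interval` and `time` strategies can be sketched as a choice of frame step. The `frame_indices` function below is hypothetical: it assumes sampling starts at frame 0, that the `time` strategy converts seconds to frames using the video's FPS, and that `max_frames_per_video` truncates the result. The `keyframe` strategy depends on decoder metadata and is left out.

```python
def frame_indices(
    total_frames: int,
    fps: float,
    strategy: str = "interval",
    interval_frames: int = 30,
    interval_seconds: float = 1.0,
    max_frames_per_video: int = 5000,
) -> list:
    """Return the indices of the frames that would be extracted."""
    if strategy == "interval":
        step = interval_frames
    elif strategy == "time":
        # Convert the time interval to a whole number of frames (at least 1).
        step = max(1, round(fps * interval_seconds))
    else:
        raise ValueError(f"strategy not covered by this sketch: {strategy!r}")
    if step < 1:
        raise ValueError("the frame step must be at least 1")
    return list(range(0, total_frames, step))[:max_frames_per_video]


print(frame_indices(100, fps=30.0))  # [0, 30, 60, 90]
print(frame_indices(100, fps=25.0, strategy="time", interval_seconds=2.0))  # [0, 50]
```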
Filter Stage (SigLIP2)¶
filter:
  output_dir: "${project_output_dir}/frames_filtered"
  device: "${device}"
  model_id: "siglip2-so400m"  # siglip2-so400m, siglip2-giant
  batch_size: 32
  # Thresholds
  threshold: 0.25         # Min positive match score
  margin_threshold: 0.05  # Positive must beat negative by this
  positive_prompts:
    - "a glass door"
    - "a sliding door"
  negative_prompts:
    - "a glass wall"
    - "a mirror"
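The two thresholds combine into a keep/drop decision per frame. The sketch below is one plausible reading of the comments above, not the project's confirmed logic: it assumes the best positive score is compared with the best negative score, and `keep_frame` is a hypothetical name. The scores would come from the SigLIP2 model; here they are made-up numbers.

```python
def keep_frame(
    positive_scores: list,
    negative_scores: list,
    threshold: float = 0.25,
    margin_threshold: float = 0.05,
) -> bool:
    """Decide whether a frame passes the filter.

    Assumption: the frame is kept when its best positive score reaches
    `threshold` and beats the best negative score by `margin_threshold`.
    """
    if not positive_scores:
        raise ValueError("at least one positive prompt score is required")
    best_positive = max(positive_scores)
    # With no negative prompts, the margin check always passes.
    best_negative = max(negative_scores, default=float("-inf"))
    return (
        best_positive >= threshold
        and best_positive - best_negative >= margin_threshold
    )


print(keep_frame([0.40, 0.10], [0.20, 0.05]))  # True: strong and well separated
print(keep_frame([0.30], [0.28]))              # False: margin too small
print(keep_frame([0.20], [0.01]))              # False: below the threshold
```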
Dedup Stage (FAISS)¶
dedup:
  output_dir: "${project_output_dir}/frames_dedup"
  device: "${device}"
  model_type: "dino"            # dino, siglip
  dino_model_id: "dinov3-base"  # dinov2-base, dinov3-base, etc.
  threshold: 0.90               # Similarity threshold
  batch_size: 64
  k_neighbors: 50               # FAISS KNN search depth
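To show what the similarity threshold does, here is a brute-force sketch in plain Python. It is deliberately not FAISS: the real stage searches only the `k_neighbors` nearest candidates, while this version compares every frame with every kept frame. It also assumes the similarity is cosine similarity and that a score at or above `threshold` marks a duplicate; `deduplicate` is a hypothetical helper.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(embeddings: list, threshold: float = 0.90) -> list:
    """Return indices of frames to keep, in their original order.

    A frame is dropped when it is at least `threshold` similar to a frame
    that has already been kept (greedy, first occurrence wins).
    """
    kept = []
    for index, embedding in enumerate(embeddings):
        if all(cosine(embedding, embeddings[j]) < threshold for j in kept):
            kept.append(index)
    return kept


# Toy 2-D embeddings: the second frame is nearly identical to the first.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = deduplicate(embeddings)
print(kept)  # [0, 2]
```

Raising `threshold` toward 1.0 keeps more near-duplicates; lowering it removes more aggressively.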
Detect Stage¶
detect:
  output_dir: "${project_output_dir}/detections"
  device: "${device}"
  detector: "grounding_dino"  # grounding_dino, owlv2, florence2
  threshold: 0.3
  confidence_threshold: 0.3
  batch_size: 16
  save_visualizations: true
Monitor Settings¶
The monitor worker handles:
- Project stage transitions (e.g., FILTERING → DEDUP_READY)
- Stale lock recovery (resets locks from crashed workers)
- Frame count aggregation
monitor:
  poll_interval: 10                   # Seconds between checks
  stale_threshold_minutes: 2          # Reset stale locks after N minutes
  long_running_threshold_minutes: 30  # Warn about old locks
  cleanup_extracted_videos: false     # Delete videos after extraction
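The two lock thresholds can be read as a small classification rule. The sketch below is speculative: this page does not describe the lock record, so the `locked_at` and `heartbeat_at` timestamps, the `classify_lock` function, and the idea that staleness is measured from the last heartbeat are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def classify_lock(
    locked_at: datetime,
    heartbeat_at: datetime,
    now: datetime,
    stale_threshold_minutes: int = 2,
    long_running_threshold_minutes: int = 30,
) -> str:
    """Classify a lock as 'stale', 'long_running', or 'ok'.

    Assumption: a worker refreshes `heartbeat_at` while it is alive, so a
    missing heartbeat means the worker crashed and the lock can be reset.
    """
    if now - heartbeat_at > timedelta(minutes=stale_threshold_minutes):
        return "stale"  # would be reset by the monitor
    if now - locked_at > timedelta(minutes=long_running_threshold_minutes):
        return "long_running"  # would only trigger a warning
    return "ok"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
crashed = classify_lock(now - timedelta(minutes=10), now - timedelta(minutes=5), now)
slow = classify_lock(now - timedelta(minutes=45), now - timedelta(seconds=20), now)
healthy = classify_lock(now - timedelta(minutes=1), now - timedelta(seconds=5), now)
print(crashed, slow, healthy)  # stale long_running ok
```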
Backup Settings¶
The backup worker syncs frames_raw/ to a remote destination after videos are extracted.
backup:
  enabled: false                  # Enable backup worker
  remote_dest: "user@host:/path"  # SSH destination or local path
  delete_after_backup: false      # Delete local frames after verified backup
  poll_interval: 300              # Seconds between backup checks
  verification_timeout: 1800      # Seconds for rsync verification
Note: Backup uses rsync over SSH. Ensure SSH keys are configured for passwordless access.
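For orientation, the sketch below builds the kind of rsync command such a worker might run. The exact flags the backup worker uses are not documented here, so the choice of `-a` and `--checksum`, the dry-run form of the verification pass, and the `build_rsync_command` helper are assumptions. The command is only constructed and printed, never executed.

```python
import shlex


def build_rsync_command(source_dir: str, remote_dest: str, verify: bool = False) -> list:
    """Build an rsync argument list for syncing `source_dir` to `remote_dest`.

    With verify=True the command becomes a checksum-based dry run that
    lists any files still differing, without transferring anything.
    """
    command = ["rsync", "-a", "--checksum"]
    if verify:
        command += ["--dry-run", "--itemize-changes"]
    # A trailing slash copies the directory's contents, not the directory itself.
    command += [source_dir.rstrip("/") + "/", remote_dest]
    return command


sync_cmd = build_rsync_command("./output/frames_raw", "user@host:/path")
verify_cmd = build_rsync_command("./output/frames_raw", "user@host:/path", verify=True)
print(shlex.join(sync_cmd))
print(shlex.join(verify_cmd))
```

In a real worker the list would be passed to `subprocess.run(..., timeout=verification_timeout)` so that a hung verification does not block the loop.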
Logging (Grafana + Loki)¶
logging:
  level: "INFO"                                       # DEBUG, INFO, WARNING, ERROR
  loki_url: "http://localhost:3100/loki/api/v1/push"  # Loki push endpoint
  log_dir: "output/logs"                              # Local log directory
Logs are automatically sent to:
- Console - Always enabled
- File - If LOG_FILE env var is set
- Loki - If python-logging-loki is installed and Loki is running
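The three destinations map naturally onto standard `logging` handlers. The sketch below is an approximation under stated assumptions: `build_logger` is a hypothetical function, the `application` tag is chosen to match the LogQL query shown on this page, and the check for a running Loki instance is omitted. The Loki handler is attached only if the optional `logging_loki` module (from python-logging-loki) can be imported.

```python
import logging
import os
import sys


def build_logger(
    name: str = "data_miner",
    level: str = "INFO",
    loki_url: str = "http://localhost:3100/loki/api/v1/push",
) -> logging.Logger:
    """Create a logger with console, optional file, and optional Loki output."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    # Console: always enabled.
    logger.addHandler(logging.StreamHandler(sys.stderr))

    # File: only when the LOG_FILE environment variable is set.
    log_file = os.environ.get("LOG_FILE")
    if log_file:
        logger.addHandler(logging.FileHandler(log_file))

    # Loki: only when python-logging-loki is installed.
    try:
        import logging_loki
    except ImportError:
        pass
    else:
        logger.addHandler(
            logging_loki.LokiHandler(
                url=loki_url, tags={"application": name}, version="1"
            )
        )
    return logger


logger = build_logger()
```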
Access logs in Grafana:
- Open http://localhost:3000
- Add Loki data source: http://loki:3100
- Use LogQL queries: {application="data_miner"}
Model ID Reference¶
| Stage | Model ID | Full HuggingFace Path |
|---|---|---|
| Filter | siglip2-so400m | google/siglip2-so400m-patch14-384 |
| Filter | siglip2-giant | google/siglip2-giant-opt-patch16-384 |
| Dedup | dinov3-base | facebook/dinov3-vitb16-pretrain-lvd1689m |
| Dedup | dinov2-large | facebook/dinov2-large |
| Detect | grounding_dino | IDEA-Research/grounding-dino-base |
| Detect | florence2 | microsoft/Florence-2-large |
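The table can be expressed as a lookup that turns a short model ID from the config into the full HuggingFace path. This is a sketch only: the `MODEL_IDS` mapping and `resolve_model_id` function are hypothetical names, and the mapping contains just the rows listed above, so IDs mentioned elsewhere on this page (such as `owlv2`) are rejected here.

```python
# Short model ID -> full HuggingFace path, grouped by pipeline stage.
MODEL_IDS = {
    "filter": {
        "siglip2-so400m": "google/siglip2-so400m-patch14-384",
        "siglip2-giant": "google/siglip2-giant-opt-patch16-384",
    },
    "dedup": {
        "dinov3-base": "facebook/dinov3-vitb16-pretrain-lvd1689m",
        "dinov2-large": "facebook/dinov2-large",
    },
    "detect": {
        "grounding_dino": "IDEA-Research/grounding-dino-base",
        "florence2": "microsoft/Florence-2-large",
    },
}


def resolve_model_id(stage: str, model_id: str) -> str:
    """Return the full HuggingFace path for a short model ID."""
    try:
        return MODEL_IDS[stage][model_id]
    except KeyError:
        raise ValueError(f"unknown model {model_id!r} for stage {stage!r}") from None


print(resolve_model_id("filter", "siglip2-so400m"))
```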
Next Steps¶
- CLI Reference - Available commands
- Quickstart - Run the pipeline