1d ago
Infrastructure Engineer (Observability)
New York, New York, United States
$180k-$200k / year
full-time · senior · Remote · ai-ml
💼 About This Role
You'll own and evolve observability platforms spanning metrics, logs, traces, and alerting for large-scale, GPU-enabled infrastructure. You'll productize observability for both internal teams and external customers, enabling multi-tenant monitoring experiences. This role offers the chance to work at the intersection of infrastructure, data, and product in a fast-growing AI company.
🎯 What You'll Do
- Own and evolve a scalable observability platform (metrics, logs, traces, events)
- Drive productization of observability for internal teams and external customers
- Design multi-tenant observability systems with RBAC and customer-facing visibility
- Design telemetry pipelines ingesting GPU, CPU, networking, and container data
📋 Requirements
- 5+ years in infrastructure engineering, SRE, or observability roles
- Experience with Prometheus, Grafana, ELK, or VictoriaMetrics
- Proficiency in Python, Go, or Bash for automation and data integration
- Familiarity with Kubernetes observability
✨ Nice to Have
- Experience with GPU observability (NVIDIA DCGM)
- Experience with InfiniBand fabric observability
- Experience building customer-facing or productized infrastructure systems
🎁 Benefits & Perks
- 🏥 Comprehensive medical, dental, and vision coverage
- 💰 Discretionary bonus
- 📈 Equity component
- 🏖️ Flexible remote/hybrid work
- 🌍 Global offices in NYC, SF, Seattle, London
📨 Hiring Process
Estimated timeline: 2-3 weeks (AI estimate)
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Hiring Manager Interview · 45 min