1d ago

Infrastructure Engineer (Observability)

New York, New York, United States

$180k-$200k / year

full-time · senior · Remote · ai-ml

💼 About This Role

You'll own and evolve observability platforms spanning metrics, logs, traces, and alerting for large-scale, GPU-enabled infrastructure. You'll productize observability for both internal teams and external customers, enabling multi-tenant monitoring experiences. This role offers the chance to work at the intersection of infrastructure, data, and product in a fast-growing AI company.

🎯 What You'll Do

  • Own and evolve a scalable observability platform (metrics, logs, traces, events)
  • Drive productization of observability for internal teams and external customers
  • Design multi-tenant observability systems with RBAC and customer-facing visibility
  • Design telemetry pipelines ingesting GPU, CPU, networking, and container data

📋 Requirements

  • 5+ years in infrastructure engineering, SRE, or observability roles
  • Experience with Prometheus, Grafana, ELK, or VictoriaMetrics
  • Proficiency in Python, Go, or Bash for automation and data integration
  • Familiarity with Kubernetes observability

✨ Nice to Have

  • Experience with GPU observability (NVIDIA DCGM)
  • Experience with InfiniBand fabric observability
  • Experience building customer-facing or productized infrastructure systems

🎁 Benefits & Perks

  • 🏥 Comprehensive medical, dental, and vision coverage
  • 💰 Discretionary bonus
  • 📈 Equity component
  • 🏖️ Flexible remote/hybrid work
  • 🌍 Global offices in NYC, SF, Seattle, London

📨 Hiring Process

Estimated timeline: 2-3 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Hiring Manager Interview · 45 min