Infrastructure Engineer (Observability) at Lightning AI

New York, New York, United States

$180k-$200k / year

full-time · senior · Remote · ai-ml

💼 About This Role

You'll own and evolve observability platforms spanning metrics, logs, traces, and alerting for a large-scale GPU-enabled infrastructure. You'll productize observability for both internal teams and external customers, enabling multi-tenant monitoring experiences. This role offers the chance to work at the intersection of infrastructure, data, and product in a fast-growing AI company.

🎯 What You'll Do

Own and evolve a scalable observability platform (metrics, logs, traces, events)
Drive productization of observability for internal teams and external customers
Design multi-tenant observability systems with RBAC and customer-facing visibility
Design telemetry pipelines ingesting GPU, CPU, networking, and container data

📋 Requirements

5+ years in infrastructure engineering, SRE, or observability roles
Experience with Prometheus, Grafana, ELK, or VictoriaMetrics
Proficiency in Python, Go, or Bash for automation and data integration
Familiarity with Kubernetes observability

✨ Nice to Have

Experience with GPU observability (NVIDIA DCGM)
Experience with InfiniBand fabric observability
Experience building customer-facing or productized infrastructure systems

🎁 Benefits & Perks

🏥 Comprehensive medical, dental, and vision coverage
💰 Discretionary bonus
📈 Equity component
🏖️ Flexible remote/hybrid work
🌍 Global offices in NYC, SF, Seattle, London

📨 Hiring Process

Estimated timeline: 2-3 weeks · AI estimate

1. Recruiter Screen · 30 min
2. Technical Interview · 60 min
3. Hiring Manager Interview · 45 min

About Lightning AI

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction. We recently merged with Voltage Park to create the first cloud built for AI! We operate globally in New York City, San Francisco, Seattle, and London, and…
