Principal Production Engineer
San Francisco, CA - US
$261k-$326k / year
full-time · lead · ai-ml
About This Role
You'll own reliability, scalability, and operational excellence of Crusoe's cloud infrastructure across compute, storage, and networking. You'll define observability strategy and drive reliability standards across the organization, making the people around you meaningfully better. This high-ownership role has scope that expands every quarter.
What You'll Do
- Own the reliability and scalability of cloud infrastructure, defining SLOs and leading incident response.
- Build and mature observability and tooling for network, storage, and control plane.
- Drive platform reliability improvements across the full cloud stack.
- Act as trusted advisor to senior leadership on observability and reliability strategy.
Requirements
- 15+ years in infrastructure, networking, or production engineering at internet-scale companies.
- Deep expertise in observability: building telemetry pipelines and instrumenting distributed systems.
- Strong systems fundamentals across Linux, distributed systems, storage, and compute scheduling.
- Ability to write code to automate, instrument, and build tooling.
Nice to Have
- Deep networking expertise (BGP, OSPF, ECMP, load balancing, low-latency design).
- Experience with HPC infrastructure (GPU clusters, Slurm, Kubernetes, InfiniBand, RoCE).
- Prior principal or staff IC experience influencing org-level technical strategy.
Benefits & Perks
- Industry competitive pay
- Restricted Stock Units
- Health insurance (HDHP, PPO, vision, dental)
- Paid Parental Leave
- Generous PTO and holiday schedule
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Hiring Manager Interview · 60 min
Heads Up
- The requirements list 15+ years of experience, which is unusually high.
- The role mixes production engineering, observability, and strategic advisory work, so the scope may be broad.