Staff / Senior Software Engineer, Compute Capacity
San Francisco, CA | New York City, NY
Full-time | Senior | Artificial Intelligence
Description
You will build production systems that power Anthropic's large accelerator fleet, including data pipelines, observability tooling, and compute efficiency instrumentation. Your work will directly influence decisions about one of the company's largest areas of spend, and you will collaborate with research engineering, infrastructure, and finance teams.
Requirements
- Experience building production-quality data pipelines (e.g., using BigQuery, Airflow, or similar).
- Proficiency with Kubernetes-native infrastructure and operations at scale.
- Experience with observability tooling (Prometheus, Grafana) and instrumentation.
- Ability to work across data engineering, systems engineering, and observability.
- Comfort working in a high-autonomy, high-ambiguity environment.
Responsibilities
- Build and operate data pipelines ingesting accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery.
- Own data completeness, latency SLOs, gap detection, and backfill automation.
- Develop and maintain observability infrastructure (Prometheus recording rules, Grafana dashboards, alerting systems) for fleet health, occupancy, and efficiency.
- Instrument and analyze compute efficiency metrics across training, inference, and eval workloads; build benchmarking infrastructure and baselines.
- Build internal tooling and platforms for capacity planning, workload attribution, and cluster debugging.