Senior Cloud Platform Engineer
Palo Alto, California, United States
Full-time · Senior · Artificial Intelligence
Description
You will own the reliability, performance, and scalability of our AI Inferencing Service. You'll bridge development and operations, ensuring exceptional uptime, low latency, and efficient resource utilization through monitoring, automation, and incident management.
Requirements
- Experience as a Senior SRE or Platform Engineer focusing on AI/ML services
- Expertise in cloud infrastructure (AWS, GCP, Azure) and infrastructure-as-code (IaC) tools such as Terraform and Ansible
- Proficiency in monitoring tools (Prometheus, Grafana, Datadog)
- Strong skills in automation, CI/CD, and scripting languages
- Ability to define SLOs and SLIs and to plan capacity
Responsibilities
- Own production inferencing service availability, latency, performance, and capacity planning across regions
- Participate in a 24/7 on-call rotation that uses a follow-the-sun model
- Lead incident response and blameless post-mortems
- Develop monitoring, alerting, and dashboards using Prometheus, Grafana, and Datadog
- Automate CI/CD pipelines and infrastructure using Terraform and Ansible