Senior Cloud Platform Engineer

Palo Alto, California, United States
Full-time · Senior · Artificial Intelligence

Description

You will own the reliability, performance, and scalability of our AI Inferencing Service. You'll bridge development and operations, ensuring exceptional uptime, low latency, and efficient resource utilization through monitoring, automation, and incident management.

Requirements

  • Experience as a Senior SRE or Platform Engineer focusing on AI/ML services
  • Expertise in cloud infrastructure (AWS, GCP, Azure) and IaC tools
  • Proficiency in monitoring tools (Prometheus, Grafana, Datadog)
  • Strong skills in automation, CI/CD, and scripting languages
  • Ability to define SLOs/SLIs and manage capacity planning

Responsibilities

  • Own production inferencing service availability, latency, performance, and capacity planning across regions
  • Participate in a 24/7 on-call rotation that follows a follow-the-sun model
  • Lead incident response and blameless post-mortems
  • Develop monitoring, alerting, and dashboards using Prometheus, Grafana, and Datadog
  • Automate CI/CD pipelines and infrastructure using Terraform and Ansible