Sr. Site Reliability Engineer

Washington, DC, USA
In-Office
Senior level
Big Data • Analytics • Business Intelligence • Big Data Analytics
The Role
Seeking a Site Reliability Engineer to manage AI platform reliability, automate tasks, optimize ML pipelines, and lead incident response in a hybrid engineering role.
Role Overview

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps—bridging the gap between model development and production-grade reliability.

Key Responsibilities

1. Reliability & Performance Engineering
  • SLA/SLO Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
  • Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
  • Scalability: Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.
2. MLOps & AI Infrastructure
  • Model Serving Reliability: Ensure the high availability of Vertex AI endpoints and custom inference services.
  • GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
  • Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.
3. Automation & Orchestration (Eliminating "Toil")
  • Infrastructure as Code (IaC): Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
  • CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
  • Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, self-healing mechanisms, and resource cleanup.
4. Monitoring, Alerting & Incident Response
  • Observability: Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud's operations suite (formerly Stackdriver).
  • Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
  • Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Requirements

Orchestration: Expert-level knowledge of Kubernetes (K8s) and Docker.

MLOps Stack: Familiarity with tools such as Kubeflow, Vertex AI, MLflow, or DVC.

Scripting: Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.

Data Systems: Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or Vector Databases like Pinecone/Milvus).

Networking: Solid understanding of VPCs, Load Balancers, DNS, and secure service mesh (Istio/Anthos).


Benefits

Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.

Tiger Analytics provides equal employment opportunities to applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, pregnancy, national origin, ancestry, marital status, protected veteran status, disability status, or any other basis as protected by federal, state, or local law.

Skills Required

  • Expert-level knowledge of Kubernetes and Docker
  • Familiarity with ML tools like Kubeflow and Vertex AI
  • Strong proficiency in Python and Bash
  • Experience managing reliability of data-intensive services
  • Solid understanding of networking principles

Tiger Analytics Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Tiger Analytics and has not been reviewed or approved by Tiger Analytics.

  • Fair & Transparent Compensation: Feedback suggests pay is viewed as fair and market-aligned for many roles and geographies. Consistent, on-time pay and competitive packages in key markets reinforce a generally positive baseline.
  • Healthcare Strength: Feedback suggests U.S. medical coverage is strong, with administration via a known benefits platform and plan options seen positively. Health insurance is often regarded as a bright spot within the package.
  • Leave & Time Off Breadth: Feedback suggests generous PTO, paid sick days and holidays, and flexible PTO alongside remote-work options. These elements indicate broad time-off provisions available on paper.

The Company
Bengaluru, India
5,000 Employees
Year Founded: 2011

What We Do

Tiger Analytics is a global leader in AI and analytics, helping Fortune 1000 companies solve their toughest challenges. We offer full-stack AI and analytics services and solutions that empower businesses to achieve real outcomes and value at scale. We are on a mission to push the boundaries of what AI and analytics can do to help enterprises navigate uncertainty and move forward decisively. Our purpose is to provide certainty to shape a better tomorrow. Our team of 4000+ technologists and consultants is based in the US, Canada, the UK, India, Singapore, and Australia, working closely with clients across CPG, Retail, Insurance, BFS, Manufacturing, Life Sciences, and Healthcare. We are Great Place to Work-Certified™ and have been recognized by analyst firms including Forrester, Gartner, Everest, ISG, and HFS. We have also been ranked on the ‘Best’ and ‘Fastest Growing’ analytics firms lists by Inc., Financial Times, Economic Times, and Analytics India Magazine. In India, our offices are located in Chennai, Hyderabad, and Bangalore.
