Senior Site Reliability Engineer

Reposted 24 Days Ago
Bengaluru, Bengaluru Urban, Karnataka, IND
In-Office
Senior level
Database
The Role
As a Senior Site Reliability Engineer at Nexla, you'll manage AWS EKS infrastructure, implement CI/CD pipelines, and ensure system reliability while collaborating with engineering teams.
Summary Generated by Built In
About Nexla

Nexla is the leading integration platform, built with AI, for AI. Nexla takes a metadata-driven approach to converge diverse integrations across Data, Documents, Agents, Applications, and APIs into a single design pattern. We accelerate the development of solutions for GenAI, Analytics, and Inter-company data. Nexla makes data users and developers up to 10x more productive by delivering a true blend of no-code, low-code, and pro-code interfaces.

Leading companies including DoorDash, LinkedIn, Johnson & Johnson, and LiveRamp trust Nexla for mission-critical data. Nexla was named in the 2022, 2023, and 2024 Gartner Magic Quadrant™ for Data Integration Tools and is top-rated by customers on Gartner Peer Insights. The company is headquartered in San Mateo, California.

At Nexla, our culture is built around our core values: Have Empathy, Be Curious, Be Intellectually Honest, Achieve Excellence, and Remember to Relax. We put our customers at the heart of everything we do, foster a data-driven mindset, take ownership of our work, and believe in the power of teamwork to achieve ambitious goals.

Role
You will own the reliability of the distributed data systems at the heart of Nexla - the streaming runtime and processing engines that move hundreds of billions of rows per day for top-tier enterprises. This is an SRE role for our big data stack: Kafka, Spark, Flink, Ray, Redis, and data warehouses, all running on Kubernetes.

This is not a cloud-provisioning role. We are looking for someone who has lived inside stateful, high-throughput systems in production; who has chased down a broker outage, a checkpoint stall, a crashlooping cache, and a sink that silently stopped writing; and who fixes the architecture rather than the symptom. If keeping a large, busy data platform alive and fast is the kind of problem you find satisfying, you will have a lot of fun working with us. This is a unique opportunity to shape the foundation of a product that is defining the next wave of intelligent, context-aware data movement.

Responsibilities

  • Streaming & Data Plane Reliability: Own the health of our Kafka-based runtime (managed via Strimzi on Kubernetes) - broker health, topic lifecycle and count management, partition and throughput tuning, certificate/secret rotation, and version upgrades - at a scale of hundreds of thousands of topics and hundreds of billions of rows per day.
  • Distributed Processing Engines: Operate and tune distributed system workloads in production in collaboration with backend teams - resource allocation, autoscaling, checkpointing, backpressure, and failure recovery for both batch and streaming jobs.
  • Stateful Services: Run Redis clusters and other stateful systems reliably - failover, persistence, liveness/readiness tuning, and capacity planning under heavy and bursty load.
  • Kubernetes & Operators: Take end-to-end ownership of Amazon EKS, Google GKE and the operators (Strimzi and others) running our stateful data workloads - cluster lifecycle, scaling, version upgrades, and resource governance.
  • Observability: Build deep, data-aware monitoring - consumer lag, throughput, partition skew, job latency, error rates - not just host and CPU metrics. Make the data plane's behavior legible before it breaks.
  • Incident Management: Lead root-cause analysis for distributed-systems failures (broker outages, crashloops, sink decommissions, control-plane race conditions) and drive durable fixes. Mitigate fast, but design out the recurrence.
  • Infrastructure as Code & Automation: Provision and manage cloud infrastructure with Terraform; build operational runbooks and automation, including for air-gapped / private enterprise installs (pre-staged images, operator-facing procedures).
  • Collaboration: Partner with platform, runtime, and connector engineering - and with SREs and support - to ship and scale new data-movement features reliably in a large-scale Linux environment.

Qualifications

  • Experience: 8+ years in infrastructure, SRE, or DevOps, with significant time spent operating production distributed data systems (not just application/cloud infra).
  • Kafka: Deep, hands-on operational experience running Kafka at scale in production - ideally on Kubernetes via Strimzi - including upgrades, topic/partition management, performance tuning, and TLS/secret rotation.
  • Distributed Processing (Strong Plus): Production experience operating one or more of Spark, Flink, or Ray - resource tuning, checkpointing, failure recovery.
  • Stateful Systems (Must Have): Production experience with Redis (clustering, persistence, failover) and a solid understanding of operating stateful workloads on Kubernetes (StatefulSets, PVCs, probes, operators).
  • Data Warehouses: Familiarity operating against Snowflake, BigQuery, or similar, and an understanding of JDBC connectivity and sink reliability.
  • Kubernetes & EKS: Strong hands-on EKS experience - cluster creation, scaling, version upgrades, and operator management.
  • Infrastructure as Code: Advanced proficiency with Terraform.
  • Programming: Proficiency in Python (or similar) for automation and tooling. Comfort reading and debugging JVM-based systems is a strong plus.
  • Reliability Mindset: Demonstrated ownership of incident management, RCA, capacity planning, and performance tuning for high-throughput systems.
  • CI/CD: Solid understanding of CI/CD methodology (Jenkins, GitHub Actions, or GitLab CI) for containerized and non-containerized apps. This is a supporting skill, not the core of the role.
  • Nice to Have: Configuration management (Ansible preferred); broader AWS services (IAM, VPC, EC2, S3, Lambda); AWS CloudFormation.
  • Soft Skills: Excellent communication and organizational skills; ability to coordinate effectively within a team and with customers.

Why This Might Be Worth It

  • You own the hard part. The stateful, distributed systems that move billions of rows are the platform's most demanding reliability problems - and they'd be yours.
  • Impact at scale from day one. Your work keeps mission-critical data flowing for companies like DoorDash and LinkedIn.
  • The AI wave is real for us. We're not bolting AI onto a legacy product. Intelligent connectors, context-aware data movement, and agentic workflows are the core of what we're building next - on top of the runtime you'd run.
  • Small team, big problems. Direct access to the CTO, real influence over product direction, and the autonomy to make significant technical bets.
  • Recognized platform, startup energy. Enterprise validation with the speed and ownership of an early-stage company.

Location
Pune (preferred) or Bengaluru

Why Build Your Future at Nexla?

We are standing at the precipice of the GenAI revolution, but the biggest bottleneck isn't the models; it's the data. By joining Nexla, you aren’t just entering a company; you are stepping into the critical layer of the modern data stack that powers the AI economy. We are the Data Fabric that enables industry titans like LinkedIn, DoorDash, and J&J to turn messy, siloed data into ready-to-use products for RAG and predictive models. This is your opportunity to move beyond simple tooling and build the actual infrastructure that democratizes data access for the next decade of innovation. If you want to solve the hardest problems in data engineering and own a piece of a market projected to hit billions, your career belongs here.

Skills Required

  • 8+ years of experience in DevOps and SRE
  • Proven experience in creating and upgrading Amazon EKS clusters
  • Deep understanding of CI/CD methodology
  • Advanced proficiency in Terraform
  • Experience with Ansible for configuration management
  • Strong understanding of Linux systems administration and networking
  • Hands-on experience with AWS services like IAM, VPC, EC2, S3
  • Proficiency in at least one programming language like Python
  • Excellent communication and organizational skills

The Company
HQ: San Mateo, CA
37 Employees
Year Founded: 2016

What We Do

Nexla is the leader in unified data operations and a 2021 Gartner Cool Vendor. Our platform makes it simple for anyone to create scalable data flows. Teams working with data get a no/low-code unified experience to integrate, transform, provision, and monitor data for any use case. Data users with varying skill levels work collaboratively to create ready-to-use data products. Organizations get zero-friction, governed, and agile data operations. To learn more, visit https://www.nexla.com
