Nexla

Senior Site Reliability Engineer

Bengaluru, Bengaluru Urban, Karnataka, IND

In-Office

Senior level

Database

The Role

As a Senior DevOps Engineer at Nexla, you'll manage AWS EKS infrastructure, implement CI/CD pipelines, and ensure system reliability while collaborating with engineering teams.

Summary Generated by Built In

About Nexla

Nexla is the leading integration platform, built with AI, for AI. Nexla takes a metadata-driven approach to converge diverse integrations across Data, Documents, Agents, Applications, and APIs into a single design pattern. We accelerate the development of solutions for GenAI, Analytics, and Inter-company data. Nexla makes data users and developers up to 10x more productive by delivering a true blend of no-code, low-code, and pro-code interfaces.

Leading companies including DoorDash, LinkedIn, Johnson & Johnson, and LiveRamp trust Nexla for mission-critical data. Nexla was named in the 2022, 2023, and 2024 Gartner Magic Quadrant™ for Data Integration Tools and is top-rated by customers on Gartner Peer Insights. The company is headquartered in San Mateo, California.

At Nexla, our culture is built around our core values: Have Empathy, Be Curious, Be Intellectually Honest, Achieve Excellence, and Remember to Relax. We put our customers at the heart of everything we do, foster a data-driven mindset, take ownership of our work, and believe in the power of teamwork to achieve ambitious goals.

Role
You will own the reliability of the distributed data systems at the heart of Nexla - the streaming runtime and processing engines that move hundreds of billions of rows per day for top-tier enterprises. This is an SRE role for our big data stack: Kafka, Spark, Flink, Ray, Redis, and data warehouses, all running on Kubernetes.

This is not a cloud-provisioning role. We are looking for someone who has lived inside stateful, high-throughput systems in production: someone who has chased down a broker outage, a checkpoint stall, a crashlooping cache, and a sink that silently stopped writing, and who fixes the architecture rather than the symptom. If keeping a large, busy data platform alive and fast is the kind of problem you find satisfying, you will have a lot of fun working with us. This is a unique opportunity to shape the foundation of a product that is defining the next wave of intelligent, context-aware data movement.

Responsibilities

Streaming & Data Plane Reliability: Own the health of our Kafka-based runtime (managed via Strimzi on Kubernetes) - broker health, topic lifecycle and count management, partition and throughput tuning, certificate/secret rotation, and version upgrades - at a scale of hundreds of thousands of topics and hundreds of billions of rows per day.
Distributed Processing Engines: Operate and tune distributed processing workloads in production, in collaboration with backend teams - resource allocation, autoscaling, checkpointing, backpressure, and failure recovery for both batch and streaming jobs.
Stateful Services: Run Redis clusters and other stateful systems reliably - failover, persistence, liveness/readiness tuning, and capacity planning under heavy and bursty load.
Kubernetes & Operators: Take end-to-end ownership of Amazon EKS, Google GKE and the operators (Strimzi and others) running our stateful data workloads - cluster lifecycle, scaling, version upgrades, and resource governance.
Observability: Build deep, data-aware monitoring - consumer lag, throughput, partition skew, job latency, error rates - not just host and CPU metrics. Make the data plane's behavior legible before it breaks.
Incident Management: Lead root-cause analysis for distributed-systems failures (broker outages, crashloops, sink decommissions, control-plane race conditions) and drive durable fixes. Mitigate fast, but design out the recurrence.
Infrastructure as Code & Automation: Provision and manage cloud infrastructure with Terraform; build operational runbooks and automation, including for air-gapped / private enterprise installs (pre-staged images, operator-facing procedures).
Collaboration: Partner with platform, runtime, and connector engineering - and with SREs and L2/Support - to ship and scale new data-movement features reliably and improve production monitoring in a large-scale Linux environment.

Qualifications

Experience: 8+ years in infrastructure, SRE, or DevOps, with significant time spent operating production distributed data systems (not just application/cloud infra).
Kafka: Deep, hands-on operational experience running Kafka at scale in production - ideally on Kubernetes via Strimzi - including upgrades, topic/partition management, performance tuning, and TLS/secret rotation.
Distributed Processing (Strong Plus): Production experience operating one or more of Spark, Flink, or Ray - resource tuning, checkpointing, failure recovery.
Stateful Systems (Must Have): Production experience with Redis (clustering, persistence, failover) and a solid understanding of operating stateful workloads on Kubernetes (StatefulSets, PVCs, probes, operators).
Data Warehouses: Familiarity with operating against Snowflake, BigQuery, or similar, and an understanding of JDBC connectivity and sink reliability.
Kubernetes & EKS: Strong hands-on EKS experience - cluster creation, scaling, version upgrades, and operator management.
Infrastructure as Code: Advanced proficiency with Terraform.
Programming: Proficiency in Python (or similar) for automation and tooling. Comfort reading and debugging JVM-based systems is a strong plus.
Reliability Mindset: Demonstrated ownership of incident management, RCA, capacity planning, and performance tuning for high-throughput systems.
CI/CD: Solid understanding of CI/CD methodology (Jenkins, GitHub Actions, or GitLab CI) for containerized and non-containerized apps. Supporting, not the core of the role.
Nice to Have: Configuration management (Ansible preferred); broader AWS services (IAM, VPC, EC2, S3, Lambda); AWS CloudFormation.
Soft Skills: Excellent communication and organizational skills; ability to coordinate effectively within a team and with customers.

Why This Might Be Worth It

You own the hard part. The stateful, distributed systems that move billions of rows are the platform's most demanding reliability problems - and they'd be yours.
Impact at scale from day one. Your work keeps mission-critical data flowing for companies like DoorDash and LinkedIn.
The AI wave is real for us. We're not bolting AI onto a legacy product. Intelligent connectors, context-aware data movement, and agentic workflows are the core of what we're building next - on top of the runtime you'd run.

Small team, big problems. Direct access to the CTO, real influence over product direction, and the autonomy to make significant technical bets.
Recognized platform, startup energy. Enterprise validation with the speed and ownership of an early-stage company.

Location
Pune (preferred) or Bengaluru

Why Build Your Future at Nexla?
We are standing at the precipice of the GenAI revolution, but the biggest bottleneck isn't the models, it's the data. By joining Nexla, you aren’t just entering a company; you are stepping into the critical layer of the modern data stack that powers the AI economy. We are the Data Fabric that enables industry titans like LinkedIn, DoorDash, and J&J to turn messy, siloed data into ready-to-use products for RAG and predictive models. This is your opportunity to move beyond simple tooling and build the actual infrastructure that democratizes data access for the next decade of innovation. If you want to solve the hardest problems in data engineering and own a piece of a market projected to hit billions, your career belongs here.

Skills Required

8+ years of experience in DevOps and SRE
Proven experience in creating and upgrading Amazon EKS clusters
Deep understanding of CI/CD methodology
Advanced proficiency in Terraform
Experience with Ansible for configuration management
Strong understanding of Linux systems administration and networking
Hands-on experience with AWS services like IAM, VPC, EC2, S3
Proficiency in at least one programming language like Python
Excellent communication and organizational skills

The Company

HQ: Millbrae, CA

37 Employees

Year Founded: 2016

What We Do

Nexla is the leader in unified data operations and a 2021 Gartner Cool Vendor. Our platform makes it simple for anyone to create scalable data flows. Teams working with data get a unified no-code/low-code experience to integrate, transform, provision, and monitor data for any use case. Data users with varying skill levels work collaboratively to create ready-to-use data products. Organizations get zero-friction, governed, and agile data operations. To learn more, visit https://www.nexla.com