Senior Software Engineer - Infrastructure Reliability

Bangalore, Bengaluru Urban, Karnataka, IND
Hybrid
Senior level
Software
The Role
As a Senior Software Engineer, you'll improve platform reliability, investigate production failures, and collaborate with teams on solutions. Key technologies include Go, RabbitMQ, Kubernetes, and cloud services.

At JFrog, we’re reinventing DevOps to help the world’s greatest companies innovate -- and we want you along for the ride. This is a special place with a unique combination of brilliance, spirit and just all-around great people. Here, if you’re willing to do more, your career can take off. And since software plays a central role in everyone’s lives, you’ll be part of an important mission. Thousands of customers, including the majority of the Fortune 100, trust JFrog to manage, accelerate, and secure their software delivery from code to production -- a concept we call “liquid software.” Wouldn't it be amazing if you could join us in our journey?

Location: Bangalore (Hybrid)

Position Overview

We are seeking a Senior Software Engineer to join our Security Product team, focused on improving the reliability and resilience of our platform across customer environments. You will be embedded within the engineering team, investigating system outages and failures, identifying recurring patterns, and driving fixes - either independently or in collaboration with service owners. You will work closely with production engineering and SRE teams to build playbooks, conduct post-incident reviews, and ensure problems are properly addressed at their root cause.

Key Responsibilities

• Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP).

• Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps.

• Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved.

• Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution.

• Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches.

• Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations.

• Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability.

Qualifications

• 7+ years of experience in software engineering, with at least 3 years focused on debugging and solving infrastructure-level problems in distributed systems.

• Strong proficiency in Go; familiarity with Python and Helm is a plus.

• Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting.

• Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker.

• Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through.

• Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP).

• Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure.

• Excellent analytical and problem-solving skills with a methodical approach to debugging.

• Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams.

Good to have

• Experience with artifact management or software supply chain tools (e.g., JFrog Artifactory, JFrog Xray).

• Experience with observability stacks (Prometheus, Grafana, ELK/OpenSearch, Coralogix).

• Experience with infrastructure-as-code tools (Terraform, Helm, Ansible).

• Prior experience in a customer-facing technical role (escalation engineering, support engineering, or field engineering).

• Familiarity with AI-assisted development tools - experience with skills, rules, hooks, and setting up Agents for developer workflows.

Application Instructions

Interested candidates must submit their latest resume and a cover letter detailing their relevant experience.

Applications may be submitted via the company's career page.

NOTE: This is a hybrid role (3 days per week mandatory work from office). We are located in Bellandur, Bangalore.

Skills Required

  • 7+ years of experience in software engineering
  • 3+ years focused on debugging infrastructure problems
  • Strong proficiency in Go
  • Deep experience with RabbitMQ or similar message brokers
  • Solid knowledge of Kubernetes and Docker
  • Strong understanding of Linux systems and cloud infrastructure
  • Experience investigating production incidents
  • Excellent analytical and problem-solving skills

The Company
HQ: Sunnyvale, California
1,603 Employees
Year Founded: 2008

What We Do

JFrog Ltd. (Nasdaq: FROG) is on a mission to create a world of software delivered without friction from developer to device. Driven by a “Liquid Software” vision, the JFrog Software Supply Chain Platform is a single system of record that enables organizations to build, manage, and distribute software quickly and securely, ensuring it is available, traceable, and tamper-proof. The integrated security features also help identify, protect against, and remediate threats and vulnerabilities. JFrog’s hybrid, universal, multi-cloud platform is available as both self-hosted and SaaS services across major cloud service providers. Millions of users and 7K+ customers worldwide, including a majority of the FORTUNE 100, depend on JFrog solutions to securely embrace digital transformation. Once you leap forward, you won’t go back!
