Site Reliability Engineer
Anduril is a defense technology company, bringing Silicon Valley talent and funding to the defense sector. Our technology helps our customers solve their toughest challenges by enabling them to make better, more informed decisions in life-and-death situations. We’ve assembled a diverse team of experts in artificial intelligence, computer vision, sensor fusion, optics, and data analysis who are creating software and hardware solutions to radically evolve the capabilities of the United States and our allies. If you are passionate about solving problems that have real impact, come join Anduril and build the future of defense.
As a Site Reliability Engineer (SRE), you will work across teams to continually improve and maintain high-impact production systems. You should have an aptitude for debugging and an appetite for real-time response, rapid resolution, and root-causing complex issues across code and backing infrastructure. You will have the opportunity to optimize existing systems and to spearhead building and maintaining new systems from the ground up. An SRE at Anduril thinks proactively, resolves issues before they surface, seizes opportunities to automate, and is the first line of defense for the reliability of existing systems. As part of the Anduril SRE team, you will have significant impact on and ownership of the reliability of systems that are critical to our national security, and you will tackle the complex and unique challenges that come with that responsibility.
WHAT YOU'LL DO
- Represent and champion a resiliency culture throughout the software org
- Interface with business and product teams to identify and solve system or product reliability issues
- Improve the resilience of Anduril's tech stack and deployments by hardening our observability and monitoring pipelines, scaling our systems, and maturing our release process
- Triage and root-cause incidents, and take ownership of driving postmortem actions
- Ensure stable releases to customer environments
REQUIRED QUALIFICATIONS
- Experience with on-call work and with working in and updating production environments
- Experience with tools such as (but not limited to) Kubernetes, Docker, Terraform, and AWS
- Experience troubleshooting vague issues in a microservice architecture
- U.S. Person status is required, as this position requires access to export-controlled data
WHAT A SUCCESSFUL CANDIDATE LOOKS LIKE
- Loves solving issues independently, decisively, and carefully in production systems
- Passionate about automating manual tasks and constantly questions how operational processes can be improved or automated
- Enjoys incident response: quarterbacking outages, diving into unfamiliar systems (internal products or open source), and prioritizing time-to-resolution
- Wants to deeply understand complex systems to identify brittleness or scaling limitations, and collaborates with the respective engineering teams to drive long-term improvements
- Not satisfied with a fix until the issue is resolved for the end user
Anduril is an equal-opportunity employer, and we encourage candidates from all backgrounds to apply. If you are passionate about working on problems that matter, we’d love to hear from you!