[Foundations] Staff Site Reliability Engineer

Checkr | Remote

Sorry, this job was removed at 11:09 a.m. (CST) on Tuesday, May 17, 2022

Checkr’s mission is to build a fairer future by designing technology to create opportunities for all. We believe all candidates, regardless of who they are, should have a fair chance to work. Established in 2014 and valued at $5B, Checkr is using technology to bring hiring to the next level. Our People Trust Platform uses machine learning to help thousands of companies modernize their background check process and make hiring safer, more efficient, and more inclusive. Some of our customers include Uber, Instacart, DoorDash, Netflix, Compass Group, and Adecco.

A career with Checkr is an opportunity to work with some of the best and brightest minds, disrupt an industry for a better future, and give otherwise overlooked candidates access to employment. Checkr has been recognized in Forbes Best Startup Employers and is a top Y Combinator company by valuation.

We’re looking for a Staff Site Reliability Engineer (SRE) with extensive observability experience. In this role you will help lead the administration of tools like Datadog, Sentry, and PagerDuty, identify strategies to improve our full-stack telemetry and monitoring capabilities, and mentor other SREs who contribute to observability-related work.

SREs work cross-functionally with Core Infrastructure, Platform and Product Engineering, combining operations work with software engineering principles to enable high availability of Checkr’s production systems. You will serve as a partner to our Product Engineering teams to help make their services more performant, scalable, observable, and reliable. We believe every engineering team at Checkr should be responsible for the software it builds, and SREs play a critical part in providing the tools, practices, and expertise to make that happen.

We are growing and evolving the SRE team to help meet Checkr’s product-first reliability goals for 2022 and beyond. Having established a strong foundation--including a containerized microservices architecture (AWS, Kong, Kubernetes, Kafka, MySQL and MongoDB), CI/CD, full-stack monitoring, structured incident response and a blameless postmortem culture--we are focused on implementing new capabilities like:

Automating observability and alerting across an ever-changing landscape of microservices
Automated Service Reliability Scorecards and Production Readiness Standards
Chaos Engineering and Game Day Simulations to discover and test fixes for weak spots that would otherwise not be identified until a real-life production incident occurred
Software engineering project work, proposed and driven by individual SRE team members, to remove operational bottlenecks and increase velocity in ways we’ve never considered before

What a typical week may look like at Checkr

Expand and improve our observability and monitoring footprint
Collaborate with the Engineering Manager, other SREs and Cloud Infrastructure Engineers to create architectural plans, define project requirements and establish technical standards
Review the work of other team members, help them get unblocked, and provide mentoring
Address common operational challenges by building tools and automating tasks with scripts
Serve as the on-call incident commander to help debug and drive resolution of production reliability issues, contribute to the postmortem, and work to prevent recurrence
Participate in design and production reviews for new features, products, or infrastructure
Audit and tune the configuration of systems owned by other engineering teams
Plan for the growth of Checkr’s infrastructure and infrastructure reliability/resiliency

What we value in a Staff Site Reliability Engineer

SREs combine some level of experience in both software engineering and operations and may hail from a variety of backgrounds and job titles, including production or application engineers, software developers with a strong DevOps mindset, SysAdmins with solid systems and programming skills, and Cloud Infrastructure or DevOps engineers. We are looking for someone with the following experience:

10+ years working in a relevant role, including 3+ years of technical leadership experience mentoring more junior engineers
3+ years of experience architecting and administering observability stacks, either managed or self-hosted (e.g. Datadog, New Relic, Prometheus, Elastic Stack/ELK)
Operation of containerized microservices running on public cloud, asynchronous event processing, and databases
Strong command of Linux, Git and CI/CD pipelines
On-call support of highly available production systems
Design and build new tools to automate repetitive tasks, prevent incidents or improve TTR using an object-oriented programming language such as Python
Infrastructure as Code using tools like Terraform, Terragrunt or CloudFormation
Understand how application components interact, and contribute to architectural discussions
Unwavering commitment to operational security and best practices
Ownership: identify problems but also propose solutions, then go out and implement them--from submitting a merge request on another team’s repository to scoping out a new reliability project
Connection: motivated to help other teams improve their service reliability through reviews, pair programming, hands-on training and continuous improvement of tooling and service

United States, Remote

#LI-REMOTE

What you get

A fast-paced and collaborative environment
Learning and development allowance
Competitive compensation and opportunity for advancement
100% medical, dental, and vision coverage
Up to 25K reimbursement for fertility, adoption, and parental planning services
Flexible PTO policy
Monthly wellness stipend, home office stipend

Equal Employment Opportunities at Checkr

Checkr is committed to hiring talented and qualified individuals with diverse backgrounds for all of its tech, non-tech, and leadership roles. Checkr believes that the gathering and celebration of unique backgrounds, qualities, and cultures enriches the workplace.

Checkr also welcomes the opportunity to consider qualified applicants with prior arrest or conviction records. Checkr’s commitment to diversity extends to hiring talented individuals in spite of a prior criminal history in accordance with local, state, and/or federal laws, including San Francisco’s Fair Chance Ordinance.
