Essential Job Duties and Responsibilities
- Partner with engineering, DevOps, and product teams to understand system requirements, communicate reliability best practices, and embed a culture of shared ownership. Strong communication, empathy, and influence are key to success.
- Lead incident response efforts, facilitate root cause analysis, and drive continuous improvements post-incident. Requires composure under pressure, clear decision-making, and the ability to bring teams together in critical moments.
- Identify opportunities to reduce manual work by building and maintaining internal tools and automation pipelines. Emphasizes problem-solving, initiative, and a continuous improvement mindset.
- Leverage Datadog to enhance system visibility, improve alerting strategies, and ensure observability across services. Requires proactive thinking, a focus on end-user impact, and the ability to coach teams on effective usage of monitoring tools.
- Develop and maintain documentation including runbooks, service readiness guides, and knowledge articles to support operational excellence. Strong written communication and a focus on clarity are essential.
- Collaborate with teams to support scaling initiatives and optimize system performance using data-informed insights. Requires strategic thinking, collaboration, and attention to long-term growth needs.
Required Skills, Knowledge and Abilities
- Solid understanding of the Software Development Lifecycle (SDLC), including source control, defect tracking, automated build systems, and production control processes
- Strong knowledge of CI/CD and DevOps principles, tools, and integrations
- Hands-on experience with Amazon Web Services (AWS), including services such as DynamoDB, CloudFormation, CloudFront, S3, Route 53, and Lambda, as well as YAML configuration
- Proficiency with containerization and serverless technologies
- Experience with infrastructure as code and container orchestration tools, particularly Terraform and Kubernetes
- Strong understanding of observability concepts, including tracing, structured logging, and metrics
- Experience using application and infrastructure monitoring tools, specifically Datadog, to ensure system health and performance
- Familiarity with designing and implementing self-healing, fault-tolerant, and autoscaling systems
- Experience working with SQL and relational databases; familiarity with MongoDB Atlas is a plus
- Proficiency with Git and source control workflows; understanding of change management best practices
- Demonstrated problem-solving and analytical skills in fast-paced environments
- Excellent verbal and written communication skills, with the ability to explain complex technical topics to both technical and non-technical stakeholders
- Self-motivated with a strong sense of ownership, accountability, and follow-through
What We Do
GoodLeap is a sustainable home solutions marketplace. We provide simple, fast, and frictionless point-of-sale technology for countless mission-driven professionals serving millions of people who want to upgrade their homes and save money.
Our platform offers flexible ways for consumers to pay for a wide range of sustainable products, including solar panels, battery storage, smart home devices, modern HVAC systems, energy efficient windows, upgraded roofing, water-saving turf, and more.
We are committed to caring for the planet, building lasting relationships with our valued partners and customers, and delivering cutting-edge technology that enables more people to embrace a sustainable lifestyle.