Senior Site Reliability Engineer at ActiveCampaign (Remote)
What your day could consist of:
- Be the go-to person for reliability engineering across projects, incidents, and technical issues
- Establish engineering excellence in SRE by driving observability, scalability, high availability, reliability and sustainability of the platform
- Be available for major incidents impacting the platform and customers
- Drive and contribute to infrastructure-as-code projects
- Work on the AWS public cloud with a focus on developing and providing self-service infrastructure
- Work closely with geographically distributed product, security, and platform engineering teams to support and elevate them on technical challenges and process improvements
- Guide SRE on implementing automation and efficiencies in infrastructure patching to achieve compliance, and in upgrades to eliminate tech debt
- Measure and guide intervention against toil to ensure engineering time is protected for the SRE, DBRE, Security, Product Engineering, and DevOps Engineering teams
- Design, implement, and integrate management solutions to effectively manage the public cloud implementation (Docker, Kubernetes, service mesh) and the monolith application deployed across multiple regions worldwide, ensuring reliability, elasticity, performance, and security
- Support teams coming up to speed on new services they own
- Establish and mature standards and integrations for infrastructure management domains: logging, monitoring, configuration management, and orchestration
- Develop cloud and container management platform standards and capabilities, gain insight into the workflows of Product Development, Engineering, and Operations teams, ensure platform relevance, and drive adoption
- Collaborate with technical leadership and staff engineers across the organization to build a platform that caters to the evolving needs of product engineering and SaaS delivery
- Govern, optimize, and rightsize AWS and other SaaS tools
What is needed:
- 7+ years of software, DevOps, or site reliability engineering experience
- 3+ years of public cloud experience with a combination of cloud-native and open source tools
- Mandatory experience driving or contributing to infrastructure as code
- Mandatory experience with EKS or native Kubernetes (K8s) workloads
- Exposure to moving monolithic applications to Kubernetes is a plus
- Experience with Pulumi, CloudFormation, Terraform, or similar IaC tools
- Experience coding in Python, Ruby, PHP, Go, or shell
- Experience working with distributed in-memory datastores such as Redis and Memcached
- Experience working with globally distributed teams supporting multi-region instances
- Experience with operating systems and with hosting PHP/Python/MySQL-based SaaS applications
- Experience with high-volume, low-latency, high-throughput services
- Experience designing and implementing access control models, including authentication and authorization
- Experience with AWS IaaS and PaaS services is highly preferred
- Good communication and collaboration skills
- Self-motivated, with a strong sense of ownership of tasks
- Ability to lead and mentor 1-3 engineers working on focused SRE activities