Lead Site Reliability Engineer

Reonomy, an Altus Group Business

Sorry, this job was removed at 1:11 p.m. (CST) on Wednesday, May 18, 2022

Job Category:

Information Technology

Opportunity Awaits at Altus Group!

The opportunity

Reporting to the Cloud Engineering and Operations Manager, we need a Lead Site Reliability Engineer with an infrastructure operations background to join our team. We need a self-starter who is excited by the opportunity to support the migration, availability, security, and releases of multiple global cloud products. You may be a Team Lead or Manager looking to return to hands-on work, or a Senior SRE seeking the next step of leading the technical work. You will focus on improving the reliability, resiliency, and scalability of our customer-facing products in AWS Cloud. You will evaluate and evolve our current Cloud operational practices, procedures, and tooling. As part of the team, you will also respond to availability alerts and security issues and plan for timely updates and releases.

What’s in it for you?

Impact. You want to move into a technical lead role where you can apply your insights and experience in infrastructure operations to spearhead our reliability engineering and bridge the gap to Cloud DevOps. It’s a chance to be both strategic and hands-on, identifying a framework for optimal configuration, automation, and backups.

Career development. As the company grows, so will the advancement opportunities: progressing into team leadership while remaining hands-on, working on our full range of products, moving further into Cloud Systems or DevOps, or advancing to Solutions Architect.

Cutting-edge technology. We are undergoing an exciting digital transformation, working on the bleeding edge and moving to serverless technology. Our Agile teams are passionate about technology, involved at different levels, and cross-training on our products. You will work with generalist Cloud engineers, building and sharing a mutual foundation of knowledge in AWS, Site Reliability, and Cloud DevOps. There is endless scope for learning, including the opportunity to pursue your AWS Database Specialty certification, among others.

What you will do:

Site Reliability Planning and Execution. You will take ownership of service readiness from a Cloud reliability perspective for new AWS/public cloud technologies introduced in the enterprise. You will define the Cloud hosting service tiers (Gold, Silver, Bronze) that correspond to the level of service assurance required for an application, define and assign SLOs (Availability, Reliability, and Observability), and design dashboards that track compliance against the SLO targets.

Set enterprise expectations. You will create and execute a configuration management strategy and an automation strategy for the enterprise. You will implement a monthly Site Reliability service review and host a showcase call for leadership to highlight how we are tracking on Operational Site Reliability.

Performance improvement. You will gather and analyze metrics from systems and applications to assist in performance tuning and fault finding. You will implement dashboards in an Observability tool to help surface performance patterns that need attention and work with Development and QA teams to fix them. You will work with QA teams in performance testing and help them isolate performance bottlenecks.

Provide collaboration and insights. You will advise and guide leadership on technical solutions and participate directly in investigating performance and availability issues. You will advocate for better, unified practices and implementation across DevOps teams. In addition, you will participate in system design consulting, platform management, and capacity planning. You will analyze root cause analyses (RCAs) and create a service health scorecard to highlight opportunities for remediation.

Automation. You will build and integrate an automated incident-response playbook for every alert. You will reduce manual toil and apply software engineering skills to IT Operations, from automating OS patching to rolling out a configuration management tool and its capabilities.

You will have:

The education. You have a degree in Computer Science, Engineering, or Math. You hold the AWS Certified Solutions Architect - Associate certification, and you are pursuing a security certification.

The experience. You have worked as a Site Reliability Engineer. You come with a blend of engineering and cloud administration experience. You have the skill set to apply sound engineering principles, operational discipline, mature automation, and best practices, both those you have already put into practice and the latest in the industry, with a focus on availability, reliability, and performance.

The interpersonal skills. You can build trusting relationships at any level of the organization. You respect diverse approaches and can champion your own choices. You have flexible communication skills and are comfortable creating documentation and making presentations. You thrive working in interdisciplinary groups that bring teams together to build great products.

Come realize your potential at Altus Group!

Altus Group is committed to fostering an inclusive and accessible environment where employees feel valued and respected, and where every employee has the opportunity to realize their potential. We are committed to providing reasonable accommodations, if required, and will work with you to meet your needs. If you are a person with a disability and require assistance during the application process, please contact us at [email protected] or 416-641-9500.
