Senior Site Reliability Engineer at Unanet (Remote)
As a member of our Cloud Operations group, you will help define our transformation towards an enterprise SaaS solution, hosting numerous top-tier customers. With a quickly growing customer base, we need creative and dynamic engineers to help architect innovative solutions that ensure the best possible experience for our customers.
You will join a team of talented, fast-moving engineers and administrators involved in nearly every aspect of the SaaS delivery and customer experience lifecycle. We are looking for an engineer with a strong software development background who has experience building services that keep our Cloud Operations proactive and efficient. Your success will hinge on your ability to apply software engineering methods to operations and on a firm grasp of automation, cloud architectures, event monitoring, health checks, and metrics gathering. You should be passionate about solving problems and developing creative solutions that leverage automation.
What You’ll Do
- Provision, configure, and maintain a production environment that runs several application stacks in the cloud and scales to serve the thousands of customers using our products, as well as our internal Product team
- Automate the deployment and maintenance of cloud platform technologies
- Help improve the overall product by developing management automation and metrics analysis
- Integrate current scripts, automations, and functions spread across multiple tools into a coherent Cloud Control system
- Collaborate with Cloud Architecture team on automation initiatives
- Evaluate metrics for proactive trend analysis and alerting
- Create metrics-based performance dashboards for consumption by leadership and development teams
- Implement and oversee log management, data warehouse, and database operations, including management of Logging/Audit services
- Ensure all monitoring systems (infrastructure- and application-level) are in place, and report on availability and system health
- Implement strategies around disaster recovery and security for all sub-systems in infrastructure (e.g., web servers, database, queues, storage, network)
- Contribute to strategic and tactical plans for continued improvement of cloud architecture and operations
- Perform capacity management and load and scalability planning
- Help drive process improvements for service management, including outage/incident management, rollbacks, health checks, and reporting
- Assist management in development and optimization of operational cost models
- Assist in the enhancement of 24x7 performance monitoring, reporting, and response protocols
- With the support of Cloud Architecture and Product Development, provide on-call support outside of normal work hours/days
Your First 90 Days
Within your first 30 Days, as your familiarity with the product and pipeline grows, your responsibilities and influence will grow as well. You will immerse yourself in the daily operation of the production cloud environment, including provisioning new infrastructure, reviewing metrics and alerts, troubleshooting, and participating in blameless incident postmortems. You will become familiar with our tech stack for each product as well as our management and observability tech stacks.
Within your first 60 Days, working with the rest of the Cloud Operations team, you will be responsible for identifying procedures that are currently handled manually or are not fully automated. You will become familiar with existing automations and management services that can be enhanced. You will shadow Cloud Operations Engineers for each product line to gain an understanding of each product’s unique architecture and management needs.
Within your first 90 Days, you will collaborate with our Director of Cloud Operations to define goals for the transition of Cloud Operations to a true SRE practice. Working with our Cloud Architecture team, you will identify the gaps between lower and upper delivery environments. You, along with the rest of our Cloud Operations team, will be responsible for supporting production environments.
About You
- 2+ years of hands-on experience as a production SRE, managing an environment of 500+ containers across 50+ namespaces
- 4+ years of hands-on software development experience with applications and RESTful APIs architected for cloud
- Performance optimization experience, including troubleshooting and resolving network and server latency issues, performing hardware evaluation/selection tasks, and conducting performance vs. cost vs. time analysis
- 2+ years of experience with Kubernetes and Terraform
- 2+ years of experience with automation or scripting languages (e.g., Go, Python, Shell)
- Working knowledge of Agile development practices (e.g., Scrum, TDD)
- Detail-oriented, with excellent documentation skills and the ability to manage multiple priorities successfully
- Troubleshooting skills that range from diagnosing hardware/software issues to resolving large-scale failures within a complex infrastructure
Your Differentiators
- Bachelor’s Degree in Computer Science
- Experience implementing production Docker/Kubernetes environments
- Experience deploying and maintaining infrastructure in AWS
- Experience with relational databases (e.g., Oracle, Aurora, or Postgres)
- Experience with Splunk (or other log aggregation tools), Grafana, and Prometheus
Our Values
- We are a Team. Employees, customers, and partners working together.
- We are Customer-Focused. Customers are the heart of everything we do.
- We are Driven. Seeking exceptional outcomes.
- We Own our Success. Every employee has a stake in our company.
- We do the right thing and have fun in the process.
The salary range for this opportunity is $132,000 - $145,000 per year. You will be eligible for employee equity as well as discretionary bonus compensation, subject to plans that may be in effect from time to time. You will further be eligible to participate in Unanet's employee benefits plans and programs. For more details on Unanet's benefits offerings, please visit https://unanet.com/employee-benefits.
Unanet is proud to be an Equal Opportunity Employer. Applicants will be considered for positions without regard to race, religion, sex, national origin, age, disability, veteran status or any other consideration made unlawful by applicable federal, state or local laws.