Site Reliability Engineer (Raleigh, NC) Duties: Work closely with leadership and internal partners to ensure that software meets security, SLA, performance, and capacity requirements. Set up and maintain monitoring tools and systems to detect issues using Datadog Monitors and send alerts using OpsGenie. Configure Datadog and Grafana alerts and Application Health Monitors to notify the team when anomalies or problems occur. Work closely with other Site Reliability Engineers, DevOps Engineers, and System Administrators to achieve common goals. Analyze system performance data using Snowflake to plan for capacity upgrades or optimizations. Ensure the system can handle expected growth in traffic and data by using these tools to track application lag and behavior. Manage Kubernetes clusters and OpenShift environments for deploying and scaling containerized applications. Implement and manage infrastructure using Ansible and maintain version-controlled infrastructure code using GitLab for consistency and repeatability. Use Terraform and Ansible scripts to define and provision infrastructure resources in a repeatable and automated manner. Create and maintain Ansible playbooks to automate routine tasks, configurations, and deployments. Use GitHub Actions for CI/CD activities to continuously build and deploy code, and implement CI/CD pipelines to streamline application updates. Build and maintain deployment pipelines using Ansible playbooks, ensure smooth and reliable deployments and rollback procedures, and create production releases using ServiceNow for record tracking. Maintain detailed documentation on system architecture, configurations, and processes using Confluence, and share knowledge and best practices with team members. Plan resource allocation using Red Hat OpenShift, including servers, storage, and network capacity, following the Kubernetes architecture to ensure the system is equipped to handle traffic spikes and growth.
Develop and test disaster recovery plans to ensure data and service availability in case of major failures or disasters, building the supporting tools in Go. Work closely with development teams to promote a DevOps culture and ensure reliability is built into software from the start by following best practices. Collaborate with other Site Reliability Engineers on a weekly basis to share knowledge, solve complex problems, and touch base on all open points. Monitor and manage cloud resource costs in AWS to optimize spending while maintaining performance.
Required: Master’s degree or foreign equivalent in Computer Science, Electrical Engineering, or related field of study plus 2 years of experience in the job offered or related position. Must have 2 years of experience with: Infrastructure and networking concepts including virtualization, load balancing, and DNS. At least one of the following cloud infrastructure technologies: AWS, Google Cloud, Azure. REST APIs using one or more of the following (JSON, XML, YAML). Designing, building, and operating large-scale production systems. Continuous Integration and Continuous Deployment (CI/CD) concepts and technologies using one or more of the following (Jenkins, GHA, Circle). Containerization technologies (Docker, Docker Compose, Docker Swarm, Kubernetes). Configuration and management techniques in large distributed environments. Monitoring and observability techniques with one or more of the following tools: Datadog, Sensu, New Relic, Nagios. General use of open-source databases: MySQL, Postgres, Redis, Cassandra. Unix/Linux administration, troubleshooting, and shell scripting. One or more of the following programming languages: Python, Java, Go, Rust, or similar. Source control (Git, GitHub) and feature branching strategies. Automating infrastructure, testing, and deployment using tools such as Ansible, Chef, or Terraform. Infrastructure as Code paradigm.
In the alternative, will accept a Bachelor’s degree or foreign equivalent in Computer Science, Electrical Engineering, or related field of study plus 5 years of experience in the job offered or related position. Must have 2 years of experience with: Infrastructure and networking concepts including virtualization, load balancing, and DNS. At least one of the following cloud infrastructure technologies: AWS, Google Cloud, Azure. REST APIs using one or more of the following (JSON, XML, YAML). Designing, building, and operating large-scale production systems. Continuous Integration and Continuous Deployment (CI/CD) concepts and technologies using one or more of the following (Jenkins, GHA, Circle). Containerization technologies (Docker, Docker Compose, Docker Swarm, Kubernetes). Configuration and management techniques in large distributed environments. Monitoring and observability techniques with one or more of the following tools: Datadog, Sensu, New Relic, Nagios. General use of open-source databases: MySQL, Postgres, Redis, Cassandra. Unix/Linux administration, troubleshooting, and shell scripting. One or more of the following programming languages: Python, Java, Go, Rust, or similar. Source control (Git, GitHub) and feature branching strategies. Automating infrastructure, testing, and deployment using tools such as Ansible, Chef, or Terraform. Infrastructure as Code paradigm.
Submit resumes to: Bandwidth, Inc, 2230 Bandmate Way, Raleigh, NC 27607, Attn: Kellie Sigmon, Sr. Manager People Services or apply at www.bandwidth.com/careers/openings/. Must reference “Site Reliability Engineer” when applying.
What We Do
Bandwidth is a software company that’s transforming the way people communicate and challenging the standards of old telecom. Together with our customers, we’re unlocking remarkable value, questioning the status quo, and helping people interact with technology and one another, oftentimes in ways they never dreamed possible.
Haven’t heard of Bandwidth? You’ve probably used one of our products before. We power some of the most important communications technologies on the market today, including names like Google, Skype, and RingCentral. At Bandwidth, we’ve got a passion for doing things the other way: imagining what they could be and uncovering opportunities to take a new approach to create what should be. We’re out to disrupt the century-old rules of the telecom industry, and that means doing things differently in every area of our business. It’s in the way we treat our people, and how we create with our customers. Whether our engineering teams are crunching code during all-night hack-a-thons or our team members are competing in a Big Idea competition, we love to dive in and get our hands dirty. No idea at Bandwidth is too big or too small, and every voice gets a listen.
Our folks have diverse backgrounds from all over the world. Crave innovation? We live for it. No one here is getting a blue ribbon for nodding their head and agreeing with the higher-ups. At Bandwidth, we don’t hold back. We speak our minds, share our ideas, and let them soar. We think, we dream, we reimagine what’s possible…because we know that’s the only way we can unlock remarkable value. And that’s what really gets us fired up.