Site Reliability Engineer

Posted 2 Hours Ago
Hyderabad, Telangana, IND
In-Office
Senior level
Artificial Intelligence • Cloud • Fintech • Machine Learning • Software • Financial Services • Automation
Leading provider of Autonomous Finance Solutions
The Role
The Site Reliability Engineer will manage cloud infrastructure, enhance service reliability, automate operations, respond to incidents, and collaborate with teams on cloud solutions.

Job Summary:

We are looking for a highly skilled and adaptable Site Reliability Engineer with 6+ years of experience to become a key member of our Cloud Engineering team. In this role, you will design and refine our cloud infrastructure with a strong focus on reliability, security, and scalability. As an SRE, you'll apply software engineering principles to solve operational challenges and keep our systems resilient and stable. This position blends managing live production environments with engineering work such as automation and system improvements.

Key Responsibilities:

Cloud Infrastructure Architecture and Management: Design, build, and maintain resilient cloud infrastructure solutions to support the development and deployment of scalable and reliable applications. This includes managing and optimizing cloud platforms for high availability, performance, and cost efficiency.
Enhancing Service Reliability: Lead reliability best practices by establishing and managing monitoring and alerting systems to proactively detect and respond to anomalies and performance issues. Utilize SLI, SLO, and SLA concepts to measure and improve reliability. Identify and resolve potential bottlenecks and areas for enhancement.
Driving Automation and Efficiency: Contribute to the automation, provisioning, and standardization of infrastructure resources and system configurations. Identify and implement automation for repetitive tasks to significantly reduce operational overhead. Develop Standard Operating Procedures (SOPs) and automate workflows using tools like Rundeck or Jenkins.
Incident Response and Resolution: Participate in and help resolve major incidents, conduct thorough root cause analyses, and implement permanent solutions. Effectively manage incidents within the production environment using a systematic problem-solving approach.
Collaboration and Innovation: Work closely with diverse stakeholders and cross-functional teams, including software engineers, to integrate cloud solutions, gather requirements, and execute Proofs of Concept (POCs). Foster strong collaboration and communication. Guide designs and processes with a focus on resilience and minimizing manual effort. Promote the adoption of common tooling and components, and implement software and tools to enhance resilience and automate operations. Be open to adopting new tools and approaches as needed.

Required Skills and Experience:

Cloud Platforms: Demonstrated expertise in at least one major cloud platform (AWS, Azure, or GCP).
Containerization Tools: Extensive experience with containerization (Docker) and orchestration (Kubernetes) technologies.
Automation & IaC: Proficiency in programming languages (Golang or Python). Experience with configuration management tools (Ansible or Puppet). Must have exposure to Infrastructure as Code (IaC) tools (Terraform or CloudFormation).
Monitoring & Observability: Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack). Hands-on experience implementing OpenTelemetry for observability. Familiarity with monitoring and logging tools for cloud-based applications.
Service Reliability Concepts: A strong understanding of SLI, SLO, SLA, and error budgeting.
Soft Skills & Mindset: Excellent communication and interpersonal skills for effective teamwork. We value proactive individuals who are eager to learn and adapt in a dynamic environment. Must possess a pragmatic and adaptable mindset, with a willingness to step outside comfort zones and acquire new skills. Ability to consider the broader system impact of your work. Must be a change advocate for reliability.

Skills Required

  • Expertise in at least one major cloud platform (AWS, Azure, or GCP)
  • Extensive experience with Docker and Kubernetes
  • Proficiency in Golang or Python
  • Experience with Ansible or Puppet
  • Exposure to Terraform or CloudFormation
  • Experience with Prometheus, Grafana, or ELK stack
  • Strong understanding of SLIs, SLOs, and SLAs
  • Excellent communication and interpersonal skills

The Company
HQ: Houston, Texas
3,000 Employees
Year Founded: 2006

What We Do

HighRadius leverages AI to help enterprise businesses across different industries automate financial processes like accounts receivable, treasury, and accounting, allowing them to operate more efficiently. By using our Autonomous Finance Platform, companies can reduce manual work, speed up payment collection, and improve overall finance operations, so they can focus on growth rather than being bogged down by paperwork and manual processes.

Why Work With Us

We don’t just lead the market, we help define it. HighRadians use their curiosity, grit, and humility to take on new challenges and solve big problems for some of the world’s largest companies. Not only that, career growth and cool perks are the byproducts of running a successful business.
