Manager Infrastructure Services

Chennai, Tamil Nadu, IND
In-Office
Senior level
Information Technology
The Role
Lead infrastructure reliability engineering across AI, Analytics, and Automation services. Develop observability systems, automate incident responses, and mentor a high-performing team.
Summary Generated by Built In

Job Description:

Site Reliability Engineering Manager - AI, Analytics & Automation Services

We are seeking an experienced Site Reliability Engineering (SRE) Manager to lead our reliability engineering efforts across our AI, Analytics, and Automation services portfolio. You will spearhead the development of comprehensive observability pipelines, intelligent monitoring systems, and automated resolution frameworks while building and managing a high-performing team of SREs. This role is critical to ensuring the reliability, scalability, and performance of our AI solutions, Databricks analytics platform, and UiPath automation infrastructure.

Key Responsibilities

Observability & Monitoring Excellence

- Design and implement end-to-end observability pipelines spanning AI solutions, data processing workflows, and automation execution environments

- Establish comprehensive monitoring strategies for AI model performance, drift detection, data quality, and service health across Databricks and UiPath platforms

- Build real-time dashboards and alerting systems that provide actionable insights into system performance, resource utilization, and service reliability

- Develop custom metrics and KPIs specific to AI/ML workloads, including model accuracy, latency, throughput, and resource consumption

- Implement distributed tracing and logging solutions to enable rapid troubleshooting across complex AI and automation pipelines

Automated Resolution & Self-Healing Systems

- Architect and deploy automated incident response systems that can detect, diagnose, and resolve common reliability issues without human intervention

- Build intelligent event-triggered runbook automation

- Implement chaos engineering practices to proactively identify and strengthen system weaknesses

- Develop automated remediation workflows for infrastructure issues, service degradations, and capacity constraints

- Create self-healing mechanisms for AI inference services, data pipeline failures, and automation workflow interruptions

Team Leadership & Development

- Build, mentor, and lead a team of Site Reliability Engineers with expertise in AI/ML operations, data platforms, and automation technologies

- Establish SRE best practices, standards, and processes tailored to AI and automation workloads

- Foster a culture of reliability engineering, continuous improvement, and data-driven decision-making

- Conduct regular performance reviews, career development discussions, and technical skill assessments

- Collaborate with engineering teams to embed reliability principles into the software development lifecycle

Platform Reliability & Performance

- Ensure near-zero downtime and optimal performance of AI solutions, Databricks analytics workloads, and UiPath automation processes

- Design and implement disaster recovery and business continuity plans for critical AI and automation services

- Optimize resource allocation and cost management across cloud infrastructure supporting AI, analytics, and automation workloads

- Establish and maintain service level objectives (SLOs) and error budgets for all managed services

- Drive capacity planning initiatives to support growing AI model deployment and automation scale requirements

Cross-Functional Collaboration

- Partner with AI/ML developers to integrate reliability considerations into AI solutions and deployment pipelines

- Work closely with data engineering teams to ensure robust, monitored data flows within Databricks environments

- Collaborate with automation developers to build resilient UiPath bot deployment and execution frameworks

- Interface with security teams to implement observability solutions that maintain compliance and data protection standards

Required Qualifications

- 7+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure roles

- 2+ years of management experience leading technical teams

- Hands-on experience with observability tools such as Prometheus, Grafana, ELK Stack, Datadog, or New Relic

- Proficiency in Infrastructure as Code (Terraform, CloudFormation, Ansible)

- Strong scripting and automation skills (Python, Go, Bash, PowerShell)

- Familiarity with Databricks platform administration, cluster management, and workflow orchestration

- Knowledge of UiPath platform architecture, orchestrator management, and bot deployment strategies

- Understanding of data pipeline monitoring, data quality validation, and ETL/ELT process reliability

- Experience with ML model monitoring, A/B testing infrastructure, and feature store management

- Proven track record of building and scaling high-performing engineering teams

- Strong analytical and problem-solving skills, with the ability to troubleshoot complex distributed systems

- Excellent communication skills, with the ability to present technical concepts to executive stakeholders

- Experience driving cross-functional initiatives and influencing without direct authority

- Demonstrated ability to balance operational excellence with strategic innovation

At DXC Technology, we believe strong connections and community are key to our success. Our work model prioritizes in-person collaboration while offering flexibility to support wellbeing, productivity, individual work styles, and life circumstances. We’re committed to fostering an inclusive environment where everyone can thrive.

Recruitment fraud is a scheme in which fictitious job opportunities are offered to job seekers, typically through online services such as false websites or through unsolicited emails claiming to be from the company. These emails may ask recipients to provide personal information or to make payments as part of an illegitimate recruiting process. DXC does not make offers of employment via social media networks, and DXC never asks applicants for money or payments at any point in the recruitment process, nor does it ask job seekers to purchase IT or other equipment on its behalf. More information on employment scams is available here.

DXC Technology Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about DXC Technology and has not been reviewed or approved by DXC Technology.

  • Healthcare Strength: Health coverage includes multiple national carrier options and plan types, with HSA eligibility where applicable. Feedback suggests the medical, dental, and vision lineup is broad and comparable to large-firm offerings.
  • Retirement Support: A 401(k) program with employer matching and an annual true-up is available, with standard vesting provisions. This structure can help employees capture matching contributions over the year if contribution rates vary.
  • Leave & Time Off Breadth: Flexible or “unlimited” vacation is offered for many U.S. roles instead of accrual-based PTO. Feedback suggests the approach can support work-life balance when team norms allow adequate time away.

The Company
HQ: Buenos Aires, Buenos Aires
86,261 Employees
Year Founded: 2017

What We Do

DXC Technology is a Fortune 500 global IT services leader. Our more than 130,000 people in 70-plus countries are entrusted by our customers to deliver what matters most. We use the power of technology to deliver mission critical IT services across the Enterprise Technology Stack to drive business impact. DXC is an employer of choice with strong values, and fosters a culture of inclusion, belonging and corporate citizenship. We are DXC.
