Lead Site Reliability Engineer - Observability

Hyderabad, Telangana, IND
In-Office
Senior level
Software
The Role
Lead SRE for observability: design and unify telemetry (metrics, traces, logs) across Azure environments, enable OpenTelemetry, implement IaC, SLOs, incident response, automation, and support AI/LLM telemetry and synthetic monitoring.

WHAT MAKES US, US

Join some of the most innovative thinkers in FinTech as we lead the evolution of financial technology. If you are an innovative, curious, collaborative person who embraces challenges and wants to grow, learn and pursue outcomes with our prestigious financial clients, say Hello to SimCorp!
 

At its foundation, SimCorp is guided by our values – caring, customer success-driven, collaborative, curious, and courageous. Our people-centered organization focuses on skills development, relationship building, and client success. We take pride in cultivating an environment where all team members can grow, feel heard, valued, and empowered.

If you like what we’re saying, keep reading!

WHY THIS ROLE IS IMPORTANT TO US
SimCorp’s Observability strategy is to deliver a consistent and coherent Observability approach across the full SimCorp One ecosystem, spanning the different technology stacks, products, and services across the organization. This is required so that we can observe SimCorp products and services seamlessly and investigate emerging problems efficiently, which lets us provide high-quality software to our clients and stay within agreed resolution times. The approach also provides insight into KPIs, SLOs, SLAs, and cost attribution.

As a Lead Site Reliability Engineer – Observability, you will blend site reliability engineering principles with deep telemetry expertise to ensure system visibility, uptime, and performance. You must have in-depth knowledge of and expertise in telemetry data collection, analysis, and implementation, and fully understand the intricacies of different telemetry sources such as metrics, traces, logs, and events, including how to derive meaningful insights from them.
You will work closely with product management, architects, and engineering teams to establish unified visibility across the full stack, from LLM‑driven agents to backend services. You won’t just monitor systems; you’ll define the patterns and tools that are a core part of empowering and driving SimCorp’s engineering culture. Your contributions will drive stability, continuous improvement, and operational excellence in our Azure-based environments. This role blends hands-on engineering, incident response, platform configuration, and service quality, guided by ITIL and SRE best practices.
WHAT YOU WILL BE RESPONSIBLE FOR

  • Support the operation and enhancement of mission-critical environments for both new and existing Cloud Native products & services.

  • Deploy and manage instrumentation for applications to gain granular insights into service health.

  • Assist engineering teams in implementing and maintaining metrics, logs, and traces for applications & infrastructure

  • Unify observability tooling across teams, ensuring metrics, logs, and traces flow into a central platform (e.g., Application Insights or equivalent).

  • Enable and configure OpenTelemetry-based data collection within Azure Monitor Application Insights by leveraging the Azure Monitor OpenTelemetry Distro.

  • Make sure AI agent frameworks adopt the relevant semantic conventions to ensure interoperability and consistency in observability data.

  • Work with product development teams to enable structured logging, basic distributed tracing, and core metrics.

  • Support incident response by gathering logs, metrics, and traces to perform root cause analysis using observability tools.

  • Build tools and automation to eliminate toil and improve engineering velocity, developer experience, and system reliability.

  • Define and manage SLOs and error budgets in partnership with Engineering teams.

  • Work regular and evening shifts on a rotational basis, and provide weekend or on-call support as needed.

  • Collaborate with Agile teams and take part in design discussions with clients, vendors, and stakeholders. 

  • Contribute to knowledge sharing across multiple Product Areas. 

  • Leverage a strong foundation in ITIL practices, including problem, change, and incident management

WHAT WE VALUE

  • Bachelor’s degree in Computer Science or related field (Master’s is a plus)

  • 5+ years of experience in Site Reliability, Observability, DevOps, or Cloud Engineering roles

  • Must have expertise with Microsoft Azure Cloud.

  • Must have experience working with observability frameworks like OpenTelemetry and distributed tracing systems

  • Expertise in Infrastructure as Code (IaC) using Bicep, ARM and Terraform.

  • Strong understanding of instrumenting, tracing, and correlating AI/LLM workflows with infrastructure telemetry.

  • Solid experience in monitoring and logging tools (Azure Monitor, Application Insights, DataDog, Log Analytics).

  • Knowledge of AI/ML-based anomaly detection, log aggregation and analysis tools like Microsoft Azure Anomaly Detector or equivalent.

  • Experience with Agentic/LLM‑based systems (like LangChain, Celery, OpenAI APIs, orchestration frameworks)

  • Experience working with application reliability platforms like Checkly or equivalent

  • Experience setting up synthetic monitoring using Playwright or equivalent

  • Solid understanding of networking and containerization (Kubernetes, Docker)

  • Good understanding of APIs, scripting and query languages like PowerShell, Bash, and Kusto, and databases like SQL, Cosmos DB, and PostgreSQL

  • Familiarity with SimCorp Dimension and Salesforce is a plus

  • Proficiency in IT service management (ITSM) frameworks like ITIL, focusing on incident, change, and problem management to improve operational efficiency

  • Experience managing both onboarding projects and live production operations

  • Collaborative mindset and ability to work in cross-functional teams

  • Interest in continuous learning and growth within our Product Area

Benefits

  • Global hybrid work policy - We ask you to work 2 days a week from the office. If you choose, you can work remotely the other days. Of course, you are welcome at the office if that is your preference.

  • Culture – Inclusive and diverse company culture 

  • Work-life balance – We believe that a balance between professional and personal responsibilities makes us all the best version of ourselves, both in private life and as colleagues in the workplace

  • Empowerment – We believe that all voices are valuable and must be heard. You will be involved in shaping our work processes

  • Career & Growth – SimCorp offers opportunities for professional development. There is never just one route, so we offer an individual approach to support the direction you want to take.

NEXT STEPS

Please send us your application in English via our career site as soon as possible; we process incoming applications continually. Please note that only applications sent through our system will be processed. At SimCorp, we recognize that bias can unintentionally occur in the recruitment process. To uphold fairness and equal opportunities for all applicants, we kindly ask you to exclude personal data such as photos, age, or any non-professional information from your application. Thank you for aiding us in our endeavor to mitigate biases in our recruitment process.

For any questions, you are welcome to contact Shweta Goyal ([email protected]), Talent Acquisition Partner. If you are interested in being a part of SimCorp but are not sure this role is suitable, submit your CV anyway. SimCorp is on an exciting growth journey, and our Talent Acquisition Team is ready to help you discover the right role for you. The approximate time to consider your CV is three weeks.
We are eager to continually improve our talent acquisition process and make everyone’s experience positive and valuable. Therefore, during the process we will ask you to provide your feedback, which is highly appreciated.

WHO WE ARE

For over 50 years, we have worked closely with investment and asset managers to become the world’s leading provider of integrated investment management solutions. We are 3,000+ colleagues with a broad range of nationalities, education, professional experiences, ages, and backgrounds.
SimCorp is an independent subsidiary of the Deutsche Börse Group. Following the recent merger with Axioma, we leverage the combined strength of our brands to provide an industry-leading, full, front-to-back offering for our clients.

SimCorp is an equal opportunity employer and welcomes applicants from all backgrounds, without regard to race, gender, age, disability, or any other protected status under applicable law. We are committed to building a culture where diverse perspectives and expertise are integrated into our everyday work. We believe in the continual growth and development of our employees, so that we can provide best-in-class solutions to our clients.


Skills Required

  • Bachelor's degree in Computer Science or related field
  • 5+ years of experience in Site Reliability, Observability, DevOps, or Cloud Engineering roles
  • Expertise with Microsoft Azure Cloud
  • Experience with observability frameworks like OpenTelemetry and distributed tracing systems
  • Expertise in Infrastructure as Code using Bicep, ARM and Terraform
  • Strong understanding of instrumenting, tracing, and correlating AI/LLM workflows with infrastructure telemetry
  • Experience with monitoring and logging tools (Azure Monitor, Application Insights, DataDog, Log Analytics)
  • Knowledge of AI/ML-based anomaly detection tools (Microsoft Azure Anomaly Detector or equivalent)
  • Experience with Agentic/LLM-based systems (LangChain, Celery, OpenAI APIs, orchestration frameworks)
  • Experience with application reliability platforms like Checkly or equivalent
  • Experience setting up synthetic monitoring using Playwright or equivalent
  • Solid understanding of networking and containerization (Kubernetes, Docker)
  • Good understanding of APIs and scripting and query languages like PowerShell, Bash, and Kusto
  • Experience with databases such as SQL, Cosmos DB, and PostgreSQL
  • Proficiency in IT service management frameworks like ITIL (incident, change, problem management)
  • Experience managing both onboarding projects and live production operations
  • Master's degree (plus)
  • Familiarity with SimCorp Dimension and Salesforce

The Company
HQ: Copenhagen
3,062 Employees
Year Founded: 1971

What We Do

SimCorp is a provider of industry-leading integrated investment management solutions for the global buy side. Founded in 1971, with more than 3,000 employees across five continents, we are a truly global technology leader who empowers 40 of the world’s top 100 financial companies through our integrated platform, services, and partner ecosystem. SimCorp is a subsidiary of Deutsche Boerse Group. For more information, see www.simcorp.com.
