Architect - Observability and Chaos Engineering

Posted 2 Days Ago
Hyderabad, Telangana, IND
In-Office
Expert/Leader
Information Technology • Consulting
The Role
Design and implement observability frameworks and telemetry for AI/LLM and multi-agent systems. Configure Langfuse, AWS observability services, and MCP agent monitoring. Build logging, tracing, metrics, dashboards, and alerts. Design and run Chaos Engineering experiments with AWS FIS, analyze resilience, and support incident troubleshooting, RCA, and reliability improvements alongside DevOps, Platform, AI, and Security teams.
Role: Observability & Chaos Engineering Specialist (Langfuse / AWS / MCP Agents)
Role Overview
We are seeking an experienced Observability & Chaos Engineering Specialist to support monitoring, resilience, and operational excellence initiatives for AI-driven and cloud-native systems. The ideal candidate will have strong expertise in Langfuse, AWS native observability services, MCP agent-based environments, and Chaos Engineering using AWS Fault Injection Service (FIS, formerly Fault Injection Simulator).
The role focuses on building highly observable, resilient, and fault-tolerant distributed systems by implementing advanced monitoring, tracing, logging, and controlled failure testing practices.
Key Responsibilities
  • Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
  • Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
  • Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
  • Implement and optimize AWS native observability services, including:
    • CloudWatch
    • X-Ray
    • CloudTrail
    • OpenSearch / Logging frameworks
  • Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
  • Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
  • Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
  • Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
  • Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
  • Analyze system behavior under stress conditions and recommend architecture improvements
  • Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
  • Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks
Required Skills & Qualifications
  • 12+ years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
  • Strong hands-on experience with:
    • Langfuse for AI/LLM observability
    • AI workflow tracing and telemetry
  • Expertise in AWS native observability tools, including:
    • CloudWatch
    • AWS X-Ray
    • CloudTrail
    • AWS monitoring and logging services
  • Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
  • Strong understanding of:
    • Distributed systems observability
    • Telemetry pipelines
    • Logging, tracing, and metrics collection
  • Hands-on experience with Chaos Engineering practices
  • Expertise using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
  • Knowledge of:
    • Incident management and root cause analysis
    • Reliability engineering and operational best practices
  • Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
  • Experience with CI/CD pipelines and infrastructure automation
  • Strong scripting/programming skills in Python or similar languages
  • Strong analytical, troubleshooting, and problem-solving skills

Skills Required

  • 12+ years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
  • Hands-on experience with Langfuse for AI/LLM observability
  • Experience with AI workflow tracing and telemetry
  • Expertise with AWS native observability tools (CloudWatch, X-Ray, CloudTrail, AWS monitoring/logging)
  • Experience with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
  • Strong understanding of distributed systems observability, telemetry pipelines, logging, tracing, and metrics
  • Hands-on experience with Chaos Engineering practices
  • Expertise using AWS Fault Injection Simulator (FIS)
  • Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
  • Experience with CI/CD pipelines and infrastructure automation
  • Strong scripting/programming skills in Python or similar languages
  • Knowledge of incident management, root cause analysis, and reliability engineering best practices

The Company
Ahmedabad, Gujarat
345 Employees
Year Founded: 2007

What We Do

At TechBlocks, we power the software-defined industries (SDI) of today and tomorrow. We are a software engineering and consulting firm. We build modern digital value chains and reimagine businesses to create frictionless experiences, enable innovative monetization methods, and drive unforeseen efficiencies. We are known for building world-class, cloud-native custom platforms and products for some of the world's largest brands. We are the go-to technology partner for born-digital businesses that grew with us from "Concept to Commercialization" and have revenues between $100M and $10B. We help modern businesses move beyond a technology outsourcing mentality and create and mature globally distributed digital COEs. The converged COEs we create in partnership with our clients help power extremely dynamic software factories, built with the single-minded goal of future-proofing our clients' businesses. Everything we do is centred around two philosophies and practices: Design Thinking and Lean Engineering. Whether it is building digital commerce platforms, marketplaces for the world's largest retailers, smart utilities applications and products, or digital health products and platforms that power wearables, patches, or devices across the healthcare landscape, we do it all with speed and sophistication that is unmatched in the industry.
