TechBlocks

Architect - Observability and Chaos Engineering

Posted 2 Days Ago

Hyderabad, Telangana, IND

In-Office

Expert/Leader

Information Technology • Consulting

The Role

Design and implement observability frameworks and telemetry for AI/LLM and multi-agent systems. Configure Langfuse, AWS observability services, and MCP agent monitoring. Build logging, tracing, metrics, dashboards, and alerts. Design and run Chaos Engineering experiments with AWS FIS, analyze resilience, and support incident troubleshooting, RCA, and reliability improvements alongside DevOps, Platform, AI, and Security teams.

Summary Generated by Built In

Role: Observability & Chaos Engineering Specialist (Langfuse / AWS / MCP Agents)

Role Overview

We are seeking an experienced Observability & Chaos Engineering Specialist to support monitoring, resilience, and operational excellence initiatives for AI-driven and cloud-native systems. The ideal candidate will have strong expertise in Langfuse, AWS native observability services, MCP agent-based environments, and Chaos Engineering using AWS Fault Injection Simulator (FIS).

The role focuses on building highly observable, resilient, and fault-tolerant distributed systems by implementing advanced monitoring, tracing, logging, and controlled failure testing practices.

Key Responsibilities

Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
Implement and optimize AWS native observability services, including:
- CloudWatch
- X-Ray
- CloudTrail
- OpenSearch / Logging frameworks
Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
Analyze system behavior under stress conditions and recommend architecture improvements
Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks

Required Skills & Qualifications

12+ years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
Strong hands-on experience with:
- Langfuse for AI/LLM observability
- AI workflow tracing and telemetry
Expertise in AWS native observability tools, including:
- CloudWatch
- AWS X-Ray
- CloudTrail
- AWS monitoring and logging services
Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
Strong understanding of:
- Distributed systems observability
- Telemetry pipelines
- Logging, tracing, and metrics collection
Hands-on experience with Chaos Engineering practices
Expertise using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
Knowledge of:
- Incident management and root cause analysis
- Reliability engineering and operational best practices
Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
Experience with CI/CD pipelines and infrastructure automation
Strong scripting/programming skills in Python or similar languages
Strong analytical, troubleshooting, and problem-solving skills

Skills Required

12+ years experience in Observability Engineering, SRE, DevOps, or Platform Engineering
Hands-on experience with Langfuse for AI/LLM observability
Experience with AI workflow tracing and telemetry
Expertise with AWS native observability tools (CloudWatch, X-Ray, CloudTrail, AWS monitoring/logging)
Experience with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
Strong understanding of distributed systems observability, telemetry pipelines, logging, tracing, and metrics
Hands-on experience with Chaos Engineering practices
Expertise using AWS Fault Injection Simulator (FIS)
Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
Experience with CI/CD pipelines and infrastructure automation
Strong scripting/programming skills in Python or similar languages
Knowledge of incident management, root cause analysis, and reliability engineering best practices

The Company

Ahmedabad, Gujarat

345 Employees

Year Founded: 2007

What We Do

At TechBlocks, we power the software-defined industries (SDI) of today and tomorrow. We are a software engineering and consulting firm. We build modern digital value chains and reimagined businesses that create frictionless experiences, enable innovative monetization methods, and drive unforeseen efficiencies. We are known for building world-class, cloud-native custom platforms and products for some of the world's largest brands. We are the go-to technology partner for born-in-digital businesses that grew with us from "Concept to Commercialization" and have revenues between $100M and $10B. We help modern businesses move beyond a technology outsourcing mentality to create and mature globally distributed digital Centers of Excellence (COEs). The converged COEs we create in partnership with our clients help power extremely dynamic software factories, built with the single-minded goal of future-proofing our clients' businesses. Everything we do is centred around two philosophies and practices: Design Thinking and Lean Engineering. Whether it is building digital commerce platforms, marketplaces for the world's largest retailers, smart utilities applications and products, or digital health products and platforms that power wearables, patches, or devices across the healthcare landscape, we do it all with speed and sophistication that is unmatched in the industry.