Observability & Chaos Engineering Specialist

TechBlocks

Reposted 15 Days Ago

Hyderabad, Telangana, IND

In-Office

Senior level

Information Technology • Consulting

The Role

Design and implement observability frameworks and telemetry for AI/agent-based, cloud-native systems; configure Langfuse and MCP agent monitoring; deploy AWS observability tools; run chaos experiments with AWS FIS; build dashboards, alerts, and runbooks; collaborate with DevOps, platform, AI, and security teams to improve resilience and troubleshoot incidents.

Summary Generated by Built In

Position Title: Observability & Chaos Engineering Specialist

Experience: 7 - 10 years

Job Location: Remote

Work Mode: Remote

Time Zone/Shift: Starts at 12:30 PM IST

Requirements:

Role: Observability & Chaos Engineering Specialist (Langfuse / AWS / MCP Agents)

Role Overview

We are seeking an experienced Observability & Chaos Engineering Specialist to support monitoring, resilience, and operational excellence initiatives for AI-driven and cloud-native systems. The ideal candidate will have strong expertise in Langfuse, AWS native observability services, MCP agent-based environments, and Chaos Engineering using AWS Fault Injection Simulator (FIS).

The role focuses on building highly observable, resilient, and fault-tolerant distributed systems by implementing advanced monitoring, tracing, logging, and controlled failure testing practices.

Key Responsibilities

- Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
- Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
- Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
- Implement and optimize AWS native observability services, including:
  - CloudWatch
  - AWS X-Ray
  - CloudTrail
  - AWS monitoring and logging services

- Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
- Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
- Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
- Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
- Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
- Analyze system behavior under stress conditions and recommend architecture improvements
- Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
- Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks

Required Skills & Qualifications

- 7-9 years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
- Strong hands-on experience with:
  - Langfuse for AI/LLM observability
  - AI workflow tracing and telemetry
- Expertise in AWS native observability tools, including:
  - CloudWatch
  - AWS X-Ray
  - CloudTrail
  - AWS monitoring and logging services
- Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
- Strong understanding of:
  - Distributed systems observability
  - Telemetry pipelines
  - Logging, tracing, and metrics collection
- Hands-on experience with Chaos Engineering practices
- Expertise using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
- Knowledge of:
  - Incident management and root cause analysis
  - Reliability engineering and operational best practices
- Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
- Experience with CI/CD pipelines and infrastructure automation
- Strong scripting/programming skills in Python or similar languages
- Strong analytical, troubleshooting, and problem-solving skills

Preferred / Nice-to-Have Skills

- Experience with:
  - OpenTelemetry
  - Grafana
  - Prometheus
  - ELK/OpenSearch stack
- Familiarity with:
  - AI/LLM platforms and agentic architectures
  - Event-driven and microservices-based systems
- Knowledge of:
  - DevSecOps and cloud security monitoring
  - Performance engineering and load testing
- AWS certifications
- Experience working in highly regulated or enterprise-scale environments

About Us:

We are a global, cloud-native organization with a strong presence across North America and India, delivering innovative digital transformation solutions to clients across diverse industries such as Financial Services, Healthcare, Retail & E-commerce, Manufacturing, and Technology. Our strong client base includes Fortune 500 enterprises as well as high-growth mid-market and startup organizations, giving our teams exposure to a wide variety of business challenges and cutting-edge solutions.

Our technology practices are built around modern, future-ready capabilities including Cloud Engineering, Data & Analytics, AI/ML, Digital Experience Platforms, Application Modernization, and Enterprise Solutions such as SAP and other leading platforms. We follow a design thinking-led approach combined with agile and lean engineering practices to deliver scalable, high-impact solutions. Backed by globally recognized certifications such as ISO 27001, SOC 1, SOC 2, SOC 3, UK Cyber Essentials Plus, and CMMI Level 3, we ensure the highest standards of security, compliance, process maturity, and quality across all our engagements.

Why Join Us:

- Opportunity to work on global projects and Fortune 500 clients
- Exposure to cutting-edge technologies
- Strong learning, mentorship, and career growth programs
- Collaborative and innovation-driven work culture

If you are passionate about working on innovative technologies and want to be part of a fast-growing organization, we encourage you to apply and be part of our journey.

Company Details:

Website: http://tblocks.com

LinkedIn: https://www.linkedin.com/company/techblocks/about/

Skills Required

7-9 years experience in Observability Engineering, SRE, DevOps, or Platform Engineering
Hands-on experience with Langfuse for AI/LLM observability
AI workflow tracing and telemetry implementation experience
Expertise with AWS native observability tools (CloudWatch, AWS X-Ray, CloudTrail, logging/monitoring services)
Experience with MCP (Model Context Protocol) agent setups or multi-agent orchestration
Strong understanding of distributed systems observability, telemetry pipelines, logging, tracing, and metrics
Hands-on Chaos Engineering experience
Experience using AWS Fault Injection Simulator (FIS)
Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
Experience with CI/CD pipelines and infrastructure automation
Strong scripting/programming skills in Python or similar languages
Incident management, root cause analysis, and reliability engineering knowledge
Experience with OpenTelemetry
Experience with Grafana and Prometheus
Experience with ELK/OpenSearch stack
Familiarity with AI/LLM platforms, agentic architectures, DevSecOps, and performance/load testing
AWS certifications and experience in regulated or enterprise-scale environments

The Company

HQ: Vaughan, Ontario

345 Employees

Year Founded: 2007

What We Do

At TechBlocks, we power the software-defined industries (SDI) of today and tomorrow. We are a software engineering and consulting firm. We build modern digital value chains and reimagined businesses that create frictionless experiences, enable innovative monetization methods, and drive unforeseen efficiencies. We are known for building world-class, cloud-native custom platforms and products for some of the world's largest brands. We are the go-to technology partner for born-in-digital businesses that grew with us from "Concept to Commercialization" and have revenues between $100M and $10B. We help modern businesses move beyond a technology-outsourcing mentality by creating and maturing globally distributed digital COEs. The converged COEs that we create in partnership with our clients help power highly dynamic software factories, built with the single-minded goal of future-proofing our clients' businesses. Everything we do is centred on two philosophies and practices: Design Thinking and Lean Engineering. Whether we are building digital commerce platforms, marketplaces for the world's largest retailers, smart utilities applications and products, or digital health products and platforms that power wearables, patches, and devices across the healthcare landscape, we do it all with speed and sophistication that is unmatched in the industry.