- Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
- Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
- Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
- Implement and optimize AWS native observability services, including:
  - CloudWatch
  - X-Ray
  - CloudTrail
  - OpenSearch / Logging frameworks
- Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
- Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
- Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
- Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
- Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
- Analyze system behavior under stress conditions and recommend architecture improvements
- Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
- Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks
- 12+ years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
- Strong hands-on experience with:
  - Langfuse for AI/LLM observability
  - AI workflow tracing and telemetry
- Expertise in AWS native observability tools, including:
  - CloudWatch
  - AWS X-Ray
  - CloudTrail
  - AWS monitoring and logging services
- Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
- Strong understanding of:
  - Distributed systems observability
  - Telemetry pipelines
  - Logging, tracing, and metrics collection
- Hands-on experience with Chaos Engineering practices
- Expertise using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
- Knowledge of:
  - Incident management and root cause analysis
  - Reliability engineering and operational best practices
- Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
- Experience with CI/CD pipelines and infrastructure automation
- Strong scripting/programming skills in Python or similar languages
- Strong analytical, troubleshooting, and problem-solving skills
Skills Required
- 12+ years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
- Hands-on experience with Langfuse for AI/LLM observability
- Experience with AI workflow tracing and telemetry
- Expertise with AWS native observability tools (CloudWatch, X-Ray, CloudTrail, AWS monitoring/logging)
- Experience with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
- Strong understanding of distributed systems observability, telemetry pipelines, logging, tracing, and metrics
- Hands-on experience with Chaos Engineering practices
- Expertise using AWS Fault Injection Simulator (FIS)
- Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
- Experience with CI/CD pipelines and infrastructure automation
- Strong scripting/programming skills in Python or similar languages
- Knowledge of incident management, root cause analysis, and reliability engineering best practices
What We Do
At TechBlocks, we power the software-defined industries (SDI) of today and tomorrow. We are a software engineering and consulting firm that builds modern digital value chains and reimagined businesses, creating frictionless experiences, enabling innovative monetization methods, and driving new efficiencies. We are known for building world-class, cloud-native custom platforms and products for some of the world's largest brands. We are the go-to technology partner for born-in-digital businesses that grew with us from "Concept to Commercialization" and have revenues between $100M and $10B. We help modern businesses move beyond a technology outsourcing mentality by creating globally distributed digital COEs and maturing them. The converged COEs we create in partnership with our clients power highly dynamic software factories, built with the single-minded goal of future-proofing our clients' businesses. Everything we do is centred on two philosophies and practices: Design Thinking and Lean Engineering. Whether we are building digital commerce platforms and marketplaces for the world's largest retailers, smart utilities applications and products, or digital health products and platforms that power wearables, patches, and devices across the healthcare landscape, we do it all with speed and sophistication that is unmatched in the industry.






