- Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
- Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
- Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
- Implement and optimize AWS native observability services, including:
  - CloudWatch
  - X-Ray
  - CloudTrail
  - OpenSearch / Logging frameworks
- Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
- Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
- Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
- Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
- Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
- Analyze system behavior under stress conditions and recommend architecture improvements
- Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
- Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks
- 7-9 years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
- Strong hands-on experience with:
  - Langfuse for AI/LLM observability
  - AI workflow tracing and telemetry
- Expertise in AWS native observability tools, including:
  - CloudWatch
  - AWS X-Ray
  - CloudTrail
  - AWS monitoring and logging services
- Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
- Strong understanding of:
  - Distributed systems observability
  - Telemetry pipelines
  - Logging, tracing, and metrics collection
- Hands-on experience with Chaos Engineering practices
- Expertise using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
- Knowledge of:
  - Incident management and root cause analysis
  - Reliability engineering and operational best practices
- Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
- Experience with CI/CD pipelines and infrastructure automation
- Strong scripting/programming skills in Python or similar languages
- Strong analytical, troubleshooting, and problem-solving skills
- Experience with:
  - OpenTelemetry
  - Grafana
  - Prometheus
  - ELK/OpenSearch stack
- Familiarity with:
  - AI/LLM platforms and agentic architectures
  - Event-driven and microservices-based systems
- Knowledge of:
  - DevSecOps and cloud security monitoring
  - Performance engineering and load testing
- AWS certifications preferred
- Experience working in highly regulated or enterprise-scale environments
Skills Required
- 7-9 years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
- Hands-on experience with Langfuse for AI/LLM observability
- AI workflow tracing and telemetry implementation experience
- Expertise with AWS native observability tools (CloudWatch, AWS X-Ray, CloudTrail, logging/monitoring services)
- Experience with MCP (Model Context Protocol) agent setups or multi-agent orchestration
- Strong understanding of distributed systems observability, telemetry pipelines, logging, tracing, and metrics
- Hands-on Chaos Engineering experience
- Experience using AWS Fault Injection Simulator (FIS)
- Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
- Experience with CI/CD pipelines and infrastructure automation
- Strong scripting/programming skills in Python or similar languages
- Incident management, root cause analysis, and reliability engineering knowledge
- Experience with OpenTelemetry
- Experience with Grafana and Prometheus
- Experience with ELK/OpenSearch stack
- Familiarity with AI/LLM platforms, agentic architectures, DevSecOps, and performance/load testing
- AWS certifications and experience in regulated or enterprise-scale environments
What We Do
At TechBlocks, we power the software-defined industries (SDI) of today and tomorrow. We are a software engineering and consulting firm. We build modern digital value chains and reimagine businesses to create frictionless experiences, enable innovative monetization methods, and drive unforeseen efficiencies. We are known for building world-class, cloud-native custom platforms and products for some of the world's largest brands. We are the go-to technology partner for born-in-digital businesses that grew with us from "Concept to Commercialization" and have revenues between $100M and $10B. We help modern businesses transition from a pure technology-outsourcing mentality to creating and maturing globally distributed digital COEs. The converged COEs we create in partnership with our clients help power extremely dynamic software factories. We build these modern digital COEs and factories with a single-minded goal: to future-proof our clients' businesses. Everything we do is centred around two philosophies and practices: Design Thinking and Lean Engineering. Whether it is building digital commerce platforms, marketplaces for the world's largest retailers, smart utilities applications and products, or digital health products and platforms that power wearables, patches, or devices across the healthcare landscape, we do it all with speed and sophistication that are unmatched in the industry.