Key Responsibilities
- Conduct assessments of existing observability architectures to identify gaps and improvement opportunities.
- Design and implement scalable log aggregation pipelines for centralized and efficient data collection.
- Apply noise-reduction techniques to filter irrelevant or false-positive alerts, enhancing focus on actionable issues.
- Develop and maintain monitoring dashboards that deliver actionable insights across applications and infrastructure.
- Lead the migration from Lightstep to Honeycomb, ensuring seamless data pipeline transitions, OpenTelemetry alignment, and stakeholder adoption.
- Collaborate with infrastructure and product teams to integrate observability tooling into CI/CD workflows and cloud environments.
- Analyze telemetry data (metrics, logs, traces) to troubleshoot complex system behaviors and recommend improvements.
- Participate in production debugging and incident troubleshooting using telemetry data.
- Mentor junior engineers on log management, event correlation, distributed tracing, and alert management.
- Stay current on observability innovations and recommend adoption strategies aligned with organizational goals.
- Support post-incident reviews and continuous improvement through data-driven root cause analysis.
- Drive continuous improvement in reliability and operational excellence through proactive observability initiatives.
Key Skills
- 6+ years of experience in software or systems engineering, with at least 3 years focused on observability or SRE practices.
- Hands-on experience with observability tools such as Honeycomb, VictoriaMetrics, Lightstep, Prometheus, Grafana, OpenTelemetry, Splunk, Datadog, or New Relic.
- Strong knowledge of OpenTelemetry instrumentation (metrics, traces, logs) and SLIs/SLOs for reliability tracking.
- Experience with distributed tracing, event correlation, and noise reduction frameworks.
- Proficiency in one or more programming/scripting languages such as Python, Java, Kotlin, Go, or Shell.
- Working knowledge of Infrastructure as Code (Terraform) and CI/CD pipelines (Jenkins, GitHub Actions, etc.).
- Familiarity with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Strong analytical, troubleshooting, and communication skills with the ability to work effectively across teams.
- Experience conducting observability gap assessments and defining improvement plans.
- Experience working in complex or multi-cloud environments is preferred.
Cognite Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Cognite. It has not been reviewed or approved by Cognite.
- Affordable Benefits: Healthcare premiums for employees and dependents are often fully covered, reducing out‑of‑pocket costs. Feedback suggests this makes the total rewards feel competitive in key markets.
- Parental & Family Support: Paid parental leave for primary and secondary caregivers is described as generous. This signals a family‑friendly approach that many consider a standout perk.
- Leave & Time Off Breadth: Unlimited PTO with a company‑wide year‑end shutdown offers ample time away from work. Flexible time‑off options are consistently highlighted across U.S. roles.
What We Do
Cognite is an AI company that delivers industrial software to improve the production efficiency of Energy, Process Manufacturing, and other industrial companies. We deliver an Industrial DataOps platform that liberates siloed data and empowers our customers to solve some of their most complex business problems with AI-powered solutions. The typical solutions we enable drive innovative new ways to approach Data Exploration, Digital Operator Rounds, Production Optimization, Turnaround Planning, and Root Cause Analysis. We do this by automating and scaling industrial data contextualization of various sources (such as time series, engineering diagrams, equipment logs, maintenance records, 3D facility models, images, large point clouds, and more). We use AI and other tools to find and map the meaningful relationships between the data across these various sources. In addition, we provide intuitive tools that enable efficient use of analytics and automated workflows, as well as prebuilt AI capabilities and a low-code industrial agent builder, Cognite Atlas AI, that enables AI to carry out more complex operations with greater accuracy.
Why Work With Us
Employees at Cognite are pushing the envelope with the latest cloud technology, scaling industrial applications across hundreds of assets, revolutionizing industrial data models, and working with robotics. Cogniters are fast, creative, and resilient. We keep the energy high and fun, learning from our mistakes and celebrating our victories together.