Rossum

Senior DevOps / SRE Engineer

Reposted 25 Days Ago

Prague, CZE

Hybrid

Senior level

Artificial Intelligence • Machine Learning • Software

The Role

Lead SRE initiatives to improve reliability, scalability, and performance across distributed systems. Build automation and infrastructure, enhance observability and incident response, tune platform performance and costs, resolve complex production issues, and document runbooks. Collaborate with Platform, Security, AI Platform, and Product teams to embed SRE best practices and drive durable improvements.

Summary Generated by Built In

We are looking for a Senior Site Reliability Engineer to join our Site Reliability Engineering (SRE) team. In this role, you'll drive the reliability, scalability, and performance of our platform, ensuring our systems remain stable as we grow. We value innovation and are seeking someone eager to bring fresh ideas – especially around building automation that reduces manual effort and improving distributed systems resilience.

This isn't a top-down organization; our engineers are the ones who flag technical challenges and design the solutions. You will collaborate closely with Platform Engineering, Security, AI Platform, and Product teams to design durable systems and make data-driven operational decisions.

What You'll Do

Collaborate with Engineering, Platform, and Security teams to embed SRE best practices early in system design.
Lead advancements in observability, monitoring, alerting, and incident-response workflows.
Analyze platform performance to contribute to cost-optimization, performance tuning, and resilience planning.
Build infrastructure and automation tooling that improves platform reliability and enhances deployment safety.
Diagnose and resolve complex production issues across distributed systems, and drive open post-incident reviews so failures translate into durable improvements.
Strengthen system consistency and author clear, concise documentation for runbooks and operational processes.

Who You Are

4+ years of experience in SRE, DevOps, platform engineering, or similar production-facing roles.
Strong problem-solving and debugging skills in distributed systems, applied to keeping the platform stable.
Eager to share operational guidelines, champion SRE practices across teams, and openly discuss what we can learn from system failures.
Excellent communication skills (English is our default language) with a genuine, collaborative approach to working across diverse engineering teams.
Strong hands-on experience with cloud environments (AWS, GCP, or similar) and proficiency with infrastructure-as-code and CI/CD pipelines.
Familiarity with Kubernetes (or container orchestration), event-driven architectures, or supporting ML/AI workloads and GPU infrastructure.

What Success Looks Like:

Within 3 Months:

Fully onboarded into the Rossum ecosystem, gaining a deep understanding of our infrastructure, observability stack, and SRE processes while building relationships across the team.
Familiar with how Rossum and Coupa work together and with our shared roadmap.
Initial Impact Goal: Improve a small reliability issue or add value to an existing automation or monitoring area.

Within 6 Months:

Independently managing key responsibilities, owning recurring reliability tasks, and identifying areas for strategic improvement.
Actively participating in the alignment of processes within the new Coupa organizational structure.
Operational KPI: Implement measurable enhancements to alert quality, CI/CD reliability, or service health metrics.

Within 12 Months:

Recognized as a subject matter expert within the team, navigating the global Coupa ecosystem.
Successfully contributing to Rossum's mission at a massive scale using new global resources.
Long-Term Strategic Goal: Lead a major reliability or infrastructure initiative, providing technical recommendations to guide our long-term reliability strategy.

Why Join Us?

At Rossum, we're on a mission to free the world from boring manual data entry. Our AI platform helps companies save millions of hours, allowing professionals to focus on creative, impactful work.

In an exciting move for our future, we have joined forces with Coupa, the world's leading unified platform for Business Spend Management. By combining Rossum's cutting-edge document AI with Coupa's global ecosystem, we are uniquely positioned to redefine how businesses operate at a massive scale. You can read more about this milestone and our shared vision in the official announcement.

What sets us apart?

Cutting-edge AI technology reshaping how businesses operate globally.
A collaborative, supportive environment where autonomy thrives.
Opportunities to grow in a fast-scaling company.
A culture that values diversity, empathy, and genuine connection.

As part of the Coupa family, you'll enjoy the agility of a fast-moving, innovation-focused team with the stability and reach of a global market leader. For you, this means an even greater opportunity to make an impact, access new global markets, and grow your career within a collaborative culture that values autonomy, diversity, and genuine connection. Together, we're not just automating data—we're giving time back to the world's professionals.

What we offer (Benefits)

Work-Life Harmony: 5 weeks of vacation, 5 sick/personal days, and a Birthday Day Off to celebrate you. For new parents, we offer an extra 18 weeks of fully paid Maternity leave and 8 weeks of fully paid Paternity leave.
Flexibility: Hybrid work model with flexible hours, so you can find the balance that works best for you.
Tech & Setup: High-end laptop (MacBook) and tech setup. We also support you with personal development, including language lessons (English & Czech, on all levels).
Community & Well-being: Wellness Days (company-wide days to unplug, reset, and recharge), MultiSport cards, and regular team offsites and meetups in a friendly, ambitious team environment.

Ready to make an impact in your next role? Apply now!

Skills Required

4+ years of experience in SRE, DevOps, platform engineering, or similar production-facing roles
Strong problem-solving and debugging skills in distributed systems
Excellent communication skills (English)
Strong hands-on experience with cloud environments (AWS, GCP, or similar)
Proficiency with infrastructure-as-code and CI/CD pipelines
Familiarity with Kubernetes or other container orchestration
Familiarity with event-driven architectures, supporting ML/AI workloads, or GPU infrastructure

The Company

HQ: Prague

188 Employees

Year Founded: 2017

What We Do

Rossum solves four key steps in document-based processes: receiving documents across multiple channels, automated understanding, two-way communication to resolve exceptions, and acting on the data using in-depth integrations. In typical real-world scenarios, Rossum’s proprietary AI engine outranks narrow data extraction solutions in accuracy. Meanwhile, Rossum’s platform automates the document-based communication process end-to-end. Rossum’s goal for every use case is at minimum a 90% document processing speed increase.

What does Rossum bring to the table?

Zero-friction deployment: See high AI accuracy right out of the box in Rossum’s free trial and cut down on most maintenance effort thanks to cloud hosting and automated self-learning.
Highly customizable: Implement powerful configuration APIs while enterprise users can engage Rossum’s dedicated Global Services team.
Unified document gateway: Solve everything from security and compliance to IT and user training in one place by adopting a universally capable document solution.
End-to-end solution: Rossum’s cloud platform takes care of the entire document lifecycle from receiving to internal IT systems posting.
Security and compliance: Rossum is ISO 27001 certified and HIPAA compliant. The cloud service has been specifically engineered for high availability, with enterprise-grade SLAs ranging up to a 99.9% uptime guarantee and 24/7 support.