Responsibilities
- Champion Reliability by Design: Collaborate with architects and engineers to build resilient, fault-tolerant systems across our evolving cloud-native stack.
- Observability Overhaul: Lead the charge on full-stack observability, leveraging modern APM tooling, meaningful SLOs/SLIs, and actionable alerts.
- Scaling Systems: Develop and implement auto-scaling strategies, load testing plans, and capacity forecasting for multi-tenant environments.
- Progressive Delivery: Help implement and automate deployment strategies such as canary releases, feature flags, and blue/green rollouts.
- Incident Response: Create and refine on-call processes, incident response playbooks, and blameless post-mortem routines.
- Monitoring & Tooling: Own and evolve our monitoring infrastructure, integrating metrics, logs, and traces into a cohesive ecosystem.
- Developer Empowerment: Build reusable templates, dashboards, and platform tooling to empower dev teams to “shift left” on reliability.
- Cross-functional Collaboration: Work hand-in-hand with Infrastructure, Architecture, Support, and Engineering teams to drive shared accountability for uptime and performance.
Skills
- 5+ years in SRE, DevOps, or Production Engineering roles, ideally within a SaaS or cloud-native environment.
- Deep experience with cloud platforms (preferably Azure or AWS) and Infrastructure-as-Code tools (e.g., Terraform).
- Hands-on experience with Azure DevOps is strongly preferred, as our CI/CD and project workflows are fully built around it.
- Proficiency with observability tools such as New Relic, Datadog, Prometheus, or similar.
- Strong understanding of software deployment strategies, CI/CD pipelines, and release engineering.
- Ability to code in at least one modern scripting or systems language (e.g., Python, PowerShell, Go, Bash).
- Experience operating multi-tenant environments with an emphasis on security, performance, and cost optimization.
- Excellent communicator who thrives in cross-functional settings and can influence engineering culture around reliability.
Desirable Skills
- Experience in regulated industries (e.g., financial services, healthcare).
- Background with service mesh architectures, distributed tracing, and gRPC/GraphQL.
- Familiarity with incident management platforms (e.g., PagerDuty, OpsGenie).
- Contributions to open-source SRE tooling or frameworks.
What We Do
StarCompliance is the world's leading provider of compliance software to the global financial industry. Our clients include asset managers, broker-dealers, private equity firms, insurance providers, investment banks, and diversified financial institutions. Our scalable, easy-to-use solutions provide a 360-degree view of employee and business activity to help firms monitor and reduce risk, meet regulatory obligations, gain efficiencies, and drive employee adoption.
Our Employee Conflicts of Interest suite provides clients a single place for monitoring and mitigating potential employee conflicts, covering: personal trading activity; insider trading; private investments; gifts and entertainment spending; outside business activities; and political donations. The STAR Mobile app supports personal trading pre-clearance requests and gifts and entertainment spending submissions, and allows compliance officers and employee supervisors to review and approve those requests and submissions on the go. Compliance Control Room centralizes all firm deal-related activity, automatically surfacing critical data that might otherwise be missed and allowing for easier conflict searches, so deals can be cleared faster and with greater confidence.