SafetyCulture is a global technology company that is helping to transform workplaces around the world. After witnessing the tragedy of workplace incidents as a private investigator, SafetyCulture Founder Luke Anear recruited a team to help him develop a mobile solution for frontline workers. What we have created is a market-leading workplace operations platform that helps give teams the knowledge, tools, and confidence they need to meet higher standards, work safely, and improve every day.
SafetyCulture is among the fastest-growing tech companies in Australia. Its bold ambition is to reach 100 million users worldwide by 2032. Opportunities to be part of a journey like this do not come around often.
What do Staff SREs at SafetyCulture do?
SREs at SafetyCulture are the champions of our platform's reliability. As a Staff engineer, you’ll be empowered to manage complex architectural decisions, solve cross-domain challenges, and drive cultural change across engineering.
What will you be doing?
- Design, develop and support our Observability platform (Prometheus, Loki and Grafana Cloud);
- Work across engineering teams to define Service Level Objectives;
- Instrument Go microservices using OpenTelemetry;
- Write and maintain Go modules providing fundamental capabilities to our applications (e.g. tracing and logging);
- Drive a culture of Incident Management, including how we learn and improve from incidents;
- Engage with teams across Engineering on reliability and performance issues;
- Educate and drive the SRE mandate across the organisation.
The successful applicant will:
- Work closely with our wider engineering team to understand what reliability metrics will enable them to prioritise production stability, and where additional instrumentation will assist in diagnosing complex issues. You’ll be a go-to expert on using our observability platform to uncover the root cause of problems.
- Manage and scale an observability platform that ingests millions of metric series and terabytes of trace and log data, and provide an opinionated stance on reliability through curated dashboards and SLOs.
- Evolve our existing Incident Management process and tooling, enabling all of our engineering teams to mitigate, learn and drive improvements when things go wrong.
- Partner closely with the Engineering Leadership Team to define and share key reliability findings based on production telemetry and incident reviews.
- Develop capabilities and tooling that enable our engineering community to clearly understand how their production services are running, and how they can diagnose where performance and reliability issues come from.
- Collaborate with engineering teams across SafetyCulture to help them instrument their services, understand their observability telemetry, and diagnose complex problems within our microservice architectures. You’ll have opportunities to directly contribute to reliability improvements, and to grow your passion within the SRE space.
You will have experience in:
- Expertise in operating Observability platforms at scale.
- Strong technical leadership in SRE concepts.
- Knowledge of best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations.
- Experience in designing and building complex software and systems at scale.
Your professional background will include:
- Tertiary degree in Computer Science or related technical field, or equivalent practical experience.
- 8+ years of relevant experience in software development, plus mentorship experience.
- A solid understanding of monitoring, logging, tracing, and observability instrumentation.
- Experience working with observability platforms like Grafana / Datadog / New Relic / Honeycomb.
- A solid background in SRE concepts like SLOs.
- Experience in defining and driving a culture of Incident Management.
- Proven experience of working on complex and large-scale projects that require high-level technical skills, creativity, and leadership.
- Proficiency with one or more general-purpose programming languages, including but not limited to: C#, Golang, C++, Python, Java, TypeScript, Scala.
What Do I Get Access To When Working at SafetyCulture?
- Equity with high growth potential, and a competitive salary.
- Hybrid working; we encourage you to create the best work blend while working from your home and the local SafetyCulture office.
- Access to professional and personal training and development opportunities.
- Participation in hackathons, workshops, and lunch & learn sessions.
- Community involvement, open source work, attending talks and events, and experimenting with new technologies.
What are the office benefits?
- In-house Culinary Crew serving up daily breakfast, lunch, and snacks.
- Barista coffee machine, craft beer on tap, boutique wines, and a range of non-alcoholic beverages.
- Quarterly celebrations and team events.
- Table tennis, board games, book library, and pet-friendly office.
We’re committed to building inclusive teams and cultivating a sense of belonging so our people can bring their whole authentic selves to work each day. We seek to make reasonable adjustments throughout our recruitment process to create an even playing field for all candidates. Thanks to the tireless efforts of the entire SafetyCulture team, we’ve built an incredible culture which has seen us recognised as a Best Place to Work in Australia, the US, and the UK.
Even if you don't meet every requirement listed in the ad, please consider applying for this role. We prioritise inclusion and value individuals with potential over a checklist of qualifications. Don't rule yourself out—hit that apply button if this job resonates with you.
You can find out more about life at SafetyCulture via YouTube, Twitter, Instagram, and LinkedIn.
To all recruitment agencies, we do not accept resumes or partnership opportunities. Please do not forward resumes to SafetyCulture or any of our employees. We are not responsible for any fees associated with unsolicited resumes.
What We Do
SafetyCulture is a global technology company that puts the power of continuous improvement into everyone's hands. Our operations platform unlocks the power of observation at scale, giving leaders visibility and workers a voice in driving quality, efficiency, and safety improvements.
More than 60,000 customers use our operations platform to perform checks, train staff, report issues, and automate tasks. In doing so, we drive processes that help businesses get better every day.
Recent analysis by Forrester found that our flagship products provide a 214% return on investment for customers, and USD $3.6M in cost savings from operational improvements.
From top Australian ASX-listed grocer and retailer Coles and American aviation giant JetBlue to Europe’s largest hospitality multinational Accor, our operations platform is helping teams in every industry.