Staff Systems Reliability Engineer at The Walt Disney Company
Systems Reliability Engineers use a software engineering approach to architect, design, automate, monitor, and build applications at scale. This includes operating and engineering software in close alignment with business segments to deliver platforms through efficient, effective, and resilient architectures. SREs are talented engineers who focus on improving quality through a data-driven approach: instrumentation, automation, and functional/unit testing. The Staff SRE for the Foundational Services team will lead the other SREs on that team.
This position is for an experienced systems reliability engineer (SRE) eager to play an integral role on the SRE Foundational Services team for The Walt Disney Company to help elevate SRE practices, onboard new technologies, solve complex problems, and integrate next-generation digital platforms.
The Staff SRE will help create, build, and deliver amazing experiences for our guests, fans, and businesses. Primary responsibilities include helping existing, new, and emerging business teams onboard new technologies or platforms to accelerate their businesses. This will include consulting on, designing, building, and supporting development pipelines; automating infrastructure and operations; creating telemetry for monitoring; engineering for high reliability; and reinforcing best practices to secure our company and guest data.
The Staff SRE is expected to have expert-level systems administration skills on Linux and Windows platforms, and must have experience with software development (e.g. Python, Go, Java, Node), CI pipeline tools (e.g. GitLab CI, Jenkins), Git source management, cloud hosting (AWS, GCP & Azure), container computing (e.g. Docker, serverless technologies), web technologies, and the DevOps team culture. This position will also bring expertise in systems, networking, operational excellence, and application stability, security, performance, and capacity management, as well as documentation.
The Staff SRE must be prepared to work with engineering, creative, and production teams in an extremely collaborative and high-energy environment to brainstorm, architect, gather requirements, troubleshoot, and provide stellar customer support. The ideal Staff SRE is passionate about constantly learning and about taking technology to the next level to solve complex problems. They are a highly motivated, optimistic, proactive, and creative thought leader and project manager who works closely with our Business Units & Segments.
- (Architecture) - Leads the architecture for a complete inter-connected set of applications that takes into account future industry direction and business product alignment.
- (Collaboration) - Communicates effectively with executive management.
- (Collaboration) - Forms partnerships with other Staff and Sr. Staff members to see where they can drive cross-team efficiencies.
- (Communication) - Tracks, communicates, and improves time spent resolving operational issues.
- (Reliability Engineering) - Designs architecture that fails gracefully and advocates for the integration of those solutions into the software products.
- (Security) - Ensures application communication and data practices follow security best practices.
- (Systems Integration) - Guards infrastructure against the introduction of unnecessarily complex solutions.
- (Software Engineering) - Designs tools that facilitate ease of management and operations of applications, systems, and infrastructure on behalf of product teams.
- (Quality Engineering) - Contributes to the design or requirements definition for integrating Quality Assurance tools and toolchains into SRE development and workflow processes, specifically configuration management, orchestration, and toolchain development and support.
Basic Qualifications:
- Typically has 7 or more years of experience with relevant internet technologies and with implementing, administering, and supporting production websites and backend support systems.
- Expertise in multiple scripting languages and advanced skills in programming languages (e.g. Go, Python, Ruby, Dart, Node, Java, and similar), with the ability to build test coverage for all software being developed.
- Systems administration skills on Linux and Windows platforms
- Has experience automating the operations of large systems (Chef, Terraform, Ansible, etc.)
- Networking skills and protocols (e.g. HTTP, TLS, SSH, DNS)
- Software Development Continuous Integration (CI) Pipeline knowledge (e.g. Jenkins, GitLab CI)
- Has experience administering Splunk or other large enterprise solutions (WebSphere, SAP, etc.)
- Expertise with Distributed Systems and Container Platforms (e.g. Kubernetes/GKE, ECS, Mesos, Fargate, Nomad)
- Experience with Source Control Management systems (e.g. Git)
- Expertise in public and private cloud hosting services (AWS, Google Cloud, Azure)
- Recognized as a subject matter expert on at least one OS and proficient in multiple operating systems, including OS performance monitoring, setup, configuration, tuning, and troubleshooting.
- Proficient in web server technologies (e.g. Apache, Node.js, NGINX, Tomcat, IIS, Caddy Server) including setup, configuration, performance monitoring, tuning, clustering, and debugging (e.g. JConsole).
- Proficient with data technologies (e.g. NoSQL, MySQL, MongoDB, Redis, Elastic) including being able to perform basic setup, configuration, and troubleshooting.
- Able to implement existing base standards for new systems and/or applications for all of the following:
  - Site/Systems monitoring and instrumentation
  - Application monitoring and instrumentation
  - System monitoring and instrumentation
  - Resilience, performance, and telemetry data
- Able to diagnose simple to complex systems and process problems.
- Able to perform and provide in-depth analysis of load test runs against a moderately complex system.
- Demonstrate exceptional troubleshooting methodology, including the ability to author and instruct new methodologies to the SRE team.
- Independently resolve moderately to highly complex system and application incidents.
- Able to identify and propose system and application fixes for performance bottlenecks.
- Able to evaluate new application requirements for capacity and run-time best practices.
- Able to evaluate new system and/or infrastructure solutions for technical feasibility against known requirements and standards.
- Effective at dealing with change: Able to transition in role or handle a significant modification in technology with minimal ramp-up time and very little guidance.
- Documentation: Creation of Application Infrastructure Design documents, Operational Run books, and Knowledge Base Articles
- Interprets internal or external business strategies, opportunities and trends and recommends best practices for the business
- Solves complex problems; takes a broad perspective to identify innovative solutions
- Works independently, with guidance in only the most complex situations
- May lead functional teams or projects
- Provides mentoring to junior members of the engineering teams.
- Influences the direction and adoption of technology across multiple engineering teams and businesses.
- Able to present technical subjects to both technical and non-technical audiences, large forums, and executives
Additional Information: