SRE/Deployment Team Lead

Sorry, this job was removed at 09:03 p.m. (CST) on Wednesday, Aug 21, 2024
Hiring Remotely in US
Remote
5-7 Years Experience
Machine Learning
The Role

Comet is accelerating the machine learning development process for data science and ML teams. From the individual data scientist tracking training runs to the enterprise team moving hundreds of models into production, Comet is the platform used by some of the most innovative builders in the industry. We started Comet to make it possible for teams to manage and optimize models across the complete ML lifecycle and achieve business value faster. 

Working in Comet’s fast, dynamic startup environment is challenging and fun. We are looking for people who are customer-focused, work collaboratively, and want to be a voice in advancing Comet’s leadership in the marketplace. If you are excited about empowering technology innovators around the globe in creating world-changing machine learning models, Comet is the right place for you.

Comet is backed by more than $63 million in venture-capital funding, and we are the MLOps platform of choice for teams at Ancestry, The RealReal, Uber, WorkFusion, and Zappos. We are a remote-first company with offices in New York City (U.S.A.) and Tel Aviv (Israel). And we’re just getting started. CRN featured Comet as one of the 10 hottest machine learning and data science startups in 2021.

Comet is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees without regard to race, religion, color, sex, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship status, uniform service member status, marital status, pregnancy, age, medical condition, physical or mental disability, genetic information/characteristics, and any other characteristic protected by State or Federal law.

We are seeking an experienced and dynamic SRE/Deployment Team Lead to join our growing team. The ideal candidate will have a strong background in software engineering and system administration, and a passion for automation, reliability, and performance. Some flexibility in work hours is required to collaborate with a global team based in Tel Aviv and Europe. As a lead, you will be responsible for designing, implementing, and maintaining our deployment, ensuring the stability and scalability of our infrastructure, and leading a team of talented engineers.

  • Oversee the deployment, monitoring, and maintenance of production systems.
  • Develop and maintain all deployment options for Comet, including multi-cloud, on-premises, and bare-metal deployments, on a single Linux server or with containerization technologies such as Kubernetes.
  • Quickly identify and resolve infrastructure bugs, ensuring high system availability and reliability.
  • Implement and maintain infrastructure as code using tools such as Terraform, Ansible, or similar.
  • Ensure high availability, scalability, and reliability of services and applications.
  • Work closely with customers to understand their deployment needs and provide effective support for deploying and maintaining Comet on their infrastructure.
  • Collaborate with cross-functional teams, including development, QA, support, and other teams, to ensure seamless integration and successful deployment of new features and updates.
  • Mentor and lead a team of DevOps, SRE, and deployment engineers.
  • Conduct regular performance tuning, troubleshooting, and root cause analysis.
  • Stay updated with the latest industry trends, technologies, and best practices in DevOps and SRE.
  • Implement and manage observability tools for monitoring, logging, and alerting.

  • 5+ years of experience in a DevOps, SRE, or similar role is a MUST.
  • At least 1-2 years of proven experience leading and mentoring a team of engineers is a MUST.
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proficient in Linux system internals, scripting, and configuration management tools (Bash/Python/Ansible).
  • Strong expertise in cloud platforms such as AWS, GCP, or Azure.
  • Proficiency in scripting languages (e.g., Python, Bash, Go).
  • Experience with containerization and orchestration tools such as Docker and Kubernetes.
  • Familiarity with cloud-based infrastructure services such as EC2, RDS, S3, and VPC, and with related tools such as CloudFormation and Terraform.
  • In-depth knowledge of CI/CD tools like Jenkins, GitLab CI, CircleCI, or similar.
  • Experience with monitoring applications such as Prometheus, Grafana, or ELK stack.
  • Solid understanding of networking concepts, security best practices, and system architecture.
  • Excellent communication skills, both verbal and written, to effectively collaborate with team members and clients.
  • Passionate about troubleshooting and investigating in unfamiliar environments.
  • Excellent problem-solving skills and the ability to work under pressure.

  • Experience with microservices architecture and serverless computing.
  • Knowledge of configuration management tools (e.g., Chef, Puppet).
  • Understanding of database management and optimization.
  • Certifications in relevant technologies or platforms (e.g., AWS Certified DevOps Engineer).

  • Competitive salary - $200-250k based on proven experience, skills and location.
  • Competitive benefits package.
  • Flexible working hours and remote work options.
  • Opportunities for professional growth and development.
  • A collaborative and innovative work environment.
  • The chance to work with cutting-edge technologies and projects.
The Company
HQ: New York, NY
87 Employees
On-site Workplace
Year Founded: 2017

What We Do

Comet is a meta machine learning platform designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining and connecting the machine learning model lifecycle. By leveraging Comet, users can employ machine learning experiment tracking to track, compare, explain and reproduce their models. Backed by thousands of users and multiple Fortune 100 companies, Comet provides insights and data to build better, more accurate AI models while improving productivity, collaboration and visibility across teams.
