Senior Software Engineer - Ceph

Reposted 22 Days Ago
Toronto, ON
Hybrid
150K-250K Annually
Senior level
Artificial Intelligence • Machine Learning
The Role
The role involves managing and maintaining Ceph clusters for a deep learning data center, integrating infrastructure, and troubleshooting related systems.
Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior Software Engineer with deep expertise in managing Ceph for our deep learning datacenter in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, InfiniBand, NVIDIA DeepOps, Layer 3 networking, and related tools is a big plus. You should be comfortable performing some hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 25 PB of disk and over 5 PB of flash storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating Ceph and integrating it with a broad range of infrastructure technologies and hardware systems.

You MUST have prior Ceph experience in order to qualify for the job. If you don't, please don't spam the ATS.

A day in the life:

  • Design, manage and maintain large storage arrays
  • Integrate them with Deep Learning infrastructure
  • Support troubleshooting for MAAS, Slurm and Kubernetes as needed
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in maintaining Ceph clusters
  • Experience with high-performance computing (highly desirable)
  • Experience with on-premises Data Center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience managing firmware and system updates, e.g. on Supermicro hardware

The ability to solve problems and to learn new techniques is key.

Top Skills

Ceph
InfiniBand
Layer 3 Networking
Linux
MAAS
NVIDIA DeepOps
Python
Slurm

The Company
Santa Clara, CA
21 Employees
Year Founded: 2023

What We Do

We are transforming how stories are told, knowledge is learned, and insights are gathered.
