Job Description
- GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
- Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times.
- Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
- Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into GenBio AI’s model development and deployment frameworks.
- Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements.
- Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.
- Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.
Job Requirements
- Master’s or Ph.D. degree in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
- 2+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
- Strong expertise in distributed deep learning and parallel training techniques.
- Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
- Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
- Knowledge of performance profiling and optimization tools for HPC and deep learning.
- Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
- Strong background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
What We Do
GenBio.AI, Inc. (GenBio AI) is an innovative global startup dedicated to developing the world's first AI-driven Digital Organism, an integrated system of multiscale foundation models for predicting, simulating, and programming biology at all levels.
Our goal is to achieve a comprehensive, actionable, empirical understanding of the mechanisms underlying all organismal physiologies and diseases. This will pave the way for a new paradigm in drug design, bio-engineering, personalized medicine, and fundamental biomedical research, all powered by Generative Biology.
Our founding team consists of world-renowned scientists and researchers in AI and Biology from prestigious institutions such as CMU, MBZUAI, and WIS, alongside prominent financial investors.
GenBio AI, a truly global effort from day one, is establishing offices in Palo Alto, Paris, and Abu Dhabi.