Software Engineer, LLM Training Frameworks

Posted 25 Days Ago
San Francisco, CA
160K-230K Annually
3-5 Years Experience
Artificial Intelligence • Information Technology
The Role
Develop and manage large-scale distributed systems within the AI/ML infrastructure domain. Design and optimize AI/ML solutions, develop automation tools, and collaborate with cross-functional teams. Research and deploy machine learning systems for distributed training and inference. Required: Bachelor's degree in Computer Science, 5-8+ years of software engineering experience, proficiency in Python, Go, C++, cloud platforms, and machine learning frameworks. Preferred: In-depth AI/ML workflow understanding, containerization technologies experience, problem-solving skills, continuous learning mindset.
Summary Generated by Built In

Job Responsibilities

  1. Infrastructure Development: Identify and resolve infrastructure gaps to ensure reliable, efficient, and scalable solutions.
  2. AI/ML Solutions: Develop advanced AI/ML infrastructure solutions to enhance the efficiency of our ML teams.
  3. System Design: Design and implement solutions for distributed storage systems, scheduling systems, high availability, and core reliability issues within large-scale GPU clusters.
  4. Performance Optimization: Monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization.
  5. Automation Tools: Develop and deploy automation tools, monitoring solutions, and operational strategies to streamline infrastructure management and reduce manual tasks.
  6. Collaboration: Work with various teams, including ML developers, data engineers, and DevOps professionals, to create a cohesive and integrated AI/ML infrastructure ecosystem.
  7. Parallel Training: Optimize large-scale parallel training for state-of-the-art deep learning algorithms, including large language models, multi-modality models, diffusion, and reinforcement learning.
  8. Research & Development: Research and develop our machine learning systems, including accelerated computing architecture, management, and monitoring.
  9. Deployment: Deploy machine learning systems for distributed training and inference.
  10. Cross-layer Optimization: Manage cross-layer optimization across systems, AI algorithms, and machine learning hardware (GPU, ASIC).

Minimum Qualifications

  1. Bachelor's degree in Computer Science, Engineering, or a related technical field.
  2. 5-8+ years of experience in software engineering, with a strong background in developing and managing large-scale distributed systems, ideally within the AI/ML infrastructure domain.
  3. Proficiency in programming languages such as Python, Go, or C++, with knowledge of cloud computing platforms like AWS, Azure, etc.
  4. Familiarity with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX. Basic understanding of GPU and/or ASIC functionality.
  5. Expertise in at least one of the following programming languages in a Linux environment: C/C++, CUDA, Python.
  6. Familiarity with open-source distributed scheduling/orchestration/storage frameworks, such as Kubernetes (K8s), YARN (Flink, MapReduce), HDFS, Redis, S3, etc., with practical experience in machine learning system development.
  7. Mastery of distributed systems principles and participation in the design, development, and maintenance of large-scale distributed systems.
  8. Strong communication and collaboration abilities, effective in working with diverse teams and individuals.

Preferred Qualifications

  1. In-depth understanding of AI/ML workflows, including model training, data processing, and inference pipelines.
  2. Practical experience with containerization technologies (Docker, Kubernetes), automation tools, and monitoring solutions (Prometheus, Grafana).
  3. Exceptional problem-solving skills, capable of analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
  4. A passion for continuous learning and staying abreast of new technologies and best practices in the AI/ML infrastructure space.
  5. Experience with GPU-based high-performance computing and RDMA high-performance networks (MPI, NCCL, ibverbs).
  6. Familiarity with distributed training framework optimizations (e.g., DeepSpeed, FSDP, Megatron, GSPMD).
  7. Knowledge of AI compiler stacks (torch FX, XLA, MLIR).
  8. Experience with large-scale data processing and parallel computing.
  9. In-depth CUDA programming and performance tuning experience (CUTLASS, Triton).


About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy  


Top Skills

C++
Go
Python
The Company
San Francisco, California
84 Employees
On-site Workplace
Year Founded: 2022

What We Do

Together AI is a research-driven artificial intelligence company. We contribute leading open-source research, models, and datasets to advance the frontier of AI. Our decentralized cloud services empower developers and researchers at organizations of all sizes to train, fine-tune, and deploy generative AI models. We believe open and transparent AI systems will drive innovation and create the best outcomes for society.
