Platform Engineer, Model Shaping

Reposted 4 Days Ago
Hiring Remotely in San Francisco, CA
In-Office or Remote
200K-290K
Mid level
Artificial Intelligence • Information Technology
The Role
The Platform Engineer will design and build infrastructure for model customization and evaluation, improving platform reliability and internal tooling.
About Model Shaping

The Model Shaping team at Together AI works on products and research for tailoring open foundation models to downstream applications. We build services that allow machine learning developers to choose the best models for their tasks and further improve these models using domain-specific data. We also develop new methods for more efficient model training and evaluation, drawing inspiration from a broad spectrum of ideas across machine learning, natural language processing, and ML systems.

About the Role

As a Platform Engineer at Model Shaping, you will work on the foundational layers of Together’s platform for model customization and evaluation. You will design the infrastructure and backend services that will allow us to sustainably and reliably scale the systems powering production workflows launched by our users, as well as internal research experiments.

You will operate in a cross-functional environment, collaborating with other engineers and researchers on the team to improve the infrastructure based on the needs of their projects. You will also interact with other engineering teams at Together (such as Commerce, Data Engineering, and Cloud Infrastructure) to integrate the services developed by Model Shaping with systems developed by those teams.

Responsibilities
  • Design and build Together’s systems and infrastructure for model customization, including user-facing features and internal improvements
  • Contribute to reliability improvements for the platform, participating in an on-call rotation and improving processes for incident response
  • Create and improve internal tooling for deployment, continuous integration, and observability
  • Build a job orchestration platform spanning multiple data centers, supporting a highly heterogeneous hardware landscape
  • Partner with teams developing internal services, co-designing these services and incorporating them in systems built by Model Shaping
Requirements
  • 3+ years of experience in building infrastructure or backend components of production services
  • Comfortable with the fundamentals of Linux environments and modern container/orchestration stacks (e.g., Docker and Kubernetes) 
  • Strong software engineering background in Python or Go
  • Experienced with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
  • Skilled at analyzing non-trivial issues in complex software systems and documenting your findings
  • Experience administering cloud environments (e.g., AWS/GCP/Azure), preferably a hybrid bare-metal/cloud environment
  • Strong communication skills, willing to document systems and processes and collaborate with peers of varying technical expertise
Stand-out experience
  • Developing large-scale production systems with high reliability requirements
  • Pipeline orchestration frameworks (e.g., Kubeflow, Argo Workflows, Flyte)
  • Managing GPU workloads on HPC clusters, ideally with hands-on experience in operating NVIDIA’s networking stack (e.g., NCCL, Mellanox firmware, GPUDirect RDMA)
  • Deployment of services for AI training or inference
  • Maintaining or contributing to open-source projects
About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, RedPajama, SWARM Parallelism, and SpecExec. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $200,000 - $290,000. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Top Skills

Ansible
ArgoCD
AWS
Azure
Docker
GCP
GitHub Actions
Go
Grafana
Kubernetes
Linux
Prometheus
Python
Terraform

The Company
San Francisco, California
84 Employees
Year Founded: 2022

What We Do

Together AI is a research-driven artificial intelligence company. We contribute leading open-source research, models, and datasets to advance the frontier of AI. Our decentralized cloud services empower developers and researchers at organizations of all sizes to train, fine-tune, and deploy generative AI models. We believe open and transparent AI systems will drive innovation and create the best outcomes for society.
