Software Engineer, Infrastructure Generalist

Posted 13 Days Ago
San Francisco, CA
In-Office
300K-350K
Senior level
Artificial Intelligence • Information Technology
The Role
The Staff Software Engineer will design and build scalable infrastructure for LLM research, collaborate with researchers, and optimize system performance.
Summary Generated by Built In

Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. 

We are a small team of scientists, engineers, and builders who've created some of the most widely used AI products, like ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About This Role

We're looking for a Staff Software Engineer—a generalist across the backend—to help build the systems that power our foundation models.

You'll join a small, high-impact team responsible for architecting and scaling the core infrastructure behind everything we do. You’ll work across the full technical stack, solving complex distributed systems problems and building robust, scalable platforms.

Infrastructure is critical to us: it's the bedrock that enables every breakthrough. You'll work directly with researchers to accelerate experiments, improve infrastructure efficiency, and enable key insights across our models, products, and data assets. 

What You’ll Do
  • Design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities.
  • Develop high-throughput systems for data ingestion, processing, and transformation — including training data catalogs, deduplication, quality checks, and search.
  • Build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.
  • Implement and maintain monitoring and alerting to support platform reliability and performance.
  • Collaborate with research teams to unlock new features, improve system efficiency, and accelerate training cycles.
Required Qualifications
  • Technical expertise:
    • 5+ years of experience building distributed systems, ideally supporting high-scale applications or research platforms.
    • Fluent in containerization, orchestration, and distributed compute frameworks.
    • Hands-on experience with Kubernetes, Terraform, service discovery, and workflow orchestration tools.
    • Experience with network programming, load balancing, or distributed consensus systems.
    • Extensive experience with performance optimization, caching strategies, and system scalability patterns.
    • Deeply familiar with cloud infrastructure, microservices architectures, and both synchronous and asynchronous processing.
    • Strong knowledge of databases, storage systems, and how architecture choices impact performance at scale.
    • Proactive about automation, testing, and building tools that empower engineering teams.
  • System Design & Performance:
    • Strong proficiency in systems programming languages (Rust) and scripting (Python).
    • Familiarity with performance profiling and optimization in high-throughput distributed environments.
    • Track record of architecting resilient systems and debugging complex production issues.
    • Excellent communication and collaboration skills.
Strong Candidates May Also Have
  • Experience supporting machine learning training infrastructure or GPU clusters
  • Background at AI research labs, high-performance computing centers, or ML-focused companies
  • Published work on distributed systems, infrastructure, or performance optimization
  • Open-source contributions to infrastructure projects, orchestration tools, or distributed computing frameworks
  • Experience with specialized hardware (GPUs, TPUs) and its integration into distributed training systems
Logistics
  • Location: This role is based in San Francisco, California. 
  • Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
  • Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
  • Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $300,000-$350,000 USD.
  • We encourage you to apply even if you do not believe you meet every single qualification.
  • As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.

Top Skills

Cloud Infrastructure
Databases
Distributed Systems
Kubernetes
Python
Rust
Terraform

The Company
HQ: San Francisco, CA
91 Employees

What We Do

Thinking Machines Lab is an artificial intelligence research and product company. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.

While AI capabilities have advanced dramatically, key gaps remain. The scientific community's understanding of frontier AI systems lags behind rapidly advancing capabilities. Knowledge of how these systems are trained is concentrated within the top research labs, limiting both the public discourse on AI and people's abilities to use AI effectively. And, despite their potential, these systems remain difficult for people to customize to their specific needs and values. To bridge the gaps, we're building Thinking Machines Lab to make AI systems more widely understood, customizable and generally capable.

We are scientists, engineers, and builders who've created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
