Research Engineer - Distributed Training

Hiring Remotely in BRA
Remote
Mid level
Payments • Software
The Role
Design, implement, and maintain a distributed training pipeline for large language models, optimizing performance, memory, and cost across multi-node GPU workloads and collaborating with infra, research, and MLOps teams.
About CloudWalk:
CloudWalk is building the intelligent infrastructure for the future of financial services. Powered by AI, blockchain, and thoughtful design, our systems serve millions of entrepreneurs across Brazil and the US every day.
Our AI team trains large-scale language models that power real products - from payment intelligence and credit scoring to on-device assistants for merchants.

About the Role:
We’re looking for a Research Engineer to design, scale, and evolve CloudWalk’s distributed training stack for large language models. You’ll work at the intersection of research and infrastructure - running experiments across DeepSpeed, FSDP, Hugging Face Accelerate, and emerging frameworks like Unsloth, TorchTitan, and Axolotl.

You’ll own the full training lifecycle: from cluster orchestration and data streaming to throughput optimization and checkpointing at scale. If you enjoy pushing the limits of GPUs, distributed systems, and next-generation training frameworks, this role is for you.

Responsibilities:

  • Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
  • Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters.
  • Optimize performance, memory, and cost across large training workloads.
  • Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
  • Build internal tools and templates that accelerate research-to-production transitions.
  • Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.

Requirements:

  • Strong background in PyTorch and distributed training (DeepSpeed, FSDP, Accelerate).
  • Hands-on experience with large-scale multi-GPU or multi-node training.
  • Familiarity with Transformers, Datasets, and mixed-precision techniques.
  • Understanding of GPUs, containers, and schedulers (Kubernetes, Slurm).
  • Mindset for reliability, performance, and clean engineering.

Bonus:

  • Experience with Ray, MLflow, or W&B.
  • Knowledge of ZeRO, model parallelism, or pipeline parallelism.
  • Curiosity for emerging open-source stacks like Unsloth, TorchTitan, and Axolotl.

Our process is simple: a deep conversation on distributed systems and LLM training, and a cultural interview.

Competitive salary, equity, and the opportunity to shape the next generation of large-scale AI infrastructure at CloudWalk.

Top Skills

Accelerate
Axolotl
DeepSpeed
FSDP
Kubernetes
MLflow
PyTorch
Ray
Slurm
TorchTitan
Unsloth
W&B

The Company
São Paulo, São Paulo
501 Employees
Year Founded: 2013

What We Do

We are democratizing the payments industry in Brazil by empowering entrepreneurs through technological, inclusive, and life-changing solutions. Based in Brazil, CloudWalk is a high-end global payment network built on modern technology and proprietary blockchain, focused on bringing a revolution to the payment ecosystem for small and medium-sized businesses. As a unicorn, the company has provided its customers with more than R$ 1 billion in savings by charging fair fees on its transactions and is now present in more than 300,000 businesses across 5,000 Brazilian cities. With investors such as Valor Capital Group, HIVE Ventures, and Coatue, the company has already raised US$ 365.5 million in investments and R$ 3.4 billion in FIDCs (receivables investment funds) for receivables advances across its network of financial solutions. In 2022, it was the only Brazilian fintech featured in the CB Insights "Retail Tech 100" ranking, in the "Protection Solutions for Payments and Frauds" category.
