What You'll Do:
- Own the architecture and maintenance of our distributed training pipeline;
- Train LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate;
- Design and debug multi-node/multi-GPU training runs (Kubernetes-based);
- Optimize training performance: memory usage, speed, throughput, and cost;
- Help manage experiment tracking, artifact storage, and resume logic;
- Build reusable, scalable training templates for internal use;
- Collaborate with researchers to bring their training scripts into production shape.
What We’re Looking For:
- Expertise in distributed training: Experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups;
- Strong PyTorch background: Comfortable writing custom training loops, schedulers, or callbacks;
- Hugging Face stack experience: Transformers, Datasets, Accelerate - you know the ecosystem and how to bend it;
- Infra literacy: You understand how GPUs, containers, and job schedulers work together. You can debug cluster issues, memory bottlenecks, or unexpected slowdowns;
- Resilience mindset: You write code that can checkpoint, resume, log correctly, and keep running when things go wrong;
- Collaborative builder: You don’t mind digging into other people’s scripts, making them robust, and helping everyone train faster.
Bonus Points:
- Experience with Kubernetes-based GPU clusters and Ray;
- Experience with experiment tracking (MLflow, W&B);
- Familiarity with mixed precision, ZeRO stages, model parallelism;
- Comfort with CLI tooling, profiling, logging, and telemetry;
- Experience diagnosing data-loading bottlenecks and working with dataset streaming.
How We Hire:
- Online assessment: technical logic and fundamentals (Math/Calculus, Statistics, Probability, Machine Learning/Deep Learning, Code)
- Technical interview: deep dive into distributed training theory and reasoning (no code)
- Cultural interview
- If you are not willing to take the online assessment, do not apply.
What We Do:
We are democratizing the payments industry in Brazil by empowering entrepreneurs through technological, inclusive, and life-changing solutions. Based in Brazil, CloudWalk is a high-end global payment network built on modern technology and proprietary blockchain, focused on bringing a revolution to the payment ecosystem for small and medium-sized businesses. As a unicorn, the company has provided its customers with more than R$1 billion in savings by charging fair fees on its transactions and is now present in more than 300,000 businesses across 5,000 Brazilian cities. With investors such as the Valor Capital Group, HIVE Ventures and Coatue, the company has already raised US$365.5 million in investments and R$3.4 billion in FIDCs for anticipation of receivables in its network of financial solutions. In 2022, it was the only Brazilian fintech to be featured in the "Retail Tech 100" ranking by CB Insights, in the "Protection Solutions for Payments and Frauds" category.









