Description
Who we are:
CloudWalk is a fintech company reimagining the future of financial services. We are building intelligent infrastructure powered by AI, blockchain, and thoughtful design. Our products serve millions of entrepreneurs across Brazil and the US every day, helping them grow with tools that are fast, fair, and built for how business actually works. Learn more at cloudwalk.io.
Who We’re Looking For:
We’re looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models. You’ll work inside our GPU cluster to help researchers train and scale foundation models using frameworks like Hugging Face Transformers, Accelerate, DeepSpeed, FSDP, and others. Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale.
This role is not a research position; it's about building and scaling the systems that let researchers move fast and models grow big. You'll work closely with MLOps, infra, and model developers to make our training runs efficient, resilient, and reproducible.
If you’ve trained LLMs before, or helped others do it better, this role is for you. Even if you don’t check every box, we want to hear from you if you’re confident working with distributed compute and real-world LLM workloads.
Technologies
Go, PyTorch, Foundation, MLflow, Blockchain, Transformers, DeepSpeed
Nice to Have
Experience managing Kubernetes-based GPU clusters and employing tools like Ray for orchestration.
Familiarity with experiment tracking tools like MLflow and Weights & Biases (W&B).
Understanding of mixed precision training, ZeRO stages, and model parallelism techniques (see the sketch after this list).
Proficiency with command-line tooling for profiling, logging, and monitoring machine learning workflows.
Ability to identify and resolve data loading bottlenecks and implement dataset streaming.
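To make the mixed precision and ZeRO items concrete, here is a minimal, hypothetical sketch of how such settings can be expressed through Hugging Face Accelerate's DeepSpeed plugin. The specific values are illustrative assumptions, not CloudWalk's configuration, and the snippet presumes the accelerate and deepspeed packages are installed.

```python
# Hypothetical sketch: enabling bf16 mixed precision and a ZeRO stage via
# Accelerate's DeepSpeed plugin. Values are illustrative, not a recommendation.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                   # ZeRO stage 2: shard optimizer state and gradients
    gradient_accumulation_steps=4,  # trade step frequency for a larger effective batch
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)
```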
Must Have
Expertise in distributed training with DeepSpeed, FSDP, or Hugging Face Accelerate in multi-GPU or multi-node environments.
Strong PyTorch programming skills, including the ability to write custom training loops and implement callbacks.
Experience with the Hugging Face ecosystem, including Transformers and Datasets, with the ability to integrate it effectively into projects.
Proficient understanding of GPU infrastructure, container technology, and job scheduling, with the ability to troubleshoot related issues.
A mindset focused on resilience, ensuring code can checkpoint, resume, log, and recover from errors during training (see the sketch after this list).
Ability to collaborate effectively with researchers, refining their training scripts for production readiness.
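As a rough illustration of the custom-training-loop and checkpoint-resume skills above, here is a minimal sketch using PyTorch and Hugging Face Accelerate. The toy model, dataset, and checkpoint path are placeholders standing in for an LLM workload; this is not CloudWalk's training code.

```python
# Hypothetical sketch: a custom training loop with Accelerate that can
# checkpoint, resume, and log. Model, data, and paths are placeholders.
import os
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # handles device placement and distributed launch config

# Toy model and data stand in for an LLM and a tokenized dataset.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Wrap everything so the same script runs on one GPU or many nodes.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

checkpoint_dir = "checkpoints/latest"  # placeholder path
if os.path.isdir(checkpoint_dir):
    accelerator.load_state(checkpoint_dir)  # resume after preemption or failure

loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(3):
    for step, (inputs, labels) in enumerate(dataloader):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        accelerator.backward(loss)  # replaces loss.backward() in distributed runs
        optimizer.step()
        if step % 10 == 0:
            accelerator.print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
    accelerator.save_state(checkpoint_dir)  # checkpoint model, optimizer, RNG state
```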