Responsibilities
- Serve as a first responder during cluster outages and incidents; triage and resolve urgent issues as they arise
- Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
- Diagnose systemic or recurring patterns of problems, and engineer targeted solutions in collaboration with engineering teams
- Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
- Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for those policies
- Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability
Requirements
- 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
- Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
- Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
- Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
- Experience with cloud infrastructure (AWS or GCP)
- Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
- Experience with distributed storage technologies (Lustre, Ceph, S3)
- A "systems engineer" rather than "system administrator" mindset: thinking systematically and leveraging automation
- Bachelor's degree in computer science
Preferred Qualifications
- Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
- Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
- Familiarity with hybrid/on-prem environments
- Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
- Experience with HPC networking (InfiniBand, RDMA)
- Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)
What We Do
Founded in 2007 by two machine learning scientists, The Voleon Group is a quantitative hedge fund headquartered in Berkeley, CA. We are committed to solving large-scale financial prediction problems with statistical machine learning.
The Voleon Group combines an academic research culture with an emphasis on scalable architectures to deliver technology at the forefront of investment management. Many of our employees hold doctorates in statistics, computer science, and mathematics, among other quantitative disciplines.
Voleon's CEO holds a Ph.D. in Computer Science from Stanford and previously founded and led a successful technology startup. Our Chief Investment Officer and Head of Research is Statistics faculty at UC Berkeley, where he earned his Ph.D. Voleon prides itself on cultivating an office environment that fosters creativity, collaboration, and open thinking. We are committed to excellence in all aspects of our research and operations, while maintaining a culture of intellectual curiosity and flexibility.
The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.