Distributed AI Support Engineer

Athens, GRC
In-Office
Entry level
Cloud • Information Technology
The Role
As a Distributed AI Support Engineer, you will support users running AI workflows on supercomputing resources, test AI stacks, troubleshoot issues, and document processes.
Why Join Us

GRNET S.A. provides Internet connectivity, high-quality e-Infrastructures and advanced services to the Greek Educational, Academic and Research community, aiming at minimizing the digital divide and at ensuring equal participation of its members in the global Society of Knowledge. GRNET provides advanced services to the following sectors: Education, Research, Health, Culture.

In 2026, GRNET is expected to host the DAEDALUS supercomputer, which is projected to rank among Europe’s top supercomputers and will also serve the Greek AI factory, Pharos, which has special requirements for AI workflows. DAEDALUS is based on HPE’s NVIDIA GH200 direct liquid-cooled architecture, designed for about 89 petaflops sustained (115 petaflops peak) for traditional HPC, AI and Big Data/HPDA workloads across CPU and GPU-accelerated partitions, backed by 1 PB of high-performance NVMe and 10 PB of usable storage.

As a Distributed AI Support Engineer, you will help researchers, startups, and industry teams turn this cutting-edge infrastructure into real-world AI breakthroughs, working alongside leading European universities, supercomputing centres, and industrial partners in the broader EuroHPC ecosystem. More specifically, you will contribute to the following focus areas. You are not expected to know all the technologies listed below. We are looking for strong AI and Python programming skills, solid fundamentals, and motivation to learn the necessary tools and workflows.

Focus Areas

1. User support and operations

Provide first-line support for AI on HPC workloads (LLM, computer vision and other GPU-accelerated workloads): ticket triage, quick diagnosis of failed runs, escalation when hardware issues are suspected. Support users in writing/reviewing/debugging Slurm job scripts launching multi-GPU/multi-node jobs via torchrun, accelerate launch or deepspeed, and support Ray/DeepSpeed and vLLM inference workflows where appropriate.

2. AI/LLM software stacks and containers

Maintain and test shared AI/LLM and computer-vision stacks for HPC and Cloud (PyTorch, DDP/FSDP, Hugging Face Transformers & Accelerate, PEFT/LoRA, Unsloth, DeepSpeed, Bitsandbytes, TensorFlow, RAPIDS, Ray, vLLM and related tooling), ensuring compatibility with NVIDIA drivers, CUDA and NCCL. Design, publish and support recommended Apptainer/Singularity containers (including NGC-based images) for training, fine-tuning, inference and RAG.

3. Debugging, diagnostics and performance

Diagnose common AI/LLM failures (CUDA errors, NCCL timeouts, GPU OOM, distributed hangs, misconfigured environment). Validate driver/CUDA/NCCL stacks and profile/tune workloads using PyTorch Profiler, NVIDIA Nsight (Systems/Compute), TensorBoard, MLflow and Weights & Biases (WandB).

4. Distributed training, quantisation and inference

Guide users on scalable distributed training with PyTorch DDP/FSDP and DeepSpeed (ZeRO/pipeline/tensor parallelism), plus Ray and higher-level frameworks (PyTorch Lightning, Hydra), mapped to node/GPU topology. Support 8-bit/4-bit quantisation and QLoRA workflows (Unsloth, Bitsandbytes) and large-scale inference frameworks (vLLM, NVIDIA TensorRT-LLM, Triton Inference Server); contribute to AI/LLM and computer-vision benchmarking.

5. Data, storage and I/O

Advise on effective storage use for tokenised datasets, vector indices, checkpoints and logs (layout, sharding, cleanup). Troubleshoot dataloader/I/O bottlenecks and recommend suitable formats and caching/staging, including use of NVIDIA DALI, WebDataset, RAPIDS and Dask where appropriate.

6. Monitoring, evaluation and governance

Monitor AI/LLM usage metrics (GPU hours, job success rates, queue waiting times, typical model sizes/frameworks) to drive improvements in stacks, docs and training. Support Access Call evaluation via technical review of AI/LLM proposals and resource feasibility checks.

7. Documentation, training and community building

Develop and maintain task-oriented documentation and cookbooks for AI/LLM workflows on HPC and Cloud. Prepare hands-on tutorials/demos (PyTorch, TensorFlow, Hugging Face Transformers, vLLM, Ray/DeepSpeed, RAPIDS, JupyterLab/TensorBoard/MLflow).

8. Reporting, deliverables and outreach

Prepare technical reports on trainings offered; maintain dashboards/databases for trainings, KPIs and survey data. Prepare web content (news, training/service pages), coordinate announcements (newsletters, social media), and support stakeholders and user access processes.

Key Technologies and Tools

Frameworks and libraries: PyTorch, DDP, FSDP, Hugging Face Transformers, Accelerate, PEFT/LoRA, Unsloth, DeepSpeed (ZeRO, pipeline, tensor parallelism), Bitsandbytes, QLoRA, torchvision and other common computer-vision libraries; TensorFlow; vLLM; Ray Train; Hugging Face Datasets, SentencePiece, FAISS (faiss-gpu), Gradio, and supporting Python libraries such as SciPy, Matplotlib and Optimum.

Launchers and schedulers: torchrun, accelerate launch, deepspeed, Slurm or similar HPC schedulers, including typical srun / salloc multi-node launch patterns and Ray-based multi-node launchers.

Profiling and debugging: PyTorch Profiler, NVIDIA Nsight Systems/Compute, CUDA tools, NCCL debugging, TensorBoard, MLflow, Weights & Biases (WandB), and HPC debuggers and profilers.

Containers: Apptainer (Singularity) for image creation and migration, and Apptainer-based container workflows.


Requirements

Required Qualifications

•     Degree in Computer Science, Engineering or a related STEM field. Applications from graduating students and recent graduates will be considered.

•     Strong programming skills in Python and experience with AI frameworks and libraries (e.g. PyTorch, TensorFlow, Hugging Face Transformers, vLLM, Ray, etc.).

•     Hands-on experience training or fine-tuning models on GPUs using PyTorch and related tooling (e.g. torchrun, DDP).

•     Ability to communicate technical concepts clearly to researchers and industry users, both in writing (documentation) and in person (training, support).

Desirable Qualifications

•     Familiarity with GPU architectures and concepts relevant to AI on HPC.

•     Experience with LLM or foundation model training/fine-tuning, distributed training frameworks (FSDP, DeepSpeed) and quantisation methods (8-bit/4-bit, QLoRA, PEFT/LoRA, Bitsandbytes, Unsloth).

•     Experience with profiling and monitoring tools (PyTorch Profiler, NVIDIA Nsight Systems/Compute, cluster monitoring stacks).

•     Experience building or maintaining containerised environments for GPU workloads (Apptainer/Singularity) in an HPC context.

•     Prior involvement in user support for HPC or research computing centres, including documentation, training and best-practice guides.


Benefits

GRNET provides a creative, dynamic and challenging working environment that encourages team spirit, cooperation and continuous learning of state-of-the-art technology.

•     Opportunities for international collaborations

•     Competitive remuneration package

•     Opportunities for professional development

•     Modern, friendly and innovative working environment

GRNET is an equal opportunity employer that is committed to diversity and inclusion in the workplace. People with a diverse range of backgrounds are encouraged to apply. We do not discriminate against any person based upon their race, age, color, gender identity and expression, disability, national origin, medical conditions, religion, parental status, or any other characteristics protected by law.

All applications will be treated with strict confidentiality.

The Company
HQ: Athens
216 Employees
Year Founded: 1998

What We Do

GRNET – National Infrastructures for Research and Technology, provides networking and cloud computing services to academic and research institutions, to educational bodies at all levels, and to agencies of the public, broader public and private sector. GRNET holds a key role as the coordinator of all e-infrastructures in Education and Research. With twenty-plus years’ experience in the fields of advanced network, cloud computing and IT infrastructures and services, and significant international presence, GRNET shall advise the Ministry of Digital Governance on issues relating to the design of advanced information systems and infrastructures.

GRNET offers:

•     A nation-wide fiber optic network

•     Large-scale data centers

•     A high-performance computing system

•     Internet, cloud computing, high-performance computing, authentication & authorization services, security services, as well as audio, voice and video services

GRNET interconnects:

•     Universities & Ecclesiastical Academies

•     Research centers

•     Public hospitals

•     The Pan-Hellenic School Network

•     Museums, Libraries and other Cultural Institutions

•     Public Administration

GRNET operates under the auspices of the Ministry of Digital Governance. GRNET receives funding from the Greek State and the European Union.

If you are interested in working at GRNET, you may view all vacancies and apply here: https://apply.workable.com/grnet-sa/
