LLM Inference Deployment Engineer

Posted 21 Days Ago
Hiring Remotely in Canada, KS, USA
In-Office or Remote
180K-240K Annually
Mid level
Artificial Intelligence • Hardware • Software
The Role
The LLM Inference Deployment Engineer will deploy and scale large language models efficiently on AI accelerators, focusing on model optimization and low-latency inference.

EnCharge AI is a leader in advanced AI hardware and software systems for edge-to-cloud computing. EnCharge’s robust and scalable next-generation in-memory computing technology provides orders-of-magnitude higher compute efficiency and density compared to today’s best-in-class solutions. The high-performance architecture is coupled with seamless software integration and will make the immense potential of AI accessible in power-, energy-, and space-constrained applications. EnCharge AI launched in 2022 and is led by veteran technologists with backgrounds in semiconductor design and AI systems.

About the Role

EnCharge AI is seeking an LLM Inference Deployment Engineer to optimize, deploy, and scale large language models (LLMs) for high-performance inference on its energy-efficient AI accelerators. You will work at the intersection of AI frameworks, model optimization, and runtime execution to ensure efficient model execution and low-latency AI inference.

Responsibilities

  • Deploy and optimize post-training LLMs (GPT, LLaMA, Mistral, Falcon, etc.) sourced from libraries such as Hugging Face.
  • Use inference runtimes such as ONNX Runtime and vLLM for efficient execution.
  • Optimize batching, caching, and tensor parallelism to improve LLM scalability in real-time applications.
  • Develop and maintain high-performance inference pipelines using Docker, Kubernetes, and inference servers.

Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or related field.
  • Experience in LLM inference deployment, model optimization, and runtime engineering.
  • Strong expertise in LLM inference frameworks (PyTorch, ONNX Runtime, vLLM, TensorRT-LLM, DeepSpeed).
  • In-depth knowledge of the Python programming language for model integration and performance tuning.
  • Strong understanding of high-level model representations and experience implementing framework-level optimizations for Generative AI use cases.
  • Experience with containerized AI deployments (Docker, Kubernetes, Triton Inference Server, TensorFlow Serving, TorchServe).
  • Strong knowledge of LLM memory optimization strategies for long-context applications.
  • Experience with real-time LLM applications (chatbots, code generation, retrieval-augmented generation). 

EnCharge AI is an equal employment opportunity employer in the United States.

The salary range for this position is $180,000 to $240,000 USD ($175,000 to $245,000 CAD) per year. Actual compensation offered will be determined based on job-related knowledge, skills, and experience.

The Company
HQ: Santa Clara, CA
31 Employees
Year Founded: 2022
