Inworld AI

Staff / Principal Machine Learning Engineer, Serving - Switzerland

Hiring Remotely in Switzerland

Remote

125K Annually

Mid level

Software

The Role

Seeking a Machine Learning Engineer with expertise in inference optimization, model acceleration, and distributed systems to enhance AI models for real-time applications.

About Inworld

Inworld is a research lab of top researchers and engineers, building the world’s top-ranked realtime voice models.

Today our models are the #1 ranked realtime voice models in the world. They power the largest consumer-facing AI applications available, across categories like health, fitness, learning, therapy, companions, customer experience and media, representing hundreds of millions of end users. Our work spans areas like research and development of state-of-the-art models, optimizing realtime inference, and creating best-in-class APIs and products that allow developers to engage their users.

We’ve raised more than $125M from Lightspeed, Section 32, Kleiner Perkins, Microsoft’s M12 venture fund, Founders Fund, Meta and Stanford, among others. Our technology has powered experiences from companies such as NVIDIA, Microsoft Xbox, Niantic, Logitech Streamlabs, Wishroll, Little Umbrella and Bible Chat. We’ve also been recognized by CB Insights as one of the 100 most promising AI companies globally and have been named one of LinkedIn’s Top 10 Startups in the USA.

Who We're Looking For

A year ago, reliably working agentic systems and sub-second multimodal inference at scale barely existed. Nobody has a decade of experience here. So we're not screening for a resume template — we're looking for strong people from varied backgrounds who learn fast, thrive in ambiguity, and can show us what they've built, broken, and understood.

Experience We Find Useful

You don't need all of this. But you need enough to make a case.

Inference Optimization. Deep understanding of modern serving frameworks and techniques like vLLM or TRT-LLM.
Model Acceleration. Hands-on experience with quantization, distillation, caching strategies, continuous batching, paged attention, and speculative decoding.
High-Performance Systems. Proficiency in C++, CUDA, Rust, or highly optimized Python. You know how to profile code and squeeze every ounce of performance out of NVIDIA GPUs.
Distributed Systems & Scaling. Experience with Kubernetes, Ray, custom load balancing, multi-GPU/multi-node inference, and reliably handling thousands of concurrent connections.

Public work. Non-trivial systems programming projects, open-source contributions to major inference engines, or deep-dive technical write-ups.
Full-cycle ownership. You can take a model from the research team, containerize it, optimize its serving, and ensure it runs reliably in production.
Background. PhD in CS, Physics, Math, or equivalent practical experience building backend or ML systems.
Professional fluency in English (written and spoken) is required, as you will be collaborating daily with our US-based leadership and engineering teams.

Who Thrives Here

You don’t need a roadmap to start walking; you’re comfortable picking a direction and building the map as you go.
You believe engineering isn't finished until it’s shipped and stable. You have a bias for impact over purely theoretical optimizations.
You don't just ship code; you obsess over the why. You’re the first to question an architecture if you think there’s a better way to solve the core latency or throughput problem.
You aren't satisfied with "the PM said so." You thrive on deep context and want to understand the fundamental logic behind every decision we make.

What Working Here Is Like

We hand you unclear problems and expect you to make them clear. We value engineers who say "I don't know yet" and then design the benchmark or prototype that finds out. We treat performance, latency, and reliability as first-class product features, not a box to check before launch. Impact comes before everything else, though we support sharing work and open-source contributions that move the field forward. Your work should be visible. Flat structure, fast iterations, minimal process theater.

Location & Employment

Location: remote within Switzerland
Employment type: Full-time, permanent employment
Hiring model: Employment via Employer of Record (EOR)

Candidates must already have the legal right to work in Switzerland, as visa sponsorship is not available for this role. For candidates interested in relocating to the San Francisco Bay Area in the future, full U.S. visa and relocation support may be available, subject to business needs and applicable legal and work authorization requirements.

Skills Required

PhD in CS, Physics, Math, or equivalent practical experience
Proficiency in C++, CUDA, Rust, or highly optimized Python
Experience with Kubernetes and custom load balancing
Deep understanding of modern serving frameworks like vLLM or TRT-LLM
Hands-on experience with quantization and distillation
Strong written and spoken English fluency

The Company

HQ: Mountain View, CA

58 Employees

Year Founded: 2021

What We Do

Inworld AI is a realtime voice AI research lab and platform. It builds the voice layer for consumer AI, the speech that powers companions, tutors, coaches, customer-service agents, and creative and content applications.

Inworld's flagship product is text-to-speech, top-ranked on the independent Artificial Analysis Speech Arena, and around it the company offers a full realtime voice stack. That stack includes text-to-speech (Realtime TTS-2, with over 100 languages and natural-language voice steering), speech-to-text, and an LLM router that reaches 200+ models from every major provider with no markup. Developers can call any product on its own, for example Realtime TTS through a single API call, or combine speech-to-text, an LLM, and text-to-speech into one live conversation through the Realtime API. Dedicated GPUs are available for the largest workloads.

Inworld serves developers and founders building consumer AI products. As voice quality becomes commoditized, the company's focus is realtime conversation at scale: voice agents that respond in roughly 200 milliseconds for thousands of simultaneous users, at a cost that stays efficient as usage grows.

Inworld is a product-oriented research lab. Its founding team pioneered conversational AI and generative models at API.AI (acquired by Google and renamed Dialogflow), Google, and DeepMind, with expertise spanning language models, speech synthesis, multimodal interaction, and design.

Inworld has raised more than $125M from investors including Lightspeed, Kleiner Perkins, Founders Fund, CRV, Intel Capital, BITKRAFT Ventures, Section 32, Meta, Microsoft's M12, and LG Technology Ventures. The company was one of six selected for the 2022 Disney Accelerator.