What you’ll be doing
- Stay on top of prompt engineering techniques, research, and best practices.
- Develop and curate golden datasets for prompt evaluation and regression testing across modalities, ensuring long-term quality control and reproducibility.
- Design, test, and refine prompts to support a wide range of generative AI applications—not limited to chat, but also including audio synthesis, avatar animation, lip-sync alignment, and product image generation.
- Document best practices and create reusable prompt templates to support internal stakeholders, improving prompt consistency, clarity, and alignment across teams.
- Refactor existing prompts to follow best-practice approaches.
- Collaborate with cross-functional teams to integrate AI-driven features into real-world product experiences, ensuring prompts are aligned with user needs, system constraints, and business goals.
- Build and maintain prompt libraries with clear versioning, metadata tagging, and usage patterns to support scalable and reusable development.
- Drive continuous improvement in prompt performance by using both automated metrics and human-in-the-loop evaluation pipelines.
- Contribute to and extend our internal evaluation framework—designing new evaluation flows, creating prompt-specific test cases, and defining metrics tailored to multi-modal output.
You will have
- Bachelor’s or Master’s degree in a STEM or related field.
- Practical experience working with large language models and/or multi-modal generative models (e.g., text-to-audio, text-to-image, video, or avatar generation).
- Familiarity with prompt techniques such as zero-shot, few-shot, chain-of-thought, tool usage, and retrieval-based augmentation.
- Strong analytical and linguistic intuition, with the ability to translate abstract goals into effective machine-readable instructions.
- Deep interest in language and communication systems, and how humans and machines can interact effectively through prompt-based interfaces.
- Ability to create and maintain curated evaluation datasets (“golden sets”) to support ongoing testing and performance benchmarking.
- Strong writing and communication skills, with the ability to explain prompt behavior, rationale, and trade-offs to technical and non-technical audiences.
We’ll be excited if you have
- Hands-on experience with Python or another scripting language of choice.
- Experience with Jupyter Notebooks, or LLM ops tools and libraries such as LangChain, LangFuse, PromptLayer, or vector search systems.
- Experience designing or working within evaluation pipelines, including human and automated evaluations, metric design, and result interpretation.
What We Do
Firework is the world's leading immersive digital transformation and engagement platform with shoppable video, live streaming commerce, and monetization capabilities.
Powering over 600 direct-to-consumer brands, retailers, and media publishers worldwide, Firework brings TikTok-like interactive video experiences to your own websites and apps. We enable customers to create and host native, shoppable video content for engaging product discovery, seamless shopping experiences, and a deeper emotional connection with consumers. The company is backed by IDG Capital, Lightspeed Venture Partners, and GSR Ventures, with over $90 million in capital raised to date and offices in the US (SF and NYC), Toronto, Poland, Slovakia, Brazil, and China.
Why Work With Us
We are a diverse team where everyone belongs. We are creative, curious, and cool in a nerdy way. We believe in growth, results, and in each other and that perfection is a work-in-progress. We are just the right amount of extra and want to change the digital game.