Machine Learning Infrastructure Engineer

Sorry, this job was removed at 03:51 p.m. (CST) on Thursday, Mar 13, 2025
Remote (USA)
Artificial Intelligence • Machine Learning • Software
The Role

About Us:
Arcee.ai is a cutting-edge AI company that empowers enterprises to own their GenAI strategy. We're a team of passionate and innovative engineers, researchers, and industry experts dedicated to pushing the boundaries of AI technology. We're looking for an exceptional Machine Learning Infrastructure Engineer to join our team and help design, develop, and deploy AI-powered solutions that meet the highest standards of quality, reliability, and performance.


Job Summary:

As a Machine Learning Infrastructure Engineer, you will be responsible for designing, developing, and maintaining the infrastructure that powers our machine learning models. You will work closely with data scientists, engineers, and researchers to ensure seamless integration of machine learning models into our production environment. Your expertise will enable us to scale our machine learning capabilities, improve model performance, and reduce time-to-market.


Key Responsibilities:

Design and Implementation:

    • Design and implement scalable, efficient, and reliable machine learning infrastructure (e.g., containerization, orchestration, and cloud services).
    • Develop and maintain infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation, or Google Cloud Deployment Manager.

Model Serving and Deployment:

    • Design and implement model serving platforms (e.g., TensorFlow Serving, AWS SageMaker, or Azure Machine Learning) for efficient model deployment and management.
    • Develop and maintain automated model deployment pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI.
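To give a flavor of the deployment-pipeline work described above, here is a minimal sketch of a promotion gate a CI/CD step might run after offline evaluation (the metric names and thresholds are hypothetical, not Arcee's actual pipeline):

```python
# Minimal sketch of a CI/CD promotion gate for a model artifact.
# A pipeline step like this runs after offline evaluation and decides
# whether the candidate model may be deployed to production.

def should_promote(candidate_metrics: dict, baseline_metrics: dict,
                   min_accuracy: float = 0.90,
                   max_p95_latency_ms: float = 200.0) -> bool:
    """Promote only if the candidate meets the accuracy floor, beats the
    current baseline, and stays within the latency budget."""
    return (
        candidate_metrics["accuracy"] >= min_accuracy
        and candidate_metrics["accuracy"] >= baseline_metrics["accuracy"]
        and candidate_metrics["p95_latency_ms"] <= max_p95_latency_ms
    )

candidate = {"accuracy": 0.93, "p95_latency_ms": 180.0}
baseline = {"accuracy": 0.91, "p95_latency_ms": 150.0}
print(should_promote(candidate, baseline))  # True: better accuracy, within budget
```

In a real pipeline the same check would be wired into a Jenkins or GitLab CI/CD job that fails the build (and blocks the deploy) when the gate returns false.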

Data Engineering:

    • Collaborate with data engineers to design and implement data pipelines that feed machine learning models.
    • Ensure data quality, integrity, and security throughout the data lifecycle.
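The data-quality responsibility above often starts with simple record-level validation at pipeline boundaries; a minimal sketch (the field names and rules here are illustrative, not a real Arcee schema):

```python
# Minimal sketch of a data-quality gate for records feeding a model.
# Field names and validation rules are illustrative only.

REQUIRED_FIELDS = {"text", "label"}

def validate_record(record: dict) -> list[str]:
    """Return a list of quality problems; an empty list means the record is clean."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "text" in record and not str(record["text"]).strip():
        problems.append("empty text")
    if "label" in record and record["label"] not in {"positive", "negative"}:
        problems.append(f"unknown label: {record['label']!r}")
    return problems

print(validate_record({"text": "great product", "label": "positive"}))  # []
print(validate_record({"text": "   ", "label": "positive"}))  # ['empty text']
```

Checks like this would typically run inside the ingestion pipeline so that bad records are quarantined before they ever reach training or serving.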

Monitoring and Optimization:

    • Develop and implement monitoring and logging solutions (e.g., Prometheus, Grafana, or ELK Stack) to track model performance, latency, and system health.
    • Optimize infrastructure resources and model performance using techniques like hyperparameter tuning, model pruning, and knowledge distillation.
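The latency tracking mentioned above ultimately reduces to percentile statistics over request samples, the kind of signal a Prometheus/Grafana dashboard surfaces; a minimal sketch using nearest-rank percentiles (sample values are made up):

```python
# Minimal sketch of computing a p95 latency over recorded request samples,
# the raw signal behind a typical serving-latency dashboard panel.

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) over the recorded samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [12.0, 15.0, 11.0, 250.0, 14.0, 13.0, 12.5, 300.0, 13.5, 12.8]
print(f"p95 latency: {percentile(latencies_ms, 95):.1f} ms")  # p95 latency: 300.0 ms
```

Note how the p95 (300.0 ms) exposes the tail that the median (13.0 ms) hides, which is why serving dashboards track high percentiles rather than averages.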

Collaboration and Communication:

    • Work closely with data scientists, engineers, and researchers to identify infrastructure needs and develop solutions.
    • Communicate technical information effectively to both technical and non-technical stakeholders.

Staying Up-to-Date:

    • Stay current with industry trends, emerging technologies, and best practices in machine learning infrastructure.
    • Participate in conferences, meetups, and online forums to expand knowledge and network with peers.


Ideal Candidate:

Cloud Computing and Infrastructure:

    • Experience with major cloud platforms (AWS, Azure, GCP).
    • Kubernetes expertise for container orchestration.
    • Infrastructure-as-Code (IaC) skills (e.g., Terraform, CloudFormation).

Machine Learning Operations (MLOps):

    • Familiarity with ML model lifecycle management.
    • Experience with ML model serving frameworks (e.g., vLLM, TorchServe, SGLang).
    • Knowledge of model versioning and experiment tracking tools.

Deep Learning and NLP:

    • Strong understanding of transformer architectures and LLMs.
    • Experience with popular deep learning frameworks (e.g., PyTorch).
    • Familiarity with NLP concepts and techniques.

API Development and Management:

    • RESTful API design and implementation.
    • API gateway management and security.
    • Experience with OpenAPI/Swagger specifications.

Performance Optimization:

    • Proficiency in GPU acceleration techniques.
    • Experience with model quantization and pruning.
    • Knowledge of distributed inference and parallel computing.

Programming Languages:

    • Strong Python skills.
    • Familiarity with C++ for low-level optimizations.
    • Shell scripting for automation.
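The quantization skill listed under Performance Optimization can be illustrated with a minimal pure-Python sketch of symmetric int8 weight quantization (illustrative only, not any specific framework's implementation):

```python
# Minimal sketch of symmetric int8 weight quantization: map floats onto
# [-127, 127] with a single shared scale, trading a small rounding error
# for a 4x smaller weight representation.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Quantize floats to int8 with one symmetric per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the int8 representation."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

Production quantization (e.g., in PyTorch or an inference server) adds per-channel scales, zero points, and calibration data, but the core idea is the same scale-and-round step shown here.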



Requirements:

Education:

    • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Experience:

    • 3+ years of experience in machine learning infrastructure, DevOps, or a related field.
    • Experience with cloud providers (e.g., AWS, GCP, or Azure) and containerization (e.g., Docker).

Technical Skills:

    • Proficiency in programming languages like Python, Java, or C++.
    • Experience with machine learning frameworks like TensorFlow, PyTorch, or Scikit-learn.
    • Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
    • Knowledge of container orchestration tools like Kubernetes or Docker Swarm.

Soft Skills:

    • Excellent communication, collaboration, and problem-solving skills.
    • Ability to work in a fast-paced environment and prioritize tasks effectively.


Nice to Have:

Certifications:

    • Cloud provider certifications (e.g., AWS Certified DevOps Engineer or GCP Professional Cloud Developer).
    • Machine learning certifications (e.g., the TensorFlow Developer Certificate).

Experience with:

    • Model serving platforms like TensorFlow Serving or AWS SageMaker.
    • Automated model deployment pipelines using tools like Jenkins or GitLab CI/CD.
    • Monitoring and logging solutions like Prometheus or ELK Stack.

Knowledge of:

    • Model explainability and interpretability techniques.
    • Data privacy and security best practices.


What We Offer:

  1. Competitive Salary: A salary commensurate with experience and industry standards.
  2. Stock Options: Equity in Arcee.ai to give you a stake in our success.
  3. Comprehensive Benefits: Health, dental, and vision insurance, as well as 401(k).
  4. Professional Development: Opportunities for growth, training, and conference attendance.
  5. Collaborative Environment: A dynamic, diverse team that values innovation and open communication.


The Company
HQ: San Francisco, California
48 Employees
Year Founded: 2023

What We Do

Arcee AI delivers purpose-built AI agents, powered by industry-leading small language models (SLMs), for enterprise applications. Its flagship offering, Arcee Orchestra, is an end-to-end agentic AI solution that lets businesses create AI agents for complex tasks: it makes it easy to build custom AI workflows that automatically route tasks to specialized SLMs, delivering detailed, trustworthy responses quickly.
