Senior Platform Engineer - Scalability focus

Reposted 18 Days Ago
Noida, Gautam Buddha Nagar, Uttar Pradesh
Hybrid
5-5 Annually
Senior level
Machine Learning • Natural Language Processing
The Role
Lead the design and optimization of AI/ML platform architecture, develop automation pipelines, mentor junior engineers, and conduct incident management.
As a trusted global transformation partner, Welocalize accelerates the global business journey by enabling brands and companies to reach, engage, and grow international audiences. Welocalize delivers multilingual content transformation services in translation, localization, and adaptation for over 250 languages with a growing network of over 400,000 in-country linguistic resources. Driving innovation in language services, Welocalize delivers high-quality training data transformation solutions for NLP-enabled machine learning by blending technology and human intelligence to collect, annotate, and evaluate all content types. Our team works across locations in North America, Europe, and Asia serving our global clients in the markets that matter to them. www.welocalize.com

To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The requirements listed below are representative of the knowledge, skill, and/or ability required. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

Job Reference: #LI-JC1

Role Summary:

We are seeking a Senior Scalability Engineer to design and optimize platforms capable of supporting the significant growth of AI/ML workloads. This role is focused on ensuring the scalability, reliability, and efficiency of AI/ML infrastructure while contributing to the development of robust, high-performance systems. The ideal candidate will collaborate with cross-functional teams to build resilient infrastructure and implement solutions that ensure seamless model deployment, monitoring, and lifecycle management at scale.

Key Responsibilities:

Platform Scalability: Design and implement scalable solutions for AI/ML infrastructure, enabling horizontal scaling, efficient resource utilization, and fault tolerance under high-demand scenarios.

Stability & Reliability: Apply best practices for platform stability, high availability, and disaster recovery, ensuring uninterrupted operations during peak workloads.

Observability & Monitoring: Build and maintain advanced observability frameworks, including monitoring, logging, and tracing solutions, leveraging tools like Datadog.

Automation & Efficiency: Develop automation pipelines for infrastructure provisioning, deployment, and operational workflows to minimize manual intervention and maximize efficiency.

Cross-Functional Collaboration: Work closely with data science, product, and engineering teams to align infrastructure capabilities with organizational goals and ensure seamless model deployment, testing, and lifecycle management.

Cost Optimization: Implement strategies to optimize cloud resource usage and manage platform costs effectively while maintaining performance and reliability.

Incident Response: Participate in incident response efforts, including post-mortems and root cause analyses, to improve platform resilience and prevent recurring issues.

Continuous Improvement: Stay current with industry trends in cloud infrastructure, distributed systems, and observability, applying innovative solutions to enhance platform scalability and performance.

Qualifications:

Educational Background: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Experience: 5+ years of experience in AI/ML platform engineering, infrastructure, or operations.
Proven track record of designing, scaling, and maintaining large, distributed systems with a focus on scalability, stability, and performance.

Technical Expertise:
Expertise in cloud infrastructure (AWS, GCP, Azure) and infrastructure-as-code tools (Terraform, CloudFormation, etc.).
Strong programming skills in Python and Node.js, with experience building scalable, maintainable systems.
Deep understanding of observability practices, including distributed tracing, log aggregation, and real-time monitoring.

Scalability & Reliability:
Proven ability to design scalable architectures and implement solutions for automated failover and disaster recovery.
Experience in optimizing performance and resource utilization for high-demand environments.

Communication & Collaboration:
Strong communication skills, capable of articulating technical concepts to both technical and non-technical stakeholders.
Ability to collaborate effectively with cross-functional teams to deliver integrated solutions.

Problem-Solving Skills:
Excellent problem-solving skills and the ability to address complex technical challenges in a fast-paced environment.

Cost Optimization: Experience with cost management strategies for cloud-based platforms, with a focus on maintaining an optimal balance between performance and cost.

Top Skills

AWS
Azure
Docker
GCP
Node.js
Python
PyTorch
TensorFlow
Terraform

The Company
HQ: New York, NY
2,331 Employees
Year Founded: 1997

What We Do

Welocalize accelerates the global business journey by enabling brands and companies to reach, engage, and grow international audiences. Welocalize delivers multilingual content transformation services in translation, localization, and adaptation for over 250 languages with a growing network of over 250,000 in-country linguistic resources. Driving innovation in language services, Welocalize delivers high-quality training data solutions for NLP-enabled machine learning by blending technology and human intelligence to collect, annotate, and evaluate all content types. Our people work across offices in North America, Europe, and Asia serving our global clients in the markets that matter to them.

• Global team of 2,100+
• Offices in North America, Europe and Asia
• Quality Certifications: ISO 9001:2015, ISO/IEC 27001:2013, ISO 17100:2015, ISO 13485:2016, ISO 18587:2017
• Accredited professional translators and interpreters for 250+ languages

www.welocalize.com
