BJAK

DevOps Engineer - Platform Reliability (Remote, China)

Hiring Remotely in China

Remote

Mid level

Artificial Intelligence • Fintech • Software • Financial Services

The Role

Own and improve platform reliability for BJAK's AI automation systems: manage cloud infrastructure, CI/CD, deployments, observability, incident response, redundancy and security to ensure high availability and safe rollouts across services.

BJAK’s automation systems support customer journeys across quote generation, policy issuance, claims, payments, renewals and insurer integrations. These systems are business-critical, so reliability, uptime and safe deployments directly affect customers and operations.

We're looking for a DevOps Engineer based in China to strengthen platform reliability, improve infrastructure resilience and ensure BJAK’s AI automation systems run safely and consistently at scale.

This is a fully remote position where you'll collaborate closely with our Malaysia-based engineering, product and operations teams to build and maintain highly reliable production systems.

The Mission

Build and maintain a highly reliable platform for BJAK’s AI automation systems by improving infrastructure stability, deployment safety and operational resilience across all services.

What You’ll Own

Own and improve platform reliability across production systems and environments.
Manage cloud infrastructure, deployment pipelines and runtime environments.
Design and improve CI/CD workflows to enable safe, fast and repeatable releases.
Build and enhance monitoring, alerting, logging and system observability.
Lead incident response efforts and perform structured root cause analysis.
Improve system resilience through redundancy, failover and recovery mechanisms.
Work with engineering teams to reduce production risk through better deployment and system design practices.
Strengthen infrastructure security, access control and secrets management.
Support reliability for business-critical workflows across multiple countries and services.
Continuously improve operational discipline, uptime and system stability.

What We're Looking For

Experience in DevOps, SRE, platform engineering or infrastructure-focused roles.
Strong understanding of cloud infrastructure, CI/CD pipelines and deployment systems.
Experience with production monitoring, alerting and incident management practices.
Ability to troubleshoot infrastructure and production issues in a structured and calm manner.
Strong understanding of reliability engineering principles (availability, fault tolerance, recovery).
Experience supporting business-critical or high-availability systems.
Strong ownership mindset during incidents and operational failures.
Practical judgment on reliability, performance, security and cost trade-offs.
Comfortable working closely with engineering teams in fast-paced environments.
Low ego, disciplined and focused on long-term system stability.

Bonus Points

Experience with AWS, GCP, Azure or similar cloud platforms.
Experience with Kubernetes, Docker or container orchestration.
Experience with infrastructure-as-code tools (Terraform, Ansible, Pulumi, etc.).
Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
Experience with zero-downtime deployments, blue-green or canary release strategies.
Experience supporting distributed or high-traffic production systems.
Strong knowledge of security best practices in cloud infrastructure.
Experience in fintech, insurance or regulated industry environments.
Contributions to platform reliability or infrastructure scaling initiatives.

The Kind of Builder We Want

Calm and structured under pressure, especially during production incidents.
Hands-on with infrastructure and deeply familiar with production systems.
Thinks in failure modes, system risks and recovery paths.
Proactive in preventing incidents, not just reacting to them.
Strong focus on uptime, reliability and operational discipline.
Careful and deliberate when making production changes.
Builds systems engineers can trust to deploy and operate safely.

This Role Is Not For

People who only react after systems fail instead of preventing failures.
Engineers who are careless with production changes or access control.
Individuals who ignore monitoring, alerting or operational discipline.
People who make risky infrastructure changes without proper evaluation.
Candidates who cannot stay calm during incidents or outages.

Success in This Role

You'll be successful if you can:

Improve platform uptime, stability and deployment safety.
Reduce production incidents and infrastructure-related failures.
Strengthen monitoring, alerting and system visibility across services.
Enable engineers to deploy with confidence and lower operational risk.
Improve resilience of BJAK’s AI automation platform as it scales.

Why Join BJAK

Build Reliable AI Platform Infrastructure – Support systems powering end-to-end insurance automation.
High-Impact Engineering – Solve real-world reliability and scaling challenges.
Global Engineering Team – Work with experienced engineers across multiple countries.
Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
International Exposure – Build systems used across Southeast Asian markets.
Learning & Development Budget – Support continuous technical growth and certifications.
High Ownership Environment – Strong autonomy over infrastructure and reliability strategy.
Modern Engineering Culture – Focus on stability, observability and engineering excellence.
Competitive Compensation – Attractive salary package based on experience and impact.

Interview Process

We assess infrastructure depth, reliability thinking and production problem-solving ability. The process usually includes application review, two interviews and a technical scenario or systems discussion.

Skills Required

Experience in DevOps, SRE, platform engineering or infrastructure-focused roles.
Strong understanding of cloud infrastructure and deployment systems.
Experience designing and managing CI/CD pipelines and deployment workflows.
Experience with production monitoring, alerting and incident management practices.
Ability to troubleshoot infrastructure and production issues in a calm, structured manner.
Strong understanding of reliability engineering principles (availability, fault tolerance, recovery).
Experience supporting business-critical or high-availability systems.
Strong ownership mindset during incidents and operational failures.
Practical judgment on reliability, performance, security and cost trade-offs.
Comfortable working closely with engineering teams in fast-paced environments.
Calm and structured under pressure during production incidents.
Knowledge of infrastructure security, access control and secrets management.
Experience with AWS, GCP or Azure.
Experience with Kubernetes and Docker (container orchestration).
Experience with infrastructure-as-code tools such as Terraform, Ansible or Pulumi.
Experience with observability stacks like Prometheus, Grafana, ELK or Datadog.
Experience with zero-downtime deployments or release strategies (blue-green, canary).
Experience supporting distributed or high-traffic production systems.
Experience in fintech, insurance or regulated industry environments.

The Company

HQ: Petaling Jaya

253 Employees

Year Founded: 2019

What We Do

Our mission is to develop technology-based solutions to improve financial inclusion. We develop new and innovative platforms and services globally. For example, we are the first platform to simplify and digitise comprehensive life and medical insurance, supported by an AI agent. BJAK is the largest insurance platform in Southeast Asia. If you enjoy building cutting-edge platform ecosystems that give everyone equal access to financial services at scale, join us.