Senior SRE Engineer

Posted 3 Days Ago
Hiring Remotely in Taiwan
Remote
Senior level
Fintech • Information Technology • Payments • Financial Services
The Role
Operate and automate large-scale Linux and hybrid cloud infrastructure, manage HPC clusters and storage, build containerized and CI/CD workflows, support on-call production incidents, and develop internal GenAI platform components (RAG, LangChain/Bedrock). Drive end-to-end subsystem ownership and documentation with strong autonomy and self-direction.
Summary Generated by Built In
Responsibilities

Linux Systems & Automation (Core)

- Manage large-scale Linux environments: troubleshooting and root-cause analysis
- Write maintainable, hand-off-ready Bash / Ansible / Python automation
- On-call for infrastructure, CI/CD, and production service incidents

HPC Cluster & Storage

- Operate HPC clusters (Slurm) along with usage analytics, auditing, and monitoring tools
- Maintain and plan storage for compute environments (Lustre, NAS)

Cloud & Hybrid Infrastructure

- Manage multi-cloud environments (AWS, Alibaba Cloud, GCP) with Terraform / AWS CDK
- Build and operate Docker (ECS) / Kubernetes (EKS) environments and their deployment workflows

CI/CD & Developer Experience

- Operate self-hosted GitLab server and Runner fleet
- Operate CI/CD systems and design deployment pipelines for research and other projects

GenAI / Internal Platform

- Build internal AI platforms (LangChain / LangGraph / Bedrock, Elasticsearch RAG)
- Develop MCP servers, chatbots, AI agents, and similar services

Requirements

- **5+ years** of hands-on Linux systems administration and infrastructure operations experience
- Solid Linux internals knowledge (process / memory / filesystem / networking / systemd / cgroup); able to localize issues even without complete logs
- Strong Bash / Shell scripting skills — able to write maintainable scripts that others can pick up
- Programming ability for data processing, CLI tools, and API services; Python proficiency preferred
- Solid storage fundamentals with hands-on experience: RAID levels and rebuild trade-offs, filesystem selection, snapshot and backup planning; NAS / shared storage (NFS / SMB) operations experience
- Experience with at least one major public cloud (AWS / GCP / Alibaba Cloud) and IaC tooling (Terraform / CDK / Ansible)
- Familiar with containerization and orchestration (Docker, Kubernetes)
- CI/CD pipeline design and operations experience (GitLab CI / Jenkins / Airflow)
- Able to own a cross-service subsystem end-to-end: design, implementation, documentation, handoff
- **Strong autonomy**: can drive a problem from discovery through root-cause investigation and decision-making to delivery with minimal supervision; able to make judgment calls with incomplete information and proactively communicate progress, risks, and rationale
- **Self-directed**: does not wait for tickets; identifies problems worth solving and prioritizes them independently

Nice to Have

- HPC scheduler experience (Slurm / PBS / LSF)
- Parallel filesystem operations experience (Lustre / GPFS / BeeGFS)
- Advanced Linux performance analysis (perf, eBPF, ftrace) and kernel parameter tuning
- DB operations experience (MySQL, ClickHouse)
- Low-latency network tuning and cross-datacenter link optimization
- LLM application development (LangChain, RAG, Agent, MCP)
- Self-managed Kubernetes experience (Kubespray, kubeadm)
- GPU server operations (single-node): NVIDIA driver / CUDA toolkit version management, `nvidia-smi` / DCGM monitoring, nvidia-container-toolkit integration, troubleshooting XID / ECC errors and thermal throttling
- Experience or familiarity with integrating GPU resources into Slurm: GRES configuration, cgroup-based GPU isolation, user/job-level resource limits

Skills Required

  • 5+ years hands-on Linux systems administration and infrastructure operations
  • Solid Linux internals knowledge (process, memory, filesystem, networking, systemd, cgroups)
  • Strong Bash / Shell scripting skills
  • Programming ability for data processing, CLI tools, and API services
  • Python proficiency
  • Solid storage fundamentals (RAID, filesystem selection, snapshots, backup planning)
  • NAS / shared storage operations (NFS / SMB)
  • Experience with at least one major public cloud (AWS / GCP / Alibaba Cloud) and IaC tooling (Terraform / CDK / Ansible)
  • Containerization and orchestration experience (Docker, Kubernetes)
  • CI/CD pipeline design and operations experience (GitLab CI / Jenkins / Airflow) and operating self-hosted GitLab/Runner
  • Ability to own cross-service subsystem end-to-end: design, implementation, documentation, handoff
  • On-call for infrastructure, CI/CD, and production service incidents
  • Strong autonomy and self-direction; proactive problem identification and independent prioritization
  • HPC scheduler experience (Slurm / PBS / LSF)

The Company
Taipei City, Xinyi District
224 Employees
Year Founded: 2018

What We Do

Company Information: Kronos Research is a technology and science-driven trading firm. We risk firm capital, trading a broad range of financial instruments and strategies on global markets. The firm is willing to pursue new ideas and markets with conviction, and as such we have been at the forefront of technology and markets, quickly dominating newer markets, for example by applying HFT to cryptocurrency trading. Kronos trades on average more than $5 billion USD per day in crypto volume alone. Our team has grown quickly from 2 people in 2018 to a team of 80+ at our Taipei headquarters and our office in Shanghai, with more locations on the way.

Mission Statement: Together we work on meaningful goals and solve complex problems. Our team is defined by a high-performance culture that values collaboration and meritocracy. We like being challenged and don’t mind putting in the hard work. While the recognition and extremely competitive compensation make it a sweet place to work, we are united fundamentally by our love and curiosity for innovation in financial markets and technology.

The next phase of our company will be about becoming the best place for liquidity and using that position to create positive change. Our ultimate goal is to provide the infrastructure to make investing and trading easier and fairer for people around the world. One day, in any country, people will be able to invest in a Japanese stock, a German hedge fund, or a new cryptocurrency.

Check out our current openings on: https://grnh.se/b00d843a3us
