Senior Software Engineer, Cloud Reliability

Reposted 16 Days Ago
Redwood City, CA, USA
Hybrid
175K-225K Annually
Senior level
Artificial Intelligence • Machine Learning • Database
The Role
The role involves ensuring the reliability and performance of distributed database systems, developing monitoring strategies, and automating operations in a cloud-native environment.
Summary Generated by Built In

Zilliz is a fast-growing startup developing the industry’s leading vector database for enterprise-grade AI. Founded by the engineers behind Milvus, the world’s most popular open-source vector database, the company builds next-generation database technologies to help organizations quickly create AI applications. On a mission to democratize AI, Zilliz is committed to simplifying data management for AI applications and making vector databases accessible to every organization.


We're entering our next phase of 10x growth: more customers, larger datasets, and far higher expectations for reliability. You'll join a small, fast-moving Cloud Platform team that operates large-scale, multi-cloud, distributed database systems in production. This is a high-ownership role for engineers who want to move fast, build automation instead of toil, and take real responsibility for production stability.

What you will do:

  • Own the reliability, availability, and production stability of Zilliz Cloud as we scale through the next stage of growth
  • Debug complex production issues across Kubernetes, cloud infrastructure, networking, storage, and distributed database systems
  • Build automation and diagnostic tooling (log analysis, alert correlation, incident investigation, runbook automation, and remediation workflows) so problems get solved once, not repeatedly
  • Turn recurring incidents into reusable tools, automation, documentation, and product improvements
  • Improve observability across latency, availability, throughput, and resource efficiency
  • Partner with database and infrastructure engineers to make Zilliz Cloud more reliable, scalable, and automated

What we are looking for:

  • 3+ years building or operating production cloud systems, infrastructure platforms, database systems, or large-scale online services
  • Bachelor's degree in Computer Science, Software Engineering, or a related field, or equivalent practical experience
  • Strong hands-on experience with Kubernetes, Docker, and at least one major cloud platform (AWS, GCP, or Azure)
  • Solid understanding of distributed systems: availability, scalability, performance, failure recovery, and operational tradeoffs
  • Experience with distributed databases, storage systems, search systems, or large-scale online systems is a strong plus
  • Experience operating highly multi-tenant systems or large infrastructure fleets (thousands of nodes, clusters, tenants, or customer deployments) is especially valuable
  • Familiarity with modern cloud operations tooling such as Terraform, Helm, Argo CD, Prometheus, Grafana, and CI/CD systems
  • Strong bias for action, and the drive to thrive in a fast-paced, rapidly scaling environment

How we operate:

  • High ownership: You own production reliability end-to-end. The whole system, not a slice of it. High autonomy, high trust, minimal process.
  • Fast and focused: We ship often and keep a high bar. This team suits engineers who want velocity and a steep growth curve over red tape.
  • Globally distributed: We work closely with our core engineering teams across APAC. Expect occasional early-morning or evening syncs, in exchange for an on-call setup designed around timezone coverage rather than overnight pages.

Zilliz is an Equal Opportunity Employer and welcomes people from all backgrounds, experiences, abilities, and perspectives. All qualified applicants will receive consideration for employment regardless of race, color, national origin, religion, sexual orientation, gender, gender identity, age, physical disability, or length of time spent unemployed.

Skills Required

  • 4+ years of experience in site reliability engineering or similar roles
  • Proficiency in programming languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems
  • Bachelor's degree or above in computer science or software engineering
  • Experience with the open-source Milvus vector database is nice to have

The Company
HQ: Redwood City, CA
75 Employees
Year Founded: 2017

What We Do

Zilliz is a leading vector database company for production-ready AI. Built by the engineers who created Milvus, the world's most popular open-source vector database, Zilliz is on a mission to unleash data insights with AI. The company builds next-generation database technologies to help organizations rapidly create AI/ML applications, and unlock the potential of unstructured data. By taking the burden of complex data infrastructure management off of its users, Zilliz is committed to bringing the power of AI to every corporation, every organization, and every individual. Headquartered in San Francisco, Zilliz is backed by a number of prestigious investors, including Aramco's Prosperity7 Ventures, Temasek's Pavilion Capital, Hillhouse Capital, 5Y Capital, Yunqi Partners, Trustbridge Partners and others. Zilliz's technologies and products help over 1000 organizations worldwide easily create AI applications in various scenarios, including computer vision, image retrieval, video analysis, NLP, recommendation engines, targeted ads, customized search, smart chatbots, fraud detection, network security, new drug discovery, and much more. Learn more at zilliz.com or follow @zilliz_universe.
