Lead the design and development of AI-native Redis-family core systems, including distributed KV caching, persistent memory storage, LLM KV cache infrastructure, and AI-aware memory services. Build planet-scale reliability through high availability, disaster recovery, multi-AZ/multi-region strategies, and large-scale stability engineering for always-on workloads. Architect and optimize multi-tier memory systems (in-memory / SSD / shared storage) to reduce read/write amplification and improve tail latency under extreme concurrency. Build a production-grade ecosystem with automated orchestration, monitoring, incident response runbooks, and capabilities such as bulkload, backup & restore, point-in-time recovery, and tiered storage; explore new hardware and technologies (ZNS SSD, io_uring, RDMA/CXL) and AI+DB directions in production.
About the Team

Join ByteDance's Redis Family team, where we build and operate AI-native distributed KV caching and memory systems powering ByteDance's global infrastructure. Beyond traditional caching, we are evolving toward a unified Memory Infrastructure Layer that supports high-performance Redis-compatible KV systems, persistent and tiered storage engines, LLM KV Cache acceleration infrastructure, and AI-aware memory services. Our systems serve mission-critical scenarios at massive scale (recommendation, search, ads, e-commerce, messaging, live streaming, and emerging AI-native applications) with strict requirements on availability, latency, throughput, global deployment, and cost efficiency.

Responsibilities

- Design and develop next-generation Redis Family core systems, including distributed KV caching, persistent memory storage, LLM KV cache infrastructure, and AI-aware memory services.
- Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
- Architect and optimize multi-tier memory systems (in-memory / SSD / shared storage), reducing read/write amplification and improving tail latency under extreme concurrency.
- Build a production-grade ecosystem, including automated orchestration (provisioning, scaling, placement, scheduling) and monitoring (tracing, profiling, incident response runbooks).
- Implement and evolve capabilities such as bulkload, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich the data ecosystem.
- Research new hardware and emerging technologies, evaluating and landing improvements with ZNS SSD, io_uring, RDMA/CXL, and "AI+DB" directions in production.