AI Infrastructure Engineer
Job Description
Responsibilities
1. Design and plan compute systems for AI and large-scale model workloads, including GPU clusters, distributed training platforms, and high-performance computing (HPC) architectures.
2. Evaluate and select GPUs (e.g., NVIDIA, AMD, Huawei Ascend, domestic AI accelerators) based on business needs, and propose optimal hardware configurations.
3. Design large-scale training cluster topologies—covering compute, storage, networking, and cooling—to balance performance, power efficiency, and cost.
4. Participate in building data center–level compute resource pools, including GPU virtualization, containerization, and scheduling optimization.
5. Track cutting-edge GPU and AI accelerator technologies and propose timely system upgrades and optimizations.
6. Collaborate with R&D teams to plan and tune compute resources for large-model training and inference.
7. Author and deliver design documents, technical specifications, and implementation plans.
Qualifications
· Bachelor’s degree or above in Computer Science, Electrical Engineering, Communications, or related fields.
· Deep understanding of mainstream GPU architectures and ecosystems (e.g., CUDA, ROCm, Ascend CANN); familiarity with performance characteristics and use cases of multiple GPU product lines.
· Experience in compute planning for training/inference of large AI models (e.g., Transformer family, LLMs, Diffusion models).
· Familiarity with HPC architectures, distributed training frameworks (e.g., Megatron-LM, DeepSpeed, Horovod), and high-performance interconnects (e.g., InfiniBand, NVLink, PCIe Gen5).
· Understanding of data center power, cooling, and facility planning; hands-on experience in compute cluster deployment or optimization is a plus.
· Strong system architecture skills with the ability to balance performance, cost, and energy consumption.
· Excellent cross-team communication skills, able to collaborate effectively with R&D, operations, and supply chain teams.
Preferred Qualifications
· Experience designing or operating ultra-large-scale GPU clusters (1,000+ GPUs).
· Participation in large-scale model training projects with practical insights into compute bottlenecks and optimization techniques.
· Familiarity with domestic GPUs/AI accelerators and their ecosystems.
· Proven track record in HPC, cloud computing, or data center architecture design.
Interested parties, please send your full resume with current and expected salary by email to hr@cmi.chinamobile.com, indicating the reference number in the subject line.
All personal data provided will be used for consideration of your job application only.
| Job Category | |
| Work Location | Not specified |