HPC Systems Engineer - Python/Kubernetes

Advanced Micro Devices, Inc.
$157,040.00/Yr.-$235,560.00/Yr.
United States, Texas, Austin
7171 Southwest Parkway (Show on map)
Apr 16, 2025
WHAT YOU DO AT AMD CHANGES EVERYTHING We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. AMD together we advance_ THE TEAM AMD's Data Center GPU organization is transforming the industry with our AI based Graphic Processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise Data Centers, (AI) Artificial Intelligence, HPC and Embedded systems. If this resonates with you, come and joining our Data Center GPU organization where we are building amazing AI powered products with amazing people. THE ROLE: We are seeking an experienced HPC Systems Engineer with 7+ years of expertise in high-performance computing (HPC) environments. This role requires hands-on experience with Python, Kubernetes (K8s), Slurm, OpenStack, and Ansible, along with the ability to support external clients in live troubleshooting sessions. The PERSON: The ideal candidate will have deep technical knowledge of drivers, troubleshooting methods, and system-level debugging and will play a key role in managing, optimizing, and troubleshooting HPC clusters and cloud-based HPC environments. KEY RESPONSIBILITIES: HPC System Administration & Troubleshooting: Manage and optimize HPC clusters ensuring high availability and performance. Debug and resolve issues related to drivers, firmware, hardware failures, and network configurations. Troubleshoot storage, networking, and OS-level performance bottlenecks in both bare-metal and virtualized HPC setups. Cloud & Virtualized HPC Environments: Deploy and manage OpenStack-based HPC clusters for virtualized and hybrid HPC workloads. Optimize cloud infrastructure for scalability, workload orchestration, and performance. Work with Ceph, Cinder, Neutron, and Nova for cloud-based storage, networking, and compute management. Software Stack & Automation Management: Maintain and optimize Slurm-based job scheduling in HPC environments. Deploy and manage Kubernetes-based workloads for HPC and AI/ML applications. Automate provisioning, configuration, and monitoring of HPC resources using Ansible and Terraform. Develop and maintain Python scripts for automation, monitoring, and log analysis. Driver, Firmware & OS-Level Troubleshooting: Work with GPU, CPU, and network drivers, ensuring proper configuration and performance. Update and troubleshoot firmware issues across compute nodes, accelerators, and network switches. Ensure compatibility between OS, ROCm, CUDA, OpenMPI, UCX, and other HPC-related software stacks. Live Client Support & Troubleshooting: Provide real-time technical support to external customers and clients. Diagnose and resolve issues efficiently through live interactions and remote sessions. Document solutions and contribute to a knowledge base for future reference. System Integration, Storage, & Performance Optimization: Tune OS parameters, networking configurations, and Slurm scheduler settings for optimal performance. Benchmark and optimize multi-node, distributed computing applications. Implement best practices for containerized HPC workloads in Kubernetes and OpenStack. Work with high-speed interconnects (Infiniband, Slingshot, RoCE, etc.) for performance tuning. Manage and optimize local, distributed, and cloud-based storage solutions (Ceph, NFS, Lustre, BeeGFS). Automation & Infrastructure as Code (IaC): Implement Ansible playbooks for automated provisioning and configuration management. Optimize continuous integration and deployment (CI/CD) pipelines for HPC applications. Collaboration & Documentation: Work closely with developers, DevOps, and system administrators to maintain a stable environment. Create detailed documentation for troubleshooting guides, SOPs, and best practices. Contribute to internal wikis, training materials, and technical blogs. PREFERRED QUALIFICATIONS: Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM. Exposure to containerized workloads using Singularity or Docker in HPC. Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible). Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues. This role is critical in ensuring seamless HPC operations, troubleshooting complex system issues, and supporting high-profile clients with real-time problem resolution in both bare-metal and cloud-based HPC environments. ACADEMIC CREDENTIALS: Bachelor or Masters Degree in Computer Engineering or Electrical Engineering LOCATION - Austin, TX #LI-RW1 #LI-HYBRID At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD's Employee Stock Purchase Plan. You'll also be eligible for competitive benefits described in more detail here. AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.