TPM GPU Operations Standards and Quality
Microsoft
United States, Washington, Redmond
Oct 29, 2025
Overview
Microsoft's Cloud Operations & Innovation (CO+I) powers cloud services by ensuring datacenter availability and operational continuity. The Global IT Service Transition team standardizes processes so new sites and IT support teams can achieve Day 1 Operational Readiness efficiently. The Technical Program Manager (TPM) for GPU Operations Standards and Quality leads the development and enforcement of operational standards, quality assurance, and readiness for GPU deployments. This role partners across engineering, supply chain, and operations to ensure GPU deployments meet regulatory, security, and performance standards, enabling scalable and reliable operations.
Responsibilities
Align with Microsoft's culture, objectives and Datacenter Operational policies and standards.
Deliver a best-in-class, new service transition and onboarding program to achieve site & operational readiness.
Define and implement GPU operational standards across deployment, servicing, and lifecycle management.
Drive cross-functional programs to define, implement, and validate GPU compliance standards across global datacenter environments.
Partner with engineering, supply chain, and operations teams to ensure GPU hardware and software configurations meet internal and external compliance requirements.
Lead risk assessments and mitigation strategies related to GPU deployments, including site operational readiness and scaled growth.
Develop and maintain documentation for GPU standards, audit procedures, and compliance tracking.
Represent CO+I in industry forums and regulatory engagements related to GPU infrastructure.
Establish KPIs and reporting mechanisms to monitor compliance health and drive continuous improvement.
Lead quality assurance initiatives to ensure compliance with performance, reliability, and safety benchmarks.
Develop and maintain readiness scorecards and validation frameworks for GPU infrastructure.
Coordinate cross-functional efforts across hardware, serviceability, and tooling teams.
Manage escalations, fault code governance, and exception handling for GPU-related incidents.
Drive continuous improvement through data-driven insights and stakeholder feedback.
Evolve operational excellence with key focus areas of risk management, uptime availability and safety.
Build strong working relationships and engagement with our Engineering, Procurement & Construction (EPC) teams, support and tooling partners.
Establish operational representation through design, build, commissioning and turnover project phases, as required.
Create an environment to promote learning and innovation opportunities.