LOG IN
SIGN UP
Tech Job Finder - Find Software, Technology Sales and Product Manager Jobs.
Sign In
OR continue with e-mail and password
E-mail address
Password
Don't have an account?
Reset password
Join Tech Job Finder
OR continue with e-mail and password
E-mail address
First name
Last name
Username
Password
Confirm Password
How did you hear about us?
By signing up, you agree to our Terms & Conditions and Privacy Policy.

Deep Learning Kernel Software Performance Architect

at Nvidia

Back to all Python jobs
N
Industry not specified

Deep Learning Kernel Software Performance Architect

at Nvidia

Tech LeadNo visa sponsorshipPython

Posted 9 hours ago

No clicks

Compensation
Not specified

Currency: Not specified

City
Shanghai
Country
China

NVIDIA seeks a Software Performance Architect to optimize GPU kernel performance for data-center platforms and build automated, data-driven workflows to detect and explain performance regressions. The role involves performance analysis, debugging and driving fixes with kernel, compiler, and infrastructure teams, plus designing scalable regression automation and reporting pipelines. You will convert raw run outputs into actionable insights, maintain CI/CD/ regression workflows, and collaborate across teams to ensure performance checks are practical and release-aligned. Strong Python and C/C++ skills and a solid background in computer architecture and performance reasoning are essential.

NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.

What you'll be doing:

  • Performance analysis + debugging

    • Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.

    • Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.

    • Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.

  • Automation + regression infrastructure (Python-heavy)

    • Develop and maintain Python-based automation for performance testing and analysis—using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.

    • Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.

    • Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.

  • Cross-team collaboration and operating model

    • Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.

    • Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.

    • Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently

  • Following general software engineering best practices including support for regression testing and CI/CD flows

What we need to see:

  • Masters or PhD degree or equivalent experience in Computer Science, Computer Engineering, Applied Math, or related field

  • Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging)

  • Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).

  • Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.

  • Comfortable working across teams and driving issues to decision/closure with clear communication

  • Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design

  • Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g. with OpenMP or pthreads)

  • Solid understanding of computer architecture and some experience with assembly programming

  • Identify bottlenecks, optimize resource utilization, and improve throughput

Ways to stand out from the crowd:

  • Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)

  • Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics

  • GPU programming/perf experience (CUDA or equivalent parallel programming)

  • Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)

  • Familiarity with simulators/analytical modeling or performance characterization methodology

Deep Learning Kernel Software Performance Architect

at Nvidia

Back to all Python jobs
N
Industry not specified

Deep Learning Kernel Software Performance Architect

at Nvidia

Tech LeadNo visa sponsorshipPython

Posted 9 hours ago

No clicks

Compensation
Not specified

Currency: Not specified

City
Shanghai
Country
China

NVIDIA seeks a Software Performance Architect to optimize GPU kernel performance for data-center platforms and build automated, data-driven workflows to detect and explain performance regressions. The role involves performance analysis, debugging and driving fixes with kernel, compiler, and infrastructure teams, plus designing scalable regression automation and reporting pipelines. You will convert raw run outputs into actionable insights, maintain CI/CD/ regression workflows, and collaborate across teams to ensure performance checks are practical and release-aligned. Strong Python and C/C++ skills and a solid background in computer architecture and performance reasoning are essential.

NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.

What you'll be doing:

  • Performance analysis + debugging

    • Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.

    • Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.

    • Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.

  • Automation + regression infrastructure (Python-heavy)

    • Develop and maintain Python-based automation for performance testing and analysis—using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.

    • Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.

    • Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.

  • Cross-team collaboration and operating model

    • Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.

    • Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.

    • Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently

  • Following general software engineering best practices including support for regression testing and CI/CD flows

What we need to see:

  • Masters or PhD degree or equivalent experience in Computer Science, Computer Engineering, Applied Math, or related field

  • Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging)

  • Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).

  • Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.

  • Comfortable working across teams and driving issues to decision/closure with clear communication

  • Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design

  • Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g. with OpenMP or pthreads)

  • Solid understanding of computer architecture and some experience with assembly programming

  • Identify bottlenecks, optimize resource utilization, and improve throughput

Ways to stand out from the crowd:

  • Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)

  • Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics

  • GPU programming/perf experience (CUDA or equivalent parallel programming)

  • Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)

  • Familiarity with simulators/analytical modeling or performance characterization methodology

SIMILAR OPPORTUNITIES

No similar jobs available at the moment.