
Job Opportunity

Open Position

Senior HPC Systems Engineer

Support the computing infrastructure behind the AI Computing Resource (AICR), which serves the Massachusetts AI Hub. This hands-on role is responsible for deploying, maintaining, and optimizing HPC clusters, storage systems, and networking for AI/ML workloads. Join a collaborative, fast-paced team delivering critical infrastructure to some of the nation’s leading AI researchers.

Apply for this job (#25342)

Pay Range

$120,440 - $163,200

Position Overview

As a technical leader, the Senior HPC Systems Engineer will manage and optimize high-performance computing environments, ensure the smooth operation of complex clusters, and provide excellent support to end users. The role is responsible for the installation, configuration, maintenance, and troubleshooting of HPC systems and infrastructure, and collaborates with researchers, engineers, and other IT professionals to keep the HPC environment secure, efficient, and reliable.

Principal Responsibilities

  • Manage the infrastructure, support, maintenance, and operation of systems, hardware, and software.
  • Install, configure, and maintain HPC hardware and software, including clusters, storage systems, and job schedulers.
  • Monitor system performance, identify bottlenecks, and implement optimizations to improve efficiency and resource utilization.
  • Provide technical support to end-users, including troubleshooting hardware and software issues.
  • Perform system backups and disaster recovery procedures to ensure data integrity and availability.
  • Develop and maintain system documentation, including installation guides, configuration files, and standard operating procedures.
  • Assist in the design and deployment of scalable HPC environments to meet the needs of scientific and engineering applications.
  • Collaborate with teams to identify and implement software and hardware upgrades.
  • Ensure security of HPC systems by applying patches, configuring firewalls, and performing regular audits.
  • Conduct performance benchmarking, diagnostics, and capacity planning to keep systems up to date with evolving needs.
  • Perform other duties as required.

Supervision Received

  • This position reports to the Executive Director, AI Computing Resource (AICR)

Supervision Exercised

  • None

Employment Type

  • Full-Time, Hybrid (primarily remote with occasional on-site)

Qualifications & Skills

Required

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
  • Minimum of 7 years of relevant experience.
  • Proven experience as an HPC Systems Engineer or in a similar role in high-performance computing environments.
  • In-depth knowledge of HPC architectures, Linux/Unix operating systems, and cluster management tools.
  • Experience with parallel programming, MPI, job schedulers, and batch processing systems.
  • Strong knowledge of storage systems (e.g., Lustre, GPFS) and network file systems.
  • Experience with containerization, container orchestration, and virtualization technologies (e.g., Docker, Kubernetes).
  • Strong troubleshooting skills.
  • Excellent communication skills, with the ability to interact with technical and non-technical stakeholders.

Preferred

  • Experience with cloud-based HPC environments (e.g., AWS, Google Cloud).
  • Familiarity with GPU-based computing and relevant software (e.g., CUDA).
  • Knowledge of programming languages such as Python, Bash, or Perl.
  • Experience with monitoring and alerting tools (e.g., Nagios, Prometheus).
  • Experience with networking and network management tools.

100 Bigelow Street, Holyoke, MA 01040