HPC Cluster Engineer II

Lisboa, Portugal | Other | Full-time | Partially remote

Overview:

A multinational conglomerate holding company with a special interest in Smart & Autonomous Cars, Mobility & Connected Services, Smart City & Energy Systems, Aerospace & Defence Innovation, and Venture Building & Business Scaling Operations. The Group is headquartered in Brussels, with a presence in Japan, Greece, Cyprus and Portugal.

What will you do?

  • Administer the HPC cluster for Computer Aided Engineering (CAE) and the render cluster
  • Maintenance of in-house shell scripts
  • Investigate failed computations, determine problems, resolve incidents, provide system support and coordinate with vendors
  • L1/L2 support on the HPC cluster for the customer
  • Maintain applications running on the cluster
  • Manage network aspects (DNS, DHCP, internet access, …) with the Network Team
  • Perform daily monitoring, and ensure cluster high availability
  • Manage patching and upgrade of the managed environment
  • Monitor regular backups
  • Establish centralized, long-term environment management
  • Collaborate with other technical teams when required

Provide support when necessary for the customer’s project:

  • HPC Cluster migration to AWS Cloud

Secondary Tasks:

Support the customer when needed on the following (out of maintenance scope):

  • Maintain other servers such as: ECU compiler server; Terrace server - data synchronization support

As backup for other team members:

  • Administer the Linux-based GPU HPC cluster for Artificial Intelligence (AI), the VRED rendering cluster and the HPC cluster for Computer Aided Engineering (CAE)
  • Manage patching of Linux systems, including offline systems
  • Install, configure and tune hardware, OS and software for all R&D Linux workstations
  • Support artificial intelligence engineers in setting up development environments on the GPU HPC cluster
  • Support the setup of a driving simulator based on a real-time OS
  • Ensure the Linux environment matches company security standards

What are we looking for?

  • Linux OS and server knowledge
  • Cluster management
  • Infrastructure administration
  • Virtualization knowledge
  • Understanding and operation of storage solutions
  • CAE application knowledge

Keywords that are important in HPC systems:

  • Workload manager: Slurm, PBS
  • Parallel file system: Lustre, Ceph, BeeGFS
  • HPC management tools: Bright or NVIDIA, xCAT
  • AI keywords: GPU, Docker, Python
  • OS: RHEL, Ubuntu, Rocky Linux

What can you expect from us?

  • A permanent job contract for a long-term project;
  • Tech equipment + SIM Card + personal smartphone;
  • Health and Life Insurance;
  • Social events and team buildings;
  • A commitment to letting you grow with us and be rewarded accordingly;
  • A dynamic and young team that will always be there to support you;
  • Training in the latest technologies;
  • Coffee, fruit, snacks and a warm welcome whenever you stop by the office.