HPC Cluster Engineer

Lisboa, Portugal | Other | Full-time | Partially remote

Overview:

The company is a multinational conglomerate holding company with a special interest in areas such as Smart & Autonomous Cars, Mobility & Connected Services, Smart City & Energy Systems, Aerospace & Defence Innovation, and Venture Building & Business Scaling Operations. The Group is headquartered in Brussels, with a presence in Japan, Greece, Cyprus and Portugal.

What will you do?

  • Administration of a Linux-based GPU HPC cluster for Artificial Intelligence (AI)
  • Support the VRED rendering cluster and the HPC cluster for Computer Aided Engineering (CAE)
  • Maintenance of in-house shell scripts
  • Investigation of failed computations, problem determination, incident resolution, system support and coordination with vendors
  • Support and educate users with no Linux experience
  • Installation, configuration and tuning of hardware, OS and software for all R&D Linux workstations
  • Manage patching of Linux systems, including offline systems
  • Manage network aspects (DNS, DHCP, internet access, …) with the Network Team
  • Perform daily monitoring, manage the backup environment (Ceph) and ensure cluster high availability
  • Support artificial intelligence engineers in setting up development environments on the GPU HPC cluster
  • Build long-term, centralized environment management
  • Support the setup of a driving simulator based on a real-time OS
  • Support the setup of an integrated engineering development environment: Linux laptops, office workstations, in-car computers
  • Set up a patching environment, including for workstations without an internet connection
  • Ensure the Linux environment matches company security standards
  • Collaborate with other technical teams and integrate Linux workstations into the AD domain
  • Deploy and maintain the AWS AI cluster
  • Support the AWS VRED and CAE clusters

As backup for other team members:

  • Administration of the HPC cluster for Computer Aided Engineering (CAE)
  • L1/L2 support on the HPC cluster for the customer
  • Maintain applications running on the cluster
  • Manage patching and upgrades of the managed environment
  • Monitor regular backups and ensure cluster high availability

What are we looking for?

  • Linux OS and server knowledge
  • Cluster management
  • Infrastructure administration
  • Virtualization knowledge
  • Understanding and operation of storage solutions
  • AI understanding is an asset

Keywords that are important in HPC systems:

  • Workload manager: Slurm, PBS Parallel
  • File system: Lustre, Ceph, BeeGFS
  • HPC management tools: Bright or NVIDIA, xCAT
  • AI keywords: GPU, Docker, Python
  • OS: RHEL, Ubuntu, Rocky Linux

What can you expect from us?

  • A permanent job contract for a long-term project;
  • Tech equipment + SIM Card + personal smartphone;
  • Health and Life Insurance;
  • Social events and team-building activities;
  • A commitment to letting you grow with us and be rewarded accordingly;
  • A dynamic and young team that will always be there to support you;
  • Training in the latest technologies;
  • Coffee, fruit, snacks and a warm welcome whenever you stop by the office.