HPC Cluster Engineer at Caixa Mágica Software

See all the jobs at Caixa Mágica Software here: http://caixamagica.recruiterbox.com/jobs

HPC Cluster Engineer

Lisboa, Portugal | Other | Full-time | Partially remote

Overview:

A multinational conglomerate holding company with a special interest in areas such as Smart & Autonomous Cars, Mobility & Connected Services, Smart City & Energy Systems, Aerospace & Defence Innovation, Venture Building & Business Scaling Operations. The Group is headquartered in Brussels with presence in Japan, Greece, Cyprus & Portugal.

What will you do:

Administration of a Linux-based GPU HPC cluster for Artificial Intelligence (AI)
Support VRED Rendering Cluster and HPC cluster for Computer Aided Engineering (CAE)
Maintenance of in-house shell scripts
Investigation of failed computations, problem determination, incident resolution, system support, and vendor coordination
Support and educate users with no Linux experience
Installation, configuration, and tuning of hardware, OS, and software for all R&D Linux workstations
Manage patching of Linux systems, including offline systems
Manage network aspects (DNS, DHCP, internet access, …) with Network Team
Perform daily monitoring and management of the backup environment (Ceph) and ensure cluster high availability
Support artificial intelligence engineers in setting up development environments on the GPU HPC cluster
Establish long-term, centralized environment management
Support the setup of a driving simulator based on a real-time OS
Support the setup of an integrated engineering development environment: Linux laptops, office workstations, in-car computer
Set up a patching environment, including for workstations without an internet connection
Ensure the Linux environment matches company security standards
Collaborate with other technical teams and integrate Linux workstations into the AD domain
Deploy and maintain the AWS AI cluster
Support the AWS VRED and CAE clusters

As backup for other team members:

Administration of HPC cluster for Computer Aided Engineering (CAE)
L1/L2 support on the HPC cluster for the customer
Maintain applications running on the cluster
Manage patching and upgrades of the managed environment
Monitor regular backups and ensure cluster high availability

What are we looking for?

Linux OS and Server knowledge
Cluster management
Infrastructure administration
Virtualization knowledge
Understanding and operation of storage solutions
AI understanding is an asset

Keywords that are important in HPC systems:

Workload manager: Slurm, PBS Parallel
File systems: Lustre, Ceph, BeeGFS
HPC management tools: Bright or NVIDIA, xCAT
AI keywords: GPU, Docker, Python
OS: RHEL, Ubuntu, Rocky Linux

What can you expect from us?

A permanent job contract for a long-term project;
Tech equipment + SIM Card + personal smartphone;
Health and Life Insurance;
Social events and team buildings;
A commitment to letting you grow with us and be rewarded accordingly;
A dynamic and young team that will always be there to support you;
Training in the latest technologies;
Coffee, fruit, snacks, and a warm welcome whenever you stop by the office.
