See all the jobs at Caixa Mágica Software here:
HPC Cluster Engineer
| Other | Full-time | Partially remote
Overview:
A multinational conglomerate holding company with a special interest in areas such as Smart & Autonomous Cars, Mobility & Connected Services, Smart City & Energy Systems, Aerospace & Defence Innovation, and Venture Building & Business Scaling Operations. The Group is headquartered in Brussels, with a presence in Japan, Greece, Cyprus & Portugal.
What will you do:
- Administration of a Linux-based GPU HPC cluster for Artificial Intelligence (AI)
- Support the VRED rendering cluster and the HPC cluster for Computer Aided Engineering (CAE)
- Maintenance of in-house shell scripts
- Investigation of failed computations, problem determination, incident resolution, system support, and coordination with vendors
- Support and educate users with no Linux experience
- Installation, configuration, and tuning of hardware, OS, and software for all R&D Linux workstations
- Manage patching of Linux systems, including offline systems
- Manage network aspects (DNS, DHCP, internet access, …) with the Network Team
- Perform daily monitoring, manage the backup environment (Ceph), and ensure cluster high availability
- Support artificial intelligence engineers in setting up development environments on the GPU HPC cluster
- Centralize long-term environment management
- Support the setup of a driving simulator based on a real-time OS
- Support the setup of an integrated engineering development environment: Linux laptops, office workstations, in-car computers
- Set up the patching environment, including workstations without an internet connection
- Ensure the Linux environment matches company security standards
- Collaborate with other technical teams and integrate Linux workstations into the AD domain
- Deploy and maintain the AWS AI cluster
- Support the AWS VRED and CAE clusters
As backup for other team members:
- Administration of HPC cluster for Computer Aided Engineering (CAE)
- L1/L2 support on the HPC cluster for the customer
- Maintain applications running on the cluster
- Manage patching and upgrade of the managed environment
- Monitor regular backups and ensure cluster high availability
What are we looking for?
- Linux OS and Server knowledge
- Cluster management
- Infrastructure administration
- Virtualization knowledge
- Understanding and operation of storage solutions
- AI understanding is an asset
Keywords that are important in HPC systems:
- Workload manager: Slurm, PBS Parallel
- File system: Lustre, Ceph, BeeGFS
- HPC management tools: Bright or NVIDIA, xCAT
- AI keywords: GPU, Docker, Python
- OS: RHEL, Ubuntu, Rocky Linux
What can you expect from us?
- A permanent job contract for a long-term project;
- Tech equipment + SIM Card + personal smartphone;
- Health and Life Insurance;
- Social events and team buildings;
- A commitment to letting you grow with us and be rewarded accordingly;
- A dynamic and young team that will always be there to support you;
- Training in the latest technologies;
- Coffee, fruit, snacks and a warm welcome when you pass by the office.