Linux/HPC Systems Engineer Resume
SUMMARY
- Linux HPC and infrastructure engineer with 6 years of experience building and administering scalable, robust computing environments. Passionate about new technologies, with professional experience engineering and administering low-latency environments, petabyte-scale filesystems, and critical compute clusters from soup to nuts.
- Ability to automate daily administration tasks by writing and modifying scripts, scheduling cron jobs, and writing Ansible playbooks.
- Strong skills in network services and tools: TCP/IP, NFS, DNS, LDAP, DHCP, Samba, YUM, SSH, etc.
- Experience using Ansible as IaC to handle automated agentless configuration of new servers.
- In-depth experience configuring and administering Slurm and SGE/OGE job schedulers.
- Install, configure, and manage CentOS, RHEL 5.x/6.x/7.x, ESXi 5.x/6.0, vCenter, etc.
- Strong troubleshooting skills for monitoring server performance and resolving hardware and network-related issues.
- Strong technical background and knowledge of system security services, e.g., SELinux and iptables.
- Working knowledge of GPFS administration of petabyte-scale parallel filesystems.
- Experience running an oVirt/RHEV cluster to deploy and configure VMs on shared volumes.
- Close familiarity with InfiniBand fabrics at the software and hardware layers.
- Experience running many HPC-optimized tools such as Spack, Singularity, and more.
- Automate installation and configuration of new servers using Kickstart and PXE boot.
- Ability to write bash scripts to greatly increase cluster efficiency, monitor workloads, etc.
- Skilled in compiling tricky open-source software that uses shared libraries on NFS mounts to run across clusters with minimal disk overhead.
- Knowledge of Apache/nginx configurations. Ability to incorporate modules to ensure security, reverse proxy to internal apps, restrict access to specific IP ranges, etc.
- Knowledge of AWS administration, including deploying and configuring EC2 instances and S3.
- Ability to handle pressure and to learn and adapt quickly in a fast-paced environment.
- Excellent written and verbal communication skills.
TECHNICAL SKILLS
Operating systems: CentOS, RHEL, ESXi, Ubuntu, CoreOS, Windows
Storage: GPFS, Lustre, IBM WOS, SFA
Networking: TCP/IP, InfiniBand, RDMA, NFS, DNS, NTP, HTTP/S, SMB, DHCP, TFTP
Software: Ansible, Slurm, SGE/OGE/UGE, Apache, Spack, Nagios, Check MK, Git, Spacewalk/Satellite, Kubernetes, Cryo-EM tools, MySQL, PostgreSQL, Docker, RELION, Restic, RStudio Server, Shiny, cryoSPARC, oVirt/RHEV
Cloud: AWS, Wasabi, Box, Code42
Programming: Bash, Ansible
PROFESSIONAL EXPERIENCE
Confidential
Linux/HPC Systems Engineer
Responsibilities:
- Manage and perform all systems and automation tasks in complex HPC infrastructure consisting of thousands of compute nodes, schedulers, management servers, and critical GPFS filesystems.
- Assist users with SGE and Slurm scheduling errors; configure schedulers from scratch.
- Write public-facing documentation and user tutorials on Slurm, scientific programs, etc.
- Perform general GPFS administration of 4.8 PB of total cluster space, including adding nodes to the GPFS cluster, setting disk and inode quotas, analyzing high waiters, troubleshooting errors, etc.
- Write scripts to accomplish wide ranges of tasks, including custom scripts for job monitoring, memory overallocations, automatic priority share modifications, custom squeue output, etc.
- Write internal documentation for common issues, data restoration processes, etc.
- Assist users with all technical issues, from permission/access problems to compilation errors, etc.
- Tailor fairshare to provide near-equitable distribution of resources.
- Manage Apache web servers, proxies, and SSL certificates, achieving A+ Qualys SSL Labs ratings.
- Meet with vendors to identify and weigh new technologies in the HPC ecosystem.
- Write alerts that parse GPFS command output to identify users and jobs overloading the parallel filesystem.
- Build GPU cluster and configure new Slurm scheduler for hardware comprising around 50 TB of total RAM, hundreds of GPUs, and thousands of physical cores; configure hardware, GRES, etc.
- Write wrapper scripts around Slurm utilities to identify memory offenders, cluster/job statuses, etc.
- Install IB cards, drivers, and cables. Identify bottlenecks by creating an IB network topology model.
- Create low latency mounts utilizing RDMA over IB.
- Configure cgroups on non-compute machines for protection from intensive programs users mistakenly run on submit hosts/gateway machines.
- Quickly learn and document the compilation and administration processes of hundreds of new scientific software packages to assist scientists when needed.
- Perform all SSL certificate updates, including migrating all 30+ sites to the new AddTrust root bundle.
- Configure RStudio Server Pro and Shiny apps to authenticate against local LDAP by group.
- Write Ansible playbooks to automate many server deployments, including total node deployments, Apache servers, SGE clients, Slurm scheduler and clients, SSSD/LDAP integration, etc.
- Create new PXE boot/TFTP/DHCP server. Add support for UEFI bootloaders.
- Deploy GitLab server with support for CI/CD, shared runners, DTR, LDAP integration, etc.
- Deploy Spacewalk server to display errata, roll out patches, display power consumption, etc.
- Configure fail2ban to ban malicious SSH login attempts and to enable Apache jails (noscript, badbots, overflows), etc.
- Monitor InfiniBand traffic and topology; remap cables to eliminate bottlenecks.
- Recompile Mellanox drivers after kernel version updates.
- Manage iptables rules to ensure security and network segmentation of critical servers.
- Ensure systems are backed up using Amanda: add new servers to the disk list, open necessary ports, and run a dry-run backup to confirm operational status.
- Assist users with compilation errors in R; add support for multiple versions of R, Python, and many other scientific software packages using Spack over shared NFS mounts.
- Create CoreOS/Terraform Kubernetes cluster for automating DNA/RNA sequencing workflows.
- Manage four large compute clusters, each with its own SGE (2) or Slurm (2) scheduler configuration.
- Deploy Apache web servers with SSL termination to backend internal apps.
- Create and manage Apache virtual hosts and SSL certificates to host multiple SSL domains under one IP.
- Create and manage autofs with LDAP/SSSD. Define mount points, rules, homes, etc. in the LDAP database.
- Deploy all-in-one AWS Cryo-EM tool to greatly simplify submitting Cryo-EM jobs to AWS GPUs.
- Investigate hardware errors on nodes, resolve hardware issues if not under warranty.
- Create websites and add HTML templates for researchers to publish their findings, etc.
- Configure license servers for multiple cluster-wide software vendor applications.
- Install drivers for various NVIDIA GPUs; install cuDNN libraries in central repo for version selection.
- Configure iDRAC on servers for remote console login. Update BIOS, etc. from LFC.
- Configure Globus for fast cross-institutional data transfers over a 10 Gb internet connection.
- Install internal chrooted DNS server; create DNS records as new servers are added.
- Configure MiSeq instruments to spool to a network Samba share for automated pipeline runs.
- Resolve a wide array of user issues including errors from compilation, tunneling, schedulers, etc.
- Install and configure packages with Spack, a package manager designed for HPC use cases.
- Deploy and configure Singularity containers, a container solution for HPC use cases.
- Rack servers, troubleshoot boot and PXE errors, and perform other hands-on tasks.
- Install various TensorFlow versions in Python virtual environments for cluster-wide access.
- Configure Nagios and Check MK to monitor servers’ status, send e-mail alerts, etc.
- Forward logs from crucial servers to the institutional Splunk instance for tracking.
- Configure PDUs to monitor power consumption and confirm amperage balance over circuits.
- Coordinate with institutional technology department to open ports and provide justification, unban users banned by their IPS, grant public DNS records, request SSL certificates, etc.
Confidential, Jersey City, NJ
Linux Systems Engineer
Responsibilities:
- Handle day-to-day operations; install software, provision systems, apply patches, manage file systems, monitor performance, troubleshoot alerts, and resolve tickets.
- Create, extend, and reduce logical volumes and filesystems on ext2, ext3, ext4, and XFS.
- Experience in implementation and maintenance of VMware, DNS, DHCP, SMB, NFS and SMTP.
- Set up NFS and Samba file sharing on Linux for Windows servers and machines.
- Configure and troubleshoot a local YUM repository shared network-wide over NFS.
- Configure LAMP and LEMP stack servers for web server and load balancing deployments.
- Assess and resolve ESM (Enterprise Security Management) violations to ensure compliance.
- Create VMs on ESXi hosts, deploy Linux operating systems, and add objects to datastores.
- Seamlessly migrate VMs to different hosts and datacenters with no downtime using vMotion.
- Experience writing/modifying shell scripts for process automation of systems, backups, etc.
- Monitor systems daily using Nagios; resolve or escalate reported issues.
- Regularly monitor system activities like CPU, memory, disk, swap, etc.
- Document problems and solutions of issues to assist in future analysis and troubleshooting.
- Perform routine system hardening and monitoring, including restricting vulnerable services, blocking unnecessary ports, disabling root logins, monitoring new vulnerabilities, etc.
- Enable iptables firewall rules, disable services, and block ports to ensure system security.
- Work with iLO and iDRAC software to remotely manage and troubleshoot servers.
Confidential
Linux Systems Administrator
Responsibilities:
- Administer and support CentOS and RHEL servers. Install and upgrade OS and packages.
- Experience configuring user and group accounts in local and LDAP environments.
- Perform routine disk management including partitioning, mounting, and troubleshooting disks.
- Maintain and configure ESXi host servers; create, clone, and deploy VMs to other hosts.
- Configure and troubleshoot NIS, LDAP, DNS, DHCP, VSFTP, SSH, and NFS environments.
- Ensure security of systems: monitor iptables rules and SELinux, and resolve vulnerabilities.