Deloitte Nvidia Solutions Architect in Chicago, Illinois
Deloitte is offering AI-based infrastructure as a service (IaaS) as a "one-stop", end-to-end managed service for customers doing AV development, test, and simulation. We will be offering this service based on Nvidia's DGX A100 SuperPOD reference design in on-prem or co-lo configurations.
A key part of building this practice will be setting up an internal DGX SuperPOD reference and training environment as a "first prototype" in our Deloitte data center, with all the automation services "built in" for offering it as a service ("pay by the drip" consumption) to automotive customers.
6+ years' experience with DGX SuperPOD, DGX A100 compute nodes, fabrics (storage/compute), management networks and software (DeepOps), and key system software for optimizing GPU communications, I/O, and application performance
4+ years' experience establishing storage management guidelines for allocating RAM/NVMe (internal storage) and external high-speed storage (DDN, NetApp, etc.) to optimize the performance and cost of running varying datasets and workloads
4+ years' experience in the design, deployment, and operation of production-grade HPC environments leveraging both Slurm and Kubernetes clusters
Deep understanding of scale-out compute, networking, and external storage architectures for optimizing the performance and acceleration of AI/HPC workloads
Management servers: infrastructure design and setup to enable user logins, provisioning (OS images and other internal infrastructure services for the pod), workload management (resource management and scheduling/orchestration), container management, and system monitoring/logging
Operations and run-time optimization of A100 compute resources (MIG partitions) for varying workloads
Working experience with Git, conda, pip, yum, apt, zypper, Julia, npm, and a multitude of other installation frameworks
Development of Docker containers to process AI/ML/DL workloads in an HPC environment.
Debugging code at all levels using gdb, strace, tcpdump, Wireshark, and other tools to find the root cause of issues.
Familiarity with deep learning frameworks and libraries such as PyTorch, TensorFlow, and cuDNN, and with integrating them with MPI libraries such as Open MPI and MVAPICH2.
Bachelor's (BE) or Master's degree, or equivalent experience, in Computer Architecture, Computer Science, Electrical Engineering, or a related field.
Ability to travel up to 50% on average, based on the work you do and the clients and industries/sectors you serve.
Limited immigration sponsorship may be available
Proven experience deploying, upgrading, migrating, and driving user adoption of sophisticated enterprise-scale systems.
Creating custom Python-based metrics and analytics solutions to profile HPC and Hadoop environments
Creating custom reporting dashboards in Grafana from Prometheus Kubernetes metrics.
Programming skills to build distributed storage and compute systems, backend services, microservices, and web technologies.
Well versed in Agile methodology.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability or protected veteran status, or any other legally protected basis, in accordance with applicable law.