Deloitte Autonomous Vehicle Infrastructure Systems Lead, Manager - Managed AI in San Jose, California
The Deloitte Connected and Autonomous Vehicle (CAV) team is catalyzing and shaping the Autonomous Vehicle (AV) market through a suite of turnkey, as-a-service solutions that deliver improved performance and lower total cost of ownership. These solutions will empower Automotive customers to realize their autonomy ambitions as efficiently as possible.
High Level Role
We are looking for a seasoned, "hands-on" HPC/AI infrastructure systems leader who will drive the scope, detailed design, and deployment of AV infrastructure across on-prem, cloud, and hybrid environments. The key success measure of this prototype will be the delivery of Deloitte's offering in POD configurations as a service for our customers with guaranteed SLAs and TCO targets.
Establish a detailed DGX A100 specification that reflects a representative customer's planning, deployment, and ongoing operations optimization requirements for TCO, throughput, scalability, and flexibility across varied workloads
Set up the DGX/Super POD reference environment, including DGX A100 compute nodes, storage and compute fabrics, management networks and software (DeepOps), key system software for optimizing GPU communications, I/O, and application performance, and user run-time tools for SLURM and Kubernetes containers
Design and document the most efficient setup to meet success metrics (TCO, performance, scale). Specific areas of focus:
Network switch & fabric considerations for non-blocking, scalable bandwidth needs for best performance with varying dataset sizes & locations
Storage and caching hierarchy implementations for training vs. inference workloads. Establish storage management guidelines for allocating RAM/NVMe (internal storage) and external high-speed storage (DDN, NetApp, etc.) to optimize the performance and cost of running varying datasets and workloads. Establish rules for when to enable the GPUDirect Storage (GDS) feature for workloads that need lower latency and faster I/O.
Management servers: infrastructure design and setup to enable user logins, provisioning (OS images and other internal infrastructure services for the pod), workload management (resource management and scheduling/orchestration), container management, and system monitoring and logging
Operations/run-time optimization of A100 compute resources (MIG partitions) for varying workloads to maximize the utilization and throughput of jobs being scheduled in a given node cluster
Validate the commercial model with the MVP operational run/playbook
Bachelor's degree or equivalent experience in Computer Architecture, Computer Science, Electrical Engineering, or a related field. Advanced degree preferred
6+ years of proven experience in the design, deployment, and operation of production-grade HPC environments leveraging both SLURM and Kubernetes clusters
Deep understanding of scale-out compute, networking, and external storage architectures for optimizing the performance and acceleration of AI/HPC workloads
Proven experience deploying, upgrading, migrating, and driving user adoption of sophisticated enterprise scale systems.
Prior software and solutions development background, with a proven ability to demonstrate complex new technologies
Programming skills to build distributed storage and compute systems, backend services, microservices, and web technologies
Well versed in agile methodology
Comfortable in a customer-focused, fast-paced environment
Ability to travel up to 50% on average, based on the work you do and the clients and industries/sectors you serve
Limited immigration sponsorship may be available