USCMS Researcher: Nick Manganelli



Postdoc dates: Jan 2023 - Jan 2025

Home Institution: University of Colorado Boulder


Project: Advancing Machine Learning Inference with Columnar Analysis at CMS Analysis Facilities

Develop and benchmark rapid Machine Learning Inference-as-a-Service together with columnar analysis at the Fermilab Elastic Analysis Facility for the next-generation HL-LHC computing model.

More information: My project proposal

Mentors:
  • Keith Ulmer (University of Colorado Boulder)

Presentations
Current Status
Benchmarks of the EAF's scalable NVIDIA Triton system have been completed. Simple scaling tests show that serial inference requests for a two-class ParticleNet model attain speedups of approximately 50x when served by Triton on a 2g.20gb MIG slice of an A100 GPU at the EAF, relative to running the same model on a typical LPC interactive node. While benchmarking, we discovered inefficiencies in the scaling of the Triton system, which resulted in too many servers spinning up for the number of requests being received. After tuning, we attained linear scaling of the net inference rate with the number of Triton server instances, indicating near-ideal scale-up parameters for the models under test.

Tests of the basic networking characteristics demonstrate that the kinds of models often deployed in analyses today, such as BDTs, can easily see a slowdown when using the Triton server infrastructure, due to the overhead of transmitting the inputs and outputs over the network. More compute-heavy models, such as ResNet50 and the aforementioned ParticleNet model, see notable and significant gains, respectively, because the inference time dominates the network transmission time (a rough timing model illustrating this trade-off is sketched at the end of this section).

Another result from our research is that the current default Triton behavior for handling multiple models simultaneously is not ideal. When multiple models receive inference requests on a given Triton server instance, each one builds and fills a request queue in main RAM. Due to contention for resources (in main RAM, over the PCIe bus, for GPU memory, or for compute), inference efficiency can drop significantly, by approximately a factor of five. This indicates that when multiple models are being requested and multiple servers are available, a better orchestration strategy is to concentrate requests for the same model (and potentially the same computation engine, such as PyTorch or TensorFlow) on the same server. Such a capability is currently reserved for NVIDIA AI Enterprise customers, however.

Regardless, Triton proves to be a highly efficient ML inference service that is easy to use at the LPC/EAF and can support requests from hundreds of worker nodes in parallel. Its use by LPC analysts should be strongly encouraged, in order to preserve the GPUs for other tasks, such as model training. The paper has been submitted to CSBS and the arXiv: ["Optimizing High Throughput Inference on Graph Neural Networks at Shared Computing Facilities with the NVIDIA Triton Inference Server"](https://arxiv.org/abs/2312.06838)
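To make the request pattern concrete, the sketch below shows roughly how a batched inference request can be sent to a Triton server from Python using the tritonclient gRPC API. The endpoint, model name, tensor names, and shapes are placeholders for illustration, not the actual EAF deployment values; in practice they must match the served model's configuration.

```python
# Minimal sketch of a batched inference request to a Triton server.
# The URL, model name, tensor names, and shapes below are placeholders,
# not the actual EAF deployment values.
import numpy as np
import tritonclient.grpc as grpcclient

TRITON_URL = "triton.example.org:8001"   # placeholder gRPC endpoint
MODEL_NAME = "particlenet_demo"          # placeholder model name in the model repository

client = grpcclient.InferenceServerClient(url=TRITON_URL)

# A batch of per-jet inputs, e.g. (batch, features, constituents);
# shape and dtype must match the model's config.pbtxt.
points = np.random.rand(1024, 16, 64).astype(np.float32)

inp = grpcclient.InferInput("INPUT__0", list(points.shape), "FP32")
inp.set_data_from_numpy(points)
out = grpcclient.InferRequestedOutput("OUTPUT__0")

result = client.infer(model_name=MODEL_NAME, inputs=[inp], outputs=[out])
scores = result.as_numpy("OUTPUT__0")    # e.g. (batch, 2) scores for a two-class model
print(scores.shape)
```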
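The trade-off between lightweight and compute-heavy models can be illustrated with a deliberately rough serial-timing model: a remote request pays network transfer plus fast GPU inference, while local evaluation pays only (slower) CPU inference. The numbers below are illustrative assumptions, not measurements from the benchmark.

```python
# Rough serial-timing model: remote inference pays network transfer plus GPU
# inference, local inference pays only CPU time. All numbers are illustrative
# assumptions, not measured values from the paper.

def remote_time_ms(payload_mb, bandwidth_mb_s, gpu_infer_ms, overhead_ms=1.0):
    """Per-batch wall time (ms) for an inference-as-a-service request."""
    return payload_mb / bandwidth_mb_s * 1e3 + gpu_infer_ms + overhead_ms

def local_time_ms(cpu_infer_ms):
    """Per-batch wall time (ms) evaluating the same model on a local CPU."""
    return cpu_infer_ms

# A small BDT: tiny inputs, but CPU inference is already sub-millisecond,
# so the fixed per-request overhead dominates and the service is a net slowdown.
print("BDT remote/local:", remote_time_ms(0.05, 1000, 0.1) / local_time_ms(0.5))

# A ParticleNet-like GNN: larger inputs, but GPU inference is far faster than
# CPU inference, so the transfer cost is easily amortized.
print("GNN remote/local:", remote_time_ms(5.0, 1000, 2.0) / local_time_ms(150.0))
```

With these assumed numbers the BDT sees a slowdown from the fixed per-request overhead, while the GNN's transfer cost is easily amortized by the much faster GPU inference, matching the qualitative behavior observed in the benchmarks.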
