USCMS Researcher: Jin Zhou



Postdoc dates: Aug 2024 - Sep 2025

Home Institution: University of Notre Dame


Project: Scalable Data Analysis Applications for High Energy Physics

  • Accelerate CMS analysis workflows, focusing on those using Coffea, Dask, and TaskVine.
  • Reduce storage usage in data-intensive workflows to support more ambitious computations.
  • Improve fault tolerance on unreliable clusters through replication and checkpointing.
  • Explore graph optimization strategies to minimize makespan using real-time information.

More information: My project proposal

Mentors:
  • Douglas Thain (Cooperative Computing Lab, University of Notre Dame)

  • Kevin Lannon (Physics department, University of Notre Dame)

Presentations

Current Status


2025 Q1

  • Progress
    • Developed the large-input-first (LIF) algorithm and the pruning algorithm, which together reduce storage consumption by over 90% in runs with hundreds of thousands of tasks (a minimal sketch follows this list).
    • Enhanced resource allocation and temp-file replication on the task-scheduler side.
    • Submitted a paper to IPDPS 2025; it was rejected.
  • Next steps
    • Sketch a paper on effectively using limited storage to complete very large computations.
    • Develop an algorithm that divides long-running tasks in DV5 into smaller ones. Splitting reduces the work lost when a worker is evicted but increases the latency of scheduling many small tasks, so the plan is to strike a balance between scheduling overhead and fault tolerance (see the cost-model sketch after this list).
    • Develop an algorithm that checkpoints remote temp files in a timely manner, reducing the risk of losing critical files.
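
To make the LIF and pruning ideas above concrete, here is a minimal sketch in Python. It is illustrative only, not TaskVine's actual scheduler: the Task class and run_lif function are hypothetical, and the model assumes every task declares which temp files it reads and writes. LIF prefers the ready task that consumes the most resident bytes, so its inputs become prunable as early as possible; pruning deletes a temp file once its last consumer finishes.

```python
# Hypothetical model of Large-Input-First (LIF) scheduling plus pruning of
# consumed temp files. Illustrative sketch, not TaskVine's implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    inputs: frozenset[str]                 # temp files this task reads
    outputs: tuple[tuple[str, int], ...]   # (temp file, size in bytes) it writes

def run_lif(tasks: list[Task]) -> int:
    """Simulate LIF ordering with pruning; return peak temp storage in bytes."""
    consumers: dict[str, int] = {}         # file -> tasks still needing it
    for t in tasks:
        for f in t.inputs:
            consumers[f] = consumers.get(f, 0) + 1

    resident: dict[str, int] = {}          # file -> size currently on disk
    peak = used = 0
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if t.inputs <= resident.keys()]
        if not ready:
            break                          # remaining tasks wait on absent producers
        # LIF: run the task that consumes the most resident data first.
        chosen = max(ready, key=lambda r: sum(resident[f] for f in r.inputs))
        pending.remove(chosen)
        for f, size in chosen.outputs:
            resident[f] = size
            used += size
        peak = max(peak, used)
        # Pruning: delete any input whose last consumer just finished.
        for f in chosen.inputs:
            consumers[f] -= 1
            if consumers[f] == 0:
                used -= resident.pop(f)
    return peak
```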
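
The splitting trade-off mentioned under next steps can also be quantified with a back-of-the-envelope model; all numbers and the function name below are illustrative assumptions, not measurements from DV5. With Poisson worker evictions at rate λ per hour, a chunk of length L loses about λL²/2 hours of work on average, so n chunks of a W-hour task lose roughly λW²/(2n) in total, while each chunk adds a fixed scheduling overhead s; minimizing λW²/(2n) + s·n gives n = W·√(λ/2s).

```python
# Hypothetical cost model for splitting a long task into n sub-tasks;
# the parameters below are assumptions for illustration, not measurements.
import math

def optimal_splits(total_hours: float, evict_rate: float, overhead_hours: float) -> int:
    """Minimize expected rerun cost plus scheduling overhead.

    cost(n) = evict_rate * total_hours**2 / (2 * n) + overhead_hours * n,
    minimized at n = total_hours * sqrt(evict_rate / (2 * overhead_hours)).
    """
    n = total_hours * math.sqrt(evict_rate / (2 * overhead_hours))
    return max(1, round(n))

# Example: a 10-hour task, one eviction per ~20 hours per worker,
# and ~30 seconds of scheduling overhead per sub-task -> about 17 chunks.
print(optimal_splits(10.0, evict_rate=0.05, overhead_hours=30 / 3600))
```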


2025 Q2

  • Progress
    • Paper “Effectively Exploiting Node-Local Storage For Data-Intensive Scientific Workflows” submitted to SC ’25.
    • Implemented checkpointing and replication strategies in TaskVine, both of which significantly improve workflow performance on unreliable clusters (a usage sketch follows this list).
    • Resolved fundamental issues and inefficiencies in TaskVine; the scheduler now handles very large workflows efficiently. Most recently, we completed an 8-million-task workflow in 20 hours.
    • Began developing a web-based visualization tool for TaskVine logs, optimized for fast log parsing, CSV generation, and display of key statistics; available on GitHub.
  • Next steps
    • Following discussions with team members, improve scheduling efficiency by better handling pending and ready tasks, an issue that has caused severe slowdowns on unreliable clusters and has remained unresolved for over half a year (see the queue sketch after this list).
    • Finalize our recent fixes and improvements in TaskVine and deliver a stable Conda release by the end of June that all our users are happy with.
    • Study the implications and challenges of scheduling massive workflows with millions of tasks.
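
As a usage illustration of the replication and checkpointing strategies above, here is a minimal TaskVine sketch using the ndcctools.taskvine Python API. The commands and file names are hypothetical, and the "temp-replica-count" tuning knob is our assumption based on the replication work described here, so consult the TaskVine documentation for the exact parameter name. Temp outputs live on workers and are replicated; the merged result is declared as a regular file so it lands back at the manager, which is the checkpointing pattern in its simplest form.

```python
# Minimal TaskVine sketch: replicated temp files plus a manager-side
# checkpoint. Commands are hypothetical; the tuning knob name is an
# assumption -- check the TaskVine docs for the exact parameter.
import ndcctools.taskvine as vine

m = vine.Manager(9123)

# Assumption: keep 2 replicas of each temp file so one worker eviction
# does not force a rerun of the producer task.
m.tune("temp-replica-count", 2)

parts = []
for i in range(100):
    part = m.declare_temp()                              # node-local intermediate
    t = vine.Task(f"./produce --seed {i} -o part.dat")   # hypothetical command
    t.add_output(part, "part.dat")
    m.submit(t)
    parts.append(part)

# Checkpoint: merge the temp files into a regular (manager-side) file so
# the result survives even if every worker disappears afterwards.
merged = m.declare_file("checkpoints/merged.dat")
merge = vine.Task("cat part.*.dat > merged.dat")         # hypothetical merge step
for i, f in enumerate(parts):
    merge.add_input(f, f"part.{i}.dat")
merge.add_output(merged, "merged.dat")
m.submit(merge)

while not m.empty():
    t = m.wait(5)
    if t:
        print(f"task {t.id} finished with exit code {t.exit_code}")
```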
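
The pending/ready issue in the first next-step can be illustrated with a small data structure. This is a hypothetical sketch, not TaskVine's internals: tasks blocked on unavailable inputs never enter the dispatch queue, so each scheduling pass touches only tasks that can actually run, and a task is promoted exactly once, when its last missing input arrives.

```python
# Hypothetical pending/ready split: the scheduler scans only dispatchable
# tasks; blocked tasks are indexed by the files they are waiting for.
from collections import defaultdict, deque

class TaskQueue:
    def __init__(self):
        self.ready = deque()                 # tasks with all inputs available
        self.blocked_on = defaultdict(list)  # missing file -> waiting tasks
        self.missing = {}                    # task -> number of missing inputs

    def submit(self, task, missing_inputs):
        """Queue a task; it stays pending until all its inputs exist."""
        if not missing_inputs:
            self.ready.append(task)
            return
        self.missing[task] = len(missing_inputs)
        for f in missing_inputs:
            self.blocked_on[f].append(task)

    def file_arrived(self, f):
        """Promote tasks whose last missing input just became available."""
        for task in self.blocked_on.pop(f, ()):
            self.missing[task] -= 1
            if self.missing[task] == 0:
                del self.missing[task]
                self.ready.append(task)

    def next_ready(self):
        """O(1) dispatch: never scans pending tasks."""
        return self.ready.popleft() if self.ready else None
```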


Contact me: