
Data Management Concepts for Efficient and User-Friendly HPC

Scientific Computing Employees Students Online

Event content

Data management is challenging in general, and particularly so on HPC systems. Modern HPC systems offer several storage tiers with different characteristics, for instance the availability of backups, the storage capacity, the I/O performance, whether the storage is node-local or globally accessible, the semantics of the storage system, and how long the storage endpoint remains available, which ranges from years to quarters and sometimes only hours. This variety is confusing and entails several challenges and risks. First, users have to be aware of the different storage tiers and their performance profiles in order to optimize job runtimes, so that jobs are not left starving for data or waiting minutes for a Python environment to load. Second, users need to move their results back to a storage tier with enough capacity and durability, so that results are not lost at the end of a computation or soon after. Finally, while moving input and output data around, users have to keep track of data provenance to ensure that their research remains reproducible and comprehensible in retrospect.
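The movement of data between tiers described above can be illustrated with a minimal sketch. It is not GWDG tooling: the helper names (`sha256sum`, `copy_with_provenance`, `run_demo`) and the file names are hypothetical, and temporary directories stand in for a durable tier and a fast scratch tier. Each copy is verified by checksum and recorded, so that the origin of every file can be reconstructed later.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_provenance(src: Path, dest_dir: Path, log: list) -> Path:
    """Copy src into dest_dir, verify the copy, and record the move in log."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    checksum = sha256sum(dest)
    if checksum != sha256sum(src):
        raise IOError(f"checksum mismatch after copying {src}")
    log.append({"source": str(src), "destination": str(dest), "sha256": checksum})
    return dest


def run_demo() -> list:
    """Stage in, compute, and stage out, using temp dirs as stand-in tiers."""
    log = []
    with tempfile.TemporaryDirectory() as root:
        durable = Path(root) / "durable"  # assumption: stands in for a backed-up tier
        scratch = Path(root) / "scratch"  # assumption: stands in for a fast, short-lived tier
        durable.mkdir()
        (durable / "input.txt").write_text("1 2 3\n")

        # Stage in: copy the input to the fast tier before computing.
        staged = copy_with_provenance(durable / "input.txt", scratch, log)

        # Toy "computation" that reads only from the fast tier.
        total = sum(int(token) for token in staged.read_text().split())
        result = scratch / "result.txt"
        result.write_text(f"{total}\n")

        # Stage out: move the result back before the scratch space disappears.
        copy_with_provenance(result, durable / "results", log)
        (durable / "results" / "provenance.json").write_text(json.dumps(log, indent=2))
    return log
```

Calling `run_demo()` returns two provenance records, one for the stage-in and one for the stage-out. On a real system the two directories would be replaced by the actual paths of the chosen storage tiers.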
In addition, users sometimes do not want to copy an entire data set but only explore a small subset. For this, a data catalog can be used, in which all available data is indexed by domain-specific metadata. Once the catalog contains all of a user's data sets, concise queries can select the input data and, ideally, stage it to the appropriate storage tier as part of the job submission process. The data catalog can also be used to keep track of all data that is distributed over the different storage tiers.
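The catalog idea can be sketched as a small in-memory index. This is only an illustration under stated assumptions, not the catalog used in the course: the class names (`CatalogEntry`, `DataCatalog`), the tier labels, the paths, and the metadata keys (`instrument`, `year`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One indexed data set: where it lives and how it is described."""
    path: str
    tier: str
    metadata: dict = field(default_factory=dict)


class DataCatalog:
    """Toy catalog that indexes data sets by domain-specific metadata."""

    def __init__(self):
        self._entries = []

    def register(self, path, tier, **metadata):
        """Add a data set with its storage tier and metadata to the index."""
        self._entries.append(CatalogEntry(path, tier, dict(metadata)))

    def query(self, **criteria):
        """Return the entries whose metadata matches every given criterion."""
        return [
            entry
            for entry in self._entries
            if all(entry.metadata.get(key) == value for key, value in criteria.items())
        ]

    def by_tier(self):
        """Group the indexed paths by storage tier for an overview."""
        overview = {}
        for entry in self._entries:
            overview.setdefault(entry.tier, []).append(entry.path)
        return overview


def build_demo_catalog():
    """Fill a catalog with three made-up data sets."""
    catalog = DataCatalog()
    catalog.register("/archive/run-001.h5", "archive", instrument="microscope-a", year=2023)
    catalog.register("/archive/run-002.h5", "archive", instrument="microscope-a", year=2024)
    catalog.register("/scratch/run-003.h5", "scratch", instrument="microscope-b", year=2024)
    return catalog
```

With this sketch, `build_demo_catalog().query(instrument="microscope-a", year=2024)` selects a single data set, whose path could then be handed to a staging step at job submission, while `by_tier()` gives the overview of where the data currently resides.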

This course provides an introduction to the different storage tiers available at GWDG and the workloads each of them should be used for. It then covers the concept of a data catalog and its usage. Both parts include hands-on exercises on our HPC system.

Learning goal

Learn the concept of storage tiers and how to properly use them for best performance and data durability
Learn the concept of a data catalog and how to use it to select input data for an HPC job based on domain-specific metadata

Information about the event

Max. participants: 20
Requirements: Some basic experience working with HPC systems
Speakers: Dr. Hendrik Nolte

Details

Number: 1559
Format: Block Course
Language: English

Location

Online (BigBlueButton)

Contact

GWDG Academy
support@gwdg.de

Registration

Dates

This event includes the following dates:

	Date	Location
1.	06.06.2024 09:00 - 12:00	Online (BigBlueButton)

Similar events

Scientific Computing
Using Jupyter Notebooks on HPC
06.02.2025

Scientific Computing
Using the GWDG Scientific Compute Cluster
25.02.2025

Scientific Computing
Data Management Concepts for Efficient and User-Friendly HPC
06.03.2025