for Employees, Students

Data Management Concepts for Efficient and User-Friendly HPC

Scientific Computing, Employees, Students, Online

Event content

Data management is challenging in general, and particularly so on HPC systems. Modern HPC systems offer several storage tiers with different characteristics, for instance the availability of backups, the storage capacity, the I/O performance, whether the storage is node-local or globally accessible, the semantics of the storage system, and how long the storage endpoint remains available, which ranges from years to quarters and sometimes only hours. This variety is confusing and entails a number of challenges and risks. First, users have to know the different storage tiers and their performance profiles in order to optimize job runtimes, so that jobs are not left starving for data or waiting minutes for a Python environment to load. Second, users need to move their results back to a storage tier with enough space and durability so that the results are not lost at the end of a computation or soon after. Finally, while moving input and output data around, users have to keep track of data provenance to ensure that their research remains reproducible and comprehensible in retrospect.
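
The stage-in, compute, stage-out pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the directory names, the `run_staged` and `sha256sum` helpers, and the placeholder computation are hypothetical and do not reflect the actual storage layout or tooling at GWDG. Checksums stand in for a very simple form of provenance tracking.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_staged(input_file: Path, fast_dir: Path, durable_dir: Path) -> Path:
    """Stage in, compute on the fast tier, stage out, and record provenance."""
    fast_dir.mkdir(parents=True, exist_ok=True)
    durable_dir.mkdir(parents=True, exist_ok=True)

    # Stage in: copy the input to the fast (for example node-local) tier.
    staged_input = Path(shutil.copy2(input_file, fast_dir))

    # Placeholder computation (hypothetical): upper-case the input text.
    result = fast_dir / "result.txt"
    result.write_text(staged_input.read_text().upper())

    # Stage out: copy the result to a tier with enough space and durability.
    final = Path(shutil.copy2(result, durable_dir))

    # Provenance: record which input produced which output, with checksums.
    record = {
        "input": str(input_file),
        "input_sha256": sha256sum(input_file),
        "output": str(final),
        "output_sha256": sha256sum(final),
    }
    (durable_dir / "provenance.json").write_text(json.dumps(record, indent=2))
    return final


if __name__ == "__main__":
    # Temporary directories stand in for real storage tiers in this sketch.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "project" / "input.txt"
        source.parent.mkdir()
        source.write_text("hello hpc")
        out = run_staged(source, root / "scratch", root / "project" / "results")
        print(out.read_text())
```

On a real system the fast and durable directories would be replaced by the paths of the respective storage tiers, and the copy steps would typically be part of the job script.
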
In addition, users sometimes do not want to copy an entire data set but only explore a small subset of it. For this purpose, a data catalog can be used, in which all available data is indexed by domain-specific metadata. Once the catalog contains all of a user's data sets, concise queries can select the input data and, ideally, stage it to the appropriate storage tier as part of the job submission process. The data catalog can also be used to keep an overview of all data distributed across the different storage tiers.
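
To make the idea of a data catalog more concrete, the following minimal sketch indexes data sets by metadata and selects a subset with a query. It is not the catalog used in the course: the class names, the `register`, `query`, and `staging_plan` methods, the file paths, and the metadata keys are all hypothetical, and a real catalog would be backed by a database or search index rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One indexed data set: where it lives and how it is described."""
    path: str
    tier: str
    metadata: dict = field(default_factory=dict)


class DataCatalog:
    """In-memory index of data sets, searchable by metadata (illustrative)."""

    def __init__(self):
        self._entries = []

    def register(self, path, tier, **metadata):
        """Index a data set under arbitrary domain-specific metadata."""
        self._entries.append(CatalogEntry(path, tier, dict(metadata)))

    def query(self, **criteria):
        """Return entries whose metadata matches every given key/value pair."""
        return [
            entry for entry in self._entries
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        ]

    def staging_plan(self, target_tier, **criteria):
        """List matching paths that are not yet on the target tier."""
        return [e.path for e in self.query(**criteria) if e.tier != target_tier]


# Hypothetical data sets spread over two storage tiers.
catalog = DataCatalog()
catalog.register("/archive/run_001.nc", "archive", model="climate", year=2020)
catalog.register("/archive/run_002.nc", "archive", model="climate", year=2021)
catalog.register("/scratch/run_003.nc", "scratch", model="climate", year=2021)
catalog.register("/archive/scan_17.h5", "archive", model="imaging", year=2021)

# Select only the subset of interest instead of copying everything.
selected = catalog.query(model="climate", year=2021)
# Determine which of the selected files still need to be staged.
to_stage = catalog.staging_plan("scratch", model="climate", year=2021)
print([entry.path for entry in selected])
print(to_stage)
```

The staging plan only lists what would have to be copied; hooking it into job submission is left out here because that depends on the scheduler and site setup.
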

This course provides an introduction to the different storage tiers available at GWDG and the workloads for which each should be used. It then covers the concept of a data catalog and its usage. Both parts include hands-on exercises on our HPC system.

Learning goals

  1. Learn the concept of storage tiers and how to properly use them for best performance and data durability
  2. Learn the concept of a data catalog and how to use it to select input data for an HPC job based on domain-specific metadata


Information about the event

Max. participants

20

Requirements

Basic experience working with HPC systems

Speakers
Hendrik Nolte

Details

Number: 1559
Format: Block Course
Language: English

Location

Online (BigBlueButton)


Contact

GWDG Academy
support@gwdg.de

Dates

This event includes the following dates:

No. | Date | Location
1 | 06.06.2024 09:00 - 12:00 | Online (BigBlueButton)