IO500 - The Lists

The Lists

We publish multiple lists for each BoF at SC and ISC, and we also maintain the current, most up-to-date lists. We intend not to modify a list after its release date except in exceptional circumstances. However, we allow list metadata to be improved and clarified at the request of the submitters. We publish a Historic List of all submissions received, along with multiple filtered lists derived from the Historic List. We maintain a Full List, which is the subset of submissions that were valid according to the list-specific rules in place at the time of the list’s publication.

Our primary lists are Ranked Lists which show only opted-in submissions from the Full List and only the best submission per storage system. We have two ranked lists: the IO500 List for submissions which ran on any number of client nodes and the 10 Node Challenge list for only those submissions which ran on exactly ten client nodes.

In summary, each BoF has the following lists:

Historic List: ranking of all submissions ever received
Full List: subset of the Historic List containing the submissions that satisfy the existing IO500 submission guidelines
Research List: ranking of the research system submissions. This is a subset of the Full List of submissions, showing only one highest-scoring result per storage system. This list also contains all valid IO500 submissions prior to the creation of the Research List.
Production List: ranking of production system submissions. This is a subset of the Full List of submissions, showing only one highest-scoring result per storage system. Submitters who want a submission that is currently on the Research List to be on the Production List should contact the IO500 Steering Committee.
10 Client-node Research List: ranking of the research system submissions that used exactly ten client nodes. This is a subset of the Full List of submissions, showing only one highest-scoring result per storage system. This list also contains all valid IO500 10 client node submissions prior to the creation of the Research List.
10 Client-node Production List: ranking of the production system submissions that used exactly ten client nodes. This is a subset of the Full List of submissions, showing only one highest-scoring result per storage system. Submitters who want a submission that is currently on the 10 client node Research List to be on the 10 client node Production List should contact the IO500 Steering Committee.

Please note that the Ranked Lists show only the best submission for each storage system, so if a storage system has multiple submissions, only the one with the highest overall score is shown in the Ranked Lists. All submissions will appear in the Full and Historic Lists.
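The relationship between the lists can be sketched in a few lines of Python. This is only an illustration of the filtering described above, not the code used to build the published lists. The `Submission` fields, the `ranked` and `build_lists` helpers, and the dictionary keys are all hypothetical names, and the sketch ignores the opt-in step and the special handling of submissions that predate the Research List.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # All field names are assumptions made for this sketch.
    storage_system: str  # identifier of the storage system
    score: float         # overall IO500 score
    client_nodes: int    # number of client nodes used in the run
    valid: bool          # satisfied the rules in place at publication time
    production: bool     # True for production systems, False for research


def by_score(submissions):
    """Sort submissions from highest to lowest overall score."""
    return sorted(submissions, key=lambda s: s.score, reverse=True)


def ranked(submissions):
    """Keep only the highest-scoring submission per storage system."""
    best = {}
    for s in submissions:
        current = best.get(s.storage_system)
        if current is None or s.score > current.score:
            best[s.storage_system] = s
    return by_score(best.values())


def build_lists(historic):
    """Derive the filtered lists from the Historic List."""
    full = [s for s in historic if s.valid]
    research = [s for s in full if not s.production]
    production = [s for s in full if s.production]
    return {
        "historic": by_score(historic),
        "full": by_score(full),
        "research": ranked(research),
        "production": ranked(production),
        "research_10": ranked(s for s in research if s.client_nodes == 10),
        "production_10": ranked(s for s in production if s.client_nodes == 10),
    }
```

Note that the ten-node filter is applied before the best-per-system selection, so a storage system can appear on a 10 Client-node list with a different submission than the one shown on the unrestricted list.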

Awards

There are 12 awards. We will have one each for overall score, metadata only, and bandwidth only for each of the Production and Research lists and their 10-client counterparts.
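The count of 12 follows from three award categories on each of four ranked lists. A minimal sketch that enumerates them is shown here; the label strings are informal and chosen for this example only.

```python
from itertools import product

# Informal labels for the four ranked lists and three award categories.
LISTS = [
    "Production",
    "Research",
    "10 Client-node Production",
    "10 Client-node Research",
]
CATEGORIES = ["overall score", "metadata only", "bandwidth only"]

# One award per (list, category) pair.
awards = [f"{lst} List: {cat}" for lst, cat in product(LISTS, CATEGORIES)]

print(len(awards))  # 4 lists x 3 categories = 12
```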

Production List Overview

To be eligible for the Production List, a submission must meet the definition of a “Production System”. The spirit of the Production System definition below is to ensure that entries on the Production List are systems used by scientists, quants, security teams, and data scientists over an extended period of time. Often this means the cluster has a batch scheduler that queues incoming Production Application execution requests. It also means that the duration of deployment is much longer than a few days, weeks, or months, and is typically measured in years.

A “Production System” is an IO500 submission that:

consists of a Compute System and Storage System that on a 'regular', 'frequent', and 'ongoing' basis executes Production Applications that generate Production Data
achieves the highest Reproducibility Score
has no single point of failure in its Storage System

Where the terms mentioned are defined as follows:

Definition 1: Storage System - The set of nodes and storage devices used by Production Applications to store Production Data and against which the IO500 benchmark suite is executed.

Definition 2: Compute System - The set of nodes that execute Production Applications and execute the IO500 benchmark. These nodes may overlap with those utilized by the storage system.

Definition 3: Production Application - An application that is executed on the Compute System during normal operation. This application MUST solve specific scientific or business problems and CANNOT be benchmarks, storage system software, or any other application whose purpose is purely motivated by computer science questions. Note that a build farm would count as a production application since it is using production data (i.e., code).

Definition 4: Production Data - The data stored in the Storage System during normal operation that is read or written by Production Applications. This data MUST have scientific and/or business value and CANNOT be a well-defined pattern (e.g., 0s, 1s, repeated hash) or algorithmically generated (e.g., random, a function without scientific/business value).

Definition 5: System Metadata - Any information tracked or stored regarding application or storage system execution. The point of defining System Metadata is to clarify that it is completely separate and not included as a type of Production Data since it is not directly generated by the Production Applications, but rather a set of information about the system and its behavior. Examples of System Metadata include data provenance, indexes, logs (from applications like Splunk, the Production Application, or the Compute System or Storage System), performance/operational metrics, etc.

Definition 6: Reproducibility Score - Quantifies the level of reproducibility of a submission. It is assigned by the IO500 Steering Committee to all IO500 submissions based upon the amount of information provided to enable others to reproduce the IO500 result. This information includes system metadata (e.g., number of compute nodes, storage device information), storage system configuration (e.g., RAID encoding, tuning parameters, find script) and answers to the reproducibility questionnaire on the details of how the benchmark was executed. For more information on the reproducibility score, please see the document, "IO500 Submission Transparency and Reproducibility Proposal".

Definition 7: Single Point of Failure - The Storage System must be able to withstand any single failure of any component in its architecture. Upon a failure, some amount of delay (on the order of single-digit minutes) is acceptable for the Storage System to recover and become consistent, but applications must not be disrupted and there must be no manual intervention. For example, a failure of a storage device, storage server, switch, or network cable while executing the IO500 benchmark must not disrupt the execution of the benchmark, which must still finish successfully (most likely with a lower score than if the failure had not occurred). This means that the storage system must continue operating without manual intervention upon the failure of any single storage device, server, NIC, etc. Note, however, that the failure of an entire rack via the top-of-rack switch is not considered a single point of failure, since it causes the failure of multiple storage servers. Note also that this requirement does not apply to the compute nodes on which the file system clients are deployed.

It is worth clarifying a few terms in the definition of a Production System:

regular - The system is (or will be) in place over an extended period of time, and frequent, large, unexplainable gaps (e.g., days or longer) between the executions of Production Applications are unacceptable. A system that executed a few production applications last week and is planning to run more next week would not be a Production System. Note that maintenance periods during which production runs are paused are expected.
frequent - Production Applications are continuously executed on the system, often using an automated scheduler and set of queues to stage the incoming application execution requests. Put another way, Production Applications must consume the vast majority of aggregate computational time on the compute system. This could include a few large and long running production application jobs, many smaller and short-lived production application jobs, or anything in between.
ongoing - The system is (or will be) executing Production Applications for the foreseeable lifetime of the machine. One time executions or short execution bursts would not qualify a system as production.
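The three criteria for a Production System can be read as a single conjunction. The sketch below is a hedged illustration only: eligibility is decided by the IO500 Steering Committee, not by a script, and the `Candidate` fields and the `is_production_system` function are hypothetical names. It assumes that the highest Reproducibility Score is the "Fully Reproducible" level.

```python
from dataclasses import dataclass

# Assumed name of the highest Reproducibility Score level.
HIGHEST_REPRODUCIBILITY_SCORE = "Fully Reproducible"


@dataclass(frozen=True)
class Candidate:
    # All field names are assumptions made for this sketch.
    regular: bool    # no frequent, large, unexplained gaps between runs
    frequent: bool   # Production Applications consume most compute time
    ongoing: bool    # expected to run them for the machine's lifetime
    reproducibility_score: str
    storage_single_point_of_failure: bool


def is_production_system(c: Candidate) -> bool:
    """Return True only if every Production System criterion holds."""
    runs_production_workload = c.regular and c.frequent and c.ongoing
    return (
        runs_production_workload
        and c.reproducibility_score == HIGHEST_REPRODUCIBILITY_SCORE
        and not c.storage_single_point_of_failure
    )
```

Failing any one criterion is enough to make a submission ineligible, for example a system that runs Production Applications continuously but scores "Proprietary" on reproducibility.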

The definition of Production System above includes cloud systems, but there are several critical differences of note:

The Compute System may be extremely dynamic, growing from 1 node to 1000s of nodes in minutes. This doesn’t change our assessment, as long as those same types of compute nodes are executing Production Applications.
The Storage System may dynamically grow and shrink, and may even be shut down for short periods of time. The Storage System used in the submission does not need to be the identical cloud deployment used to execute Production Applications, but its configuration must accurately represent the deployment configuration (e.g., VMs, storage, load-balancers) and size or shape of the storage system that has been or is currently used to execute Production Applications. For example, if a storage system reached 1PB with 32 storage server VMs while executing Production Applications, then the submitted Storage System could not be larger than 1PB, use more than 32 storage server VMs, or change the storage server VM/storage configuration.
Many HPC cloud deployments are for burst use cases, where the Production System will vary in size and shape depending on the compute/storage resources required to augment the on-premises system. This continues to fit the above definition, since burst is not a one-time activity of running Production Applications but rather a continuous activity that is by its very nature bursty.
To obtain the highest Reproducibility Score, any cloud-based submission must list all of the specific cloud vendor’s compute/storage/networking offerings utilized so that anyone from the community could reproduce the IO500 results exactly assuming they could obtain the exact same storage system software.
The Institution in the IO500 submission must be the institution that is running the Production Applications, and not a cloud, storage, or any other type of vendor. Vendors may support the submission of an institution or, with their consent, submit on their behalf.
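The sizing rule for cloud Storage Systems (no larger than, and configured the same as, what has been used for Production Applications) can be sketched as a simple comparison. The `CloudStorageDeployment` class, its fields, the `is_representative` function, and the VM configuration label are hypothetical; the 1PB and 32 VM figures come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudStorageDeployment:
    # All field names are assumptions made for this sketch.
    capacity_pb: float  # storage capacity in petabytes
    server_vms: int     # number of storage server VMs
    vm_config: str      # label for the storage server VM/storage configuration


def is_representative(submitted: CloudStorageDeployment,
                      production_peak: CloudStorageDeployment) -> bool:
    """Check that the submitted deployment does not exceed the largest
    deployment that has actually executed Production Applications."""
    return (
        submitted.capacity_pb <= production_peak.capacity_pb
        and submitted.server_vms <= production_peak.server_vms
        and submitted.vm_config == production_peak.vm_config
    )


# Largest deployment used in production, per the example above.
peak = CloudStorageDeployment(capacity_pb=1.0, server_vms=32, vm_config="example-vm")

print(is_representative(CloudStorageDeployment(0.5, 16, "example-vm"), peak))
print(is_representative(CloudStorageDeployment(2.0, 32, "example-vm"), peak))
```

The first call describes a smaller deployment with the same configuration and passes the check; the second exceeds the 1PB capacity reached in production and does not.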

For burst buffers to be considered a Production System, they must also meet the requirements listed above on failure and reproducibility, and must be used on a 'regular', 'frequent', and 'ongoing' basis for Production Applications that generate Production Data. For example, consider a compute system that executes Production Applications with a Lustre-based storage system, but whose compute nodes also include NVMe devices, with a software storage layer that enables the execution of the IO500 benchmarks against these compute node NVMe devices. Even if this software layer meets the reproducibility and failure requirements for the Production List, if this layer is not used by the majority of Production Applications, any submission using this storage software layer would not be eligible for the Production List (but a submission with the Lustre file system may be eligible for the Production List).

Reproducibility Overview

Based on the amount and quality of the information provided, each submission is assigned a score that will be published on the IO500 webpage. Submitters will be encouraged to state their target score, so that the committee can clarify any discrepancies prior to publication.

The initial scoring system will have these levels:

Undefined - This is the lowest level. The submission has missing or limited system metadata regarding the clients and/or servers, for example only client metadata but nothing about the server.
Limited - This represents the typical system on the IO500 list as of SC21, where much of the client and server system metadata has been provided (although this will be expanded as part of this proposal) but the questionnaire provides an insufficient level of information or is missing.
Proprietary - This represents submissions that provide all the required metadata and a detailed questionnaire, but the submitted system is not open-source or commercially available.
Fully Reproducible - The highest level. This represents submissions that provide all the required metadata and a detailed questionnaire, and whose system is widely available to anyone without restrictions imposed by the provider. Software availability is typically via open source, a free download, or a commercial license. Hardware is commercially available, or the hardware design has been open-sourced or externally published.
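The four levels form a ladder in which each level adds one requirement to the one below it. The following sketch shows that ordering under a deliberate simplification: each requirement is reduced to a yes/no flag, whereas the actual score is assigned by the IO500 Steering Committee based on the amount and quality of information provided. The function and parameter names are hypothetical.

```python
def reproducibility_level(metadata_complete: bool,
                          questionnaire_detailed: bool,
                          widely_available: bool) -> str:
    """Map three simplified yes/no checks to a reproducibility level.

    This is an illustrative approximation, not the committee's procedure.
    """
    if not metadata_complete:
        # Missing or limited client and/or server system metadata.
        return "Undefined"
    if not questionnaire_detailed:
        # Metadata provided, but the questionnaire is missing or insufficient.
        return "Limited"
    if not widely_available:
        # Fully documented, but the system is not available to others.
        return "Proprietary"
    return "Fully Reproducible"


print(reproducibility_level(True, True, False))
```

With complete metadata and a detailed questionnaire but a system that others cannot obtain, the sketch returns "Proprietary".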

You can check out the sample reproducibility questionnaire while preparing your submission.
