Introduction
It is proposed to create a Data Processing Center with a total capacity of at least 2,000 PFlops. The data center will be built on the IRONBYTE architecture for the distributed launch and management of AI computing tasks.
The proposed architectural approach relaxes the high-availability requirements for individual components within the data center, focusing instead on the high availability of the core system, which orchestrates all data center tasks and data storage.
The primary building block of the Data Center is the high-performance computing node, designated the IRONBYTE RIG. In this design, the failure of an IRONBYTE RIG, regardless of cause, does not impact the overall computational processes within the data center. The orchestrator detects the failure of a specific node (or group of nodes) and reassigns the tasks previously executed by that node to another node within the Data Center. Because tasks are processed in batch mode, the final result remains unaffected; only a slight increase in total execution time is possible.
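The failover behaviour described above can be sketched as a simple re-queueing loop. The sketch below is purely illustrative: the `Orchestrator` class, its methods, and the node/task identifiers are hypothetical and do not represent the actual IRONBYTE API.

```python
# Minimal sketch of batch-task failover: when a node fails, its tasks
# are re-queued and picked up by another healthy node. All names here
# are illustrative, not the real IRONBYTE interface; capacity limits
# and data locality are deliberately omitted.

from collections import deque

class Orchestrator:
    def __init__(self, node_ids):
        self.healthy = set(node_ids)   # nodes currently alive
        self.queue = deque()           # tasks waiting for a node
        self.running = {}              # task_id -> node_id

    def submit(self, task_id):
        self.queue.append(task_id)
        self.dispatch()

    def dispatch(self):
        # Assign queued tasks to any healthy node (no capacity model).
        while self.queue and self.healthy:
            task = self.queue.popleft()
            self.running[task] = next(iter(self.healthy))

    def on_node_failure(self, node_id):
        # Batch semantics: the final result is unaffected; only the
        # total execution time may grow slightly.
        self.healthy.discard(node_id)
        lost = [t for t, n in self.running.items() if n == node_id]
        for task in lost:
            del self.running[task]
            self.queue.append(task)
        self.dispatch()
```

Because the workloads are batch jobs, re-queueing a lost task is sufficient; no failover of in-flight state is required.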
Data center scaling
Expansion of the data center with nodes of the same type is not subject to any limitations. Once the data center exceeds 5,000 nodes, the number of master nodes will need to be increased so that task orchestration remains effective. Adding next-generation compute nodes to the Data Center requires no changes to the existing architecture or software. Only some types of running tasks will need explicit instructions in order to utilise IRONBYTE RIGs of the new architecture.
Tasks that can be solved
- LLM (Large Language Model) training, fine-tuning, and ML (Machine Learning) workloads.
- Storage of models, datasets, and AI (Artificial Intelligence) software libraries (including all intermediate versions used in data center operations).
- Running (inference of) models, forming a pipeline of model inference, and scaling the pipeline depending on the load (requests).
Non-standard tasks
Tasks can be combined on an IRONBYTE RIG if the required computing resources are available. During a planned modernization of the Data Center, tasks can be pinned to specific compute nodes (e.g., those equipped with accelerators of the new architecture). Alternatively, legacy tasks can be deployed in a frozen software environment with fixed versions of NVIDIA drivers and libraries that are no longer supported (i.e., legacy code support). This feature is expected to be in demand for inference models that remain in operation for extended periods without requiring changes.
Main parameters and characteristics of the Data Center to be created
A high-performance compute node is a specialized 10U server equipped with graphics processing units (GPUs) and delivering a total of 730 teraflops (TFlops) of single-precision (FP32) floating-point performance. To reach 2,000 PFlops, 2,877 nodes are required (including a 5% reserve factor). The maximum energy consumption of a node is 43,362 kWh per year.
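The node count quoted above follows from simple arithmetic, which can be checked directly (all figures are taken from this section):

```python
import math

# Back-of-the-envelope check of the node count quoted above.
target_pflops = 2_000        # data center target, PFlops (FP32)
node_tflops = 730            # one IRONBYTE RIG, TFlops (FP32)
reserve = 0.05               # 5% reserve factor

raw_nodes = target_pflops * 1_000 / node_tflops   # ≈ 2,739.7 nodes
nodes = math.ceil(raw_nodes * (1 + reserve))
print(nodes)  # 2877
```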
Accordingly, the computing nodes will be provided with a minimum of 13 MW of power. Each node will have five independent power supply lines, but no uninterruptible power supply. In addition, the cooling system may consume up to 5 MW, depending on weather conditions.
The network topology of the Data Center allows straightforward Layer 2 solutions. Each computing node is connected to the Data Center via a single 10-Gbps port (without redundancy), giving a minimum of 3,000 network ports in total. The access switches will have 40-Gbps uplinks and will be organized in a ring topology. In addition, a dedicated technological network with a capacity of no more than 1 Gbit/s will be required for managing network devices, accessing server IPMI interfaces, and operating the engineering equipment within the data center.
The data storage system of the Data Center is designed so that tasks launched on compute nodes can access the required data with minimal network latency. This is achieved through a dedicated peer-to-peer network of IRONBYTE storage nodes. Data are distributed uniformly across the network with the required redundancy, which allows the orchestrator to select available compute nodes for a given task while taking into account the proximity of the required data to those nodes. Likewise, the data produced by a task are uploaded back to the IRONBYTE storage peer network and distributed across it, and a new task using these data will be launched close to the storage node that holds them.
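The locality-aware placement described above can be illustrated with a minimal selection function. Note that `hop_distance`, `place_task`, and the node identifiers are hypothetical stand-ins for illustration, not the real IRONBYTE scheduler interface:

```python
# Sketch of data-locality-aware placement: among the free compute
# nodes, prefer the one closest (by some network-distance metric) to
# the storage node holding the task's input data. Illustrative only.

def place_task(data_node, free_nodes, hop_distance):
    """Return the free node nearest to data_node, or None if no node
    is free. hop_distance(a, b) -> network distance between nodes."""
    if not free_nodes:
        return None
    return min(free_nodes, key=lambda n: hop_distance(n, data_node))
```

With a toy metric such as `lambda a, b: abs(a - b)` over integer node IDs, `place_task(5, [1, 4, 9], ...)` selects node `4`, the nearest free node to the data.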
The Data Center management core, which includes the IRONBYTE storage nodes, will comprise approximately 2% of the total number of compute nodes and can be scaled as the data center grows. The management core is designed to handle a minimum of 30,000 simultaneous tasks.
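Sizing the management core at roughly 2% of the compute fleet gives, for the 2,877-node configuration above, on the order of a few dozen core nodes (a back-of-the-envelope check using figures from this section):

```python
import math

# Management core sizing: ~2% of the compute fleet.
compute_nodes = 2877         # node count from the sizing above
mgmt_fraction = 0.02         # management core share of the fleet
mgmt_nodes = math.ceil(compute_nodes * mgmt_fraction)
print(mgmt_nodes)  # 58
```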
The cost of engineering support, including the cooling system, power supply, and power backup (for the management core), does not exceed 10% of the cost of the data center's computing nodes.
Comparison with alternative solutions
| Parameter / Type | IRONBYTE RIG | NVIDIA server 8×A100 | NVIDIA server 8×H200 |
|---|---|---|---|
| TFlops (FP32) | 730 | 156 | 536 |
| TFlops (FP8) | 6,600 | 4,992 | 32,000 |
| GPU RAM (GB) | 240 | 640 | 1,128 |
| Cost | $40,000 | $160,000 | $400,000 |
| IRONBYTE RIG efficiency factor in synthetic load | 100% | 1,900% | 1,400% |
| 2,000-PFlops cloud (synthetic load) | $115,000,000 | $2,153,000,000 | $1,567,000,000 |
| Practical effectiveness of IRONBYTE RIG in synthetic load | — | Over 40% | Over 60% |
| IRONBYTE RIG equivalent, accounting for common LLM tasks | $115,000,000 | $2,153,000,000 / $1,292,000,000 | $1,567,000,000 / $635,000,000 |
Efficiency summary
- The proposed solution is oriented primarily toward FP32/FP16 computations, which currently have the greatest practical value for ML/LLM workloads;
- the limited GPU memory of the RIG (relevant for larger models) is addressed by IRONBYTE's mathematical techniques, which have shown minimal impact on training quality, particularly during the pre-training phase;
- the lower speed of inter-node exchange compared with NVLINK is compensated by partitioning (parallelizing) training (pre-training) tasks: the dataset is split between compute nodes by correlation, and the model layers obtained on different nodes are subsequently combined;
- on tasks of the alternative type, efficiency in terms of the price/quality ratio exceeds 1,000%;
- on the tasks in which competitors specialize, efficiency is 40% and 60% respectively;
- when comparing data centers identical in all parameters, the price/quality advantage is more than 2×.