Amazon FSx for Lustre: fully managed shared storage built on the world's most popular high-performance file system.

Sandaru Fernando

1. Introduction

1.1 What is Amazon FSx for Lustre?

Amazon FSx for Lustre is a fully managed, high-performance file system designed to handle compute-intensive workloads. The "FSx" refers to "fully managed file system," and Amazon currently offers FSx services for several widely used file systems, including the open-source Lustre file system. FSx for Lustre provides rapid data processing, scalability, and cost-effectiveness, making it ideal for applications such as machine learning, high-performance computing (HPC), video processing, and financial analytics.

1.2 What is the Lustre File System?

The name "Lustre" is derived from a combination of "Linux" and "cluster," reflecting its nature as a parallel and distributed file system. It is primarily used for large-scale cluster computing and has been the preferred file system for some of the world's fastest supercomputers. As of November 2022, at least five of the top 10 fastest supercomputers, including Frontier (the world's number one supercomputer at the time), used Lustre.

Lustre is a popular choice among supercomputers, large-scale data centers, simulators, and high-performance computing (HPC) organizations due to its exceptional scalability. It can efficiently manage clusters with tens of thousands of nodes, store dozens of petabytes of data across hundreds of servers, and deliver an aggregate throughput exceeding a terabyte per second (TB/s), making it ideal for demanding workloads.

1.3 Key Features of Amazon FSx for Lustre

High-Performance Storage

FSx for Lustre delivers sub-millisecond latencies and can scale to hundreds of gigabytes per second of throughput, supporting millions of IOPS. This performance is crucial for applications that require rapid data processing and low-latency access.

Seamless Integration with Amazon S3

FSx for Lustre can be linked to Amazon S3 buckets, allowing users to access and process S3 data as a high-performance file system. This integration enables efficient data movement between S3 and FSx for Lustre, facilitating workflows that involve large datasets.

Flexible Deployment Options

Depending on workload requirements, users can choose between scratch and persistent file systems. Scratch file systems are suited for temporary storage and short-term data processing, while persistent file systems are designed for long-term storage and provide data replication for durability.

Multiple Storage Options

FSx for Lustre offers both SSD and HDD storage options. SSD storage is ideal for latency-sensitive, IOPS-intensive workloads, whereas HDD storage is cost-effective for throughput-intensive tasks that are less sensitive to latency. Additionally, HDD-based file systems can be provisioned with an SSD cache to enhance performance for frequently accessed files.

Security and Compliance

All data stored in FSx for Lustre is encrypted at rest, and encryption in transit is available in select regions. The service complies with various global and industry security standards, including PCI DSS, ISO, and SOC certifications, and is HIPAA eligible.

2. How Does Amazon FSx for Lustre Work?

Amazon FSx for Lustre is a high-performance file system designed to scale efficiently across multiple disks and file servers. Built on Lustre, it provides scale-out performance, where throughput and capacity increase linearly as the file system expands. This architecture allows FSx for Lustre to deliver fast and concurrent data access to multiple clients while overcoming the bottlenecks associated with traditional file systems.

2.1 Scalability and Architecture

Amazon FSx for Lustre horizontally scales by distributing data across multiple file servers and disks. Each client has direct access to the entire data set, enabling high-speed parallel processing without congestion. This makes FSx for Lustre particularly suitable for compute-intensive workloads that require rapid access to large datasets.

Each FSx for Lustre file system consists of:

File Servers – The primary interface through which clients interact with the file system.
Disks (Storage Nodes) – Connected to file servers, responsible for storing data.
In-Memory Cache – Speeds up access to frequently used data by reducing disk reads.

2.2 Optimized Performance with Caching

To enhance read performance, Amazon FSx for Lustre employs a fast, in-memory cache that reduces latency and improves throughput for frequently accessed data. Additionally, HDD-based file systems can be configured with an SSD-based read cache, ensuring that commonly accessed data is retrieved quickly without requiring direct disk reads.

When a client requests data:

If the data is stored in the SSD cache or in-memory cache, the file server retrieves it instantly, reducing latency.
If the data is not cached, it must be fetched from disk, leading to higher latency due to lower disk throughput and network bandwidth limitations.

2.3 Performance Considerations

The overall performance of FSx for Lustre depends on network throughput and disk access speeds.

Reading from cache (SSD or in-memory) – Performance is primarily limited by network bandwidth.
Reading from disk or writing new data – Performance is affected by both disk speed and network throughput.
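
The two cases above can be sketched as a small shell helper. This is a simplified model rather than AWS's own formula: the function name effective_read_mbps is hypothetical, the figures are placeholders supplied by the caller, and the only idea it encodes is that a cached read is bounded by the network while an uncached read is bounded by the slower of network and disk.

```shell
# Simplified model of the two read paths described above.
# Arguments: cached ("yes" or "no"), network limit in MBps, disk limit in MBps.
# All figures are supplied by the caller and are purely illustrative.
effective_read_mbps() {
  local cached="$1" network_mbps="$2" disk_mbps="$3"
  # A cached read never touches the disk, so only the network limit applies.
  # An uncached read is bounded by whichever of the two limits is lower.
  if [ "$cached" = "yes" ] || [ "$network_mbps" -le "$disk_mbps" ]; then
    echo "$network_mbps"
  else
    echo "$disk_mbps"
  fi
}

effective_read_mbps yes 1000 400   # cached read: prints 1000
effective_read_mbps no 1000 400    # read served from disk: prints 400
```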

3. Differences Between Scratch and Persistent Mode in FSx for Lustre

When using Amazon FSx for Lustre, organizations have the flexibility to choose between two deployment options: scratch and persistent. The choice between these two options primarily depends on the duration and type of data storage required for the workload. Both options offer different performance characteristics and use cases, allowing organizations to optimize for either temporary or long-term storage needs.

3.1 Scratch File Systems

Scratch file systems are designed for short-term, high-throughput workloads where data doesn’t need to be stored long-term. These systems are ideal for situations where fast, temporary storage is required during compute-heavy operations. Since scratch file systems do not replicate data, they are susceptible to data loss if a failure occurs, which is an important consideration for certain types of workloads.

The main advantage of scratch file systems lies in their performance. They can provide significantly higher throughput, bursting up to six times the baseline throughput of 200 MBps per TiB of storage capacity (1 TiB is roughly 1.1 TB). This makes them perfect for workloads that require a high degree of data processing in a short period of time, such as machine learning model training, video rendering, and scientific simulations.
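
To make the arithmetic concrete, here is a minimal sketch that applies the figures quoted above (200 MBps per TiB baseline, burst of up to six times that) to a capacity given in whole TiB. The function name scratch_throughput is hypothetical, and the result is only an estimate of the ceiling, not a measured or guaranteed rate.

```shell
# Estimate scratch file system throughput from capacity in whole TiB,
# using the figures quoted above: 200 MBps per TiB baseline, burst up to 6x.
scratch_throughput() {
  local capacity_tib="$1"
  local baseline_per_tib=200   # MBps per TiB
  local burst_factor=6
  local baseline=$(( capacity_tib * baseline_per_tib ))
  local burst=$(( baseline * burst_factor ))
  echo "baseline=${baseline}MBps burst=${burst}MBps"
}

scratch_throughput 2   # prints: baseline=400MBps burst=2400MBps
```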

Best Use Cases for Scratch File Systems

Temporary Data Storage

For workflows that require short-term storage, where data loss is not a critical issue, such as staging areas for data processing.

High-Performance Workloads

Suitable for compute-heavy applications that demand high throughput, like genomics research, data analytics, and rendering tasks.

Cost-Effective Storage

Scratch file systems provide an economical solution for workloads where data doesn’t need to persist beyond the life of the computation.

Although they provide excellent performance for short-term tasks, scratch file systems are not ideal for storing critical data long-term due to the lack of built-in data replication and backup capabilities.

3.2 Persistent File Systems

Persistent file systems, on the other hand, are designed for workloads that require long-term data storage with durability and availability. This storage option automatically replicates data within the AWS Availability Zone, ensuring that data remains accessible even if a failure occurs. Unlike scratch file systems, persistent file systems are designed to be highly available and fault-tolerant, with automatic replication ensuring data durability and minimizing downtime in case of server or hardware failures.

One of the key advantages of persistent file systems is their ability to provide long-term storage for critical data without the risk of data loss. In the event of a failure, persistent file systems are capable of quickly recovering lost data, ensuring that the workload can resume without major disruptions.

Best Use Cases for Persistent File Systems

Long-Term Storage

Ideal for workloads that need persistent storage over time, such as storing data generated by containerized applications, scientific research datasets, or large-scale simulations.

Data Lakes and Big Data Applications

For organizations that need to process massive volumes of data, persistent file systems can provide the necessary storage for high-performance computing workloads and data lakes in Amazon S3.

High-Performance Computing (HPC)

For applications that demand consistent availability and reliability, such as financial analytics, engineering simulations, or rendering.

Workloads Sensitive to Availability Disruptions

Persistent file systems are well-suited for environments that require continuous operation, such as mission-critical applications and 24/7 processing systems.

Persistent file systems offer a more robust solution compared to scratch systems for workloads that cannot tolerate data loss or significant downtime. Their replication and durability features make them a suitable choice for long-running, high-performance applications.

4. Getting Started with Amazon FSx for Lustre

Prerequisites

Before setting up Amazon FSx for Lustre, ensure the following prerequisites are met:

AWS Account - An active AWS account is required to use FSx for Lustre.
Amazon EC2 Instance - An Amazon EC2 instance running Linux is needed to interact with the Lustre file system. The EC2 instance should be launched in a VPC (Virtual Private Cloud).
IAM Permissions - The IAM role or user used must have sufficient permissions to create and manage FSx resources.

Step 1: Create an FSx for Lustre File System

To create a file system, begin by using the AWS Management Console, AWS CLI, or an SDK. Below is an example of how to create a file system using the AWS CLI:


aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-0123456789abcdef0 \
    --lustre-configuration DeploymentType=SCRATCH_2

In this example:

--storage-capacity specifies the storage capacity in GiB.
--subnet-ids links the file system to a specific subnet in a Virtual Private Cloud (VPC).
--lustre-configuration defines the deployment type (SCRATCH_2 is optimized for high-performance and temporary workloads). The PerUnitStorageThroughput setting applies to persistent deployment types, so it is not set here.
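
Not every storage capacity is accepted. As a hedged sketch, the helper below checks a requested size against the values the AWS documentation lists for SCRATCH_2 at the time of writing (1200 GiB, 2400 GiB, or a multiple of 2400 GiB). The function name valid_scratch2_capacity is hypothetical, and the rule should be confirmed against the current documentation before relying on it.

```shell
# Check a requested capacity (in GiB) against the sizes documented for
# SCRATCH_2: 1200, 2400, or any larger multiple of 2400.
# Assumption: this rule reflects AWS documentation and may change.
valid_scratch2_capacity() {
  local gib="$1"
  [ "$gib" -eq 1200 ] || { [ "$gib" -gt 0 ] && [ $(( gib % 2400 )) -eq 0 ]; }
}

for size in 1200 2400 3600 4800; do
  if valid_scratch2_capacity "$size"; then
    echo "$size GiB: ok"
  else
    echo "$size GiB: not a valid SCRATCH_2 size"
  fi
done
```

Running a check like this before calling create-file-system catches a bad size locally instead of waiting for the API to reject the request.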

Step 2: Connect to the File System from an EC2 Instance

Once the FSx for Lustre file system is created, it needs to be mounted on an EC2 instance. The following steps show how to connect an EC2 instance to the Lustre file system.

First, retrieve the DNS name of the file system:


aws fsx describe-file-systems --query "FileSystems[0].DNSName" --output text
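
Mounting only succeeds once the file system has finished creating, which can take several minutes. The sketch below shows one way to wait for that: a generic polling helper that reruns a command until it prints the wanted value. The helper name wait_for_state and the attempt and delay values are illustrative assumptions; the AWS call in the comment reads the Lifecycle field of the file system and needs valid credentials and a real file system ID.

```shell
# Run a command repeatedly until it prints the wanted value.
# Arguments: wanted value, max attempts, delay in seconds, then the command.
# Returns 0 on success and 1 if the attempts run out.
wait_for_state() {
  local wanted="$1" attempts="$2" delay="$3"
  shift 3
  local i=0 state
  while [ "$i" -lt "$attempts" ]; do
    state="$("$@")"
    if [ "$state" = "$wanted" ]; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Intended use (requires AWS credentials; fs-xxxxxxxx is a placeholder):
# wait_for_state AVAILABLE 60 30 \
#   aws fsx describe-file-systems --file-system-ids fs-xxxxxxxx \
#   --query "FileSystems[0].Lifecycle" --output text
```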

Mount the file system to the EC2 instance using the mount command:


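sudo mkdir -p /mnt/fsx
sudo mount -t lustre -o relatime,flock <file-system-dns-name>@tcp:/<mount-name> /mnt/fsx

In this example, replace <file-system-dns-name> with the DNS name returned by the describe-file-systems command and <mount-name> with the file system's mount name (the MountName value under LustreConfiguration in the same output). The Lustre client must already be installed on the instance.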

Step 3: Import Data from Amazon S3 (Optional)

Amazon FSx for Lustre can integrate with Amazon S3, allowing users to import data from S3 into the Lustre file system. To link an S3 bucket to the file system, use the following CLI command:


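aws fsx create-data-repository-association \
    --file-system-id fs-xxxxxxxx \
    --file-system-path /my-bucket \
    --data-repository-path s3://my-bucket \
    --batch-import-meta-data-on-create

This links the S3 bucket my-bucket to the /my-bucket directory of the FSx for Lustre file system. The --batch-import-meta-data-on-create option imports the bucket's file metadata when the association is created; the file contents are loaded into Lustre when a file is first accessed. Data repository associations are not available for every deployment type and Lustre version, so confirm in the FSx for Lustre documentation that the file system created earlier supports them.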
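
Because file contents are loaded on first access, the first read of each file is slower than later reads. The AWS documentation describes preloading files with the Lustre client's lfs hsm_restore command; the sketch below wraps that in a function. The function name preload_dir and the LFS override variable are assumptions added here so the function can be dry-run on a machine without a Lustre client.

```shell
# Tool used to request a restore; defaults to the Lustre client's lfs.
# Overridable so the function can be dry-run without Lustre installed.
LFS="${LFS:-lfs}"

# Ask Lustre to load the contents of every regular file under a directory.
# On a real mount this usually needs root privileges (for example via sudo).
preload_dir() {
  find "$1" -type f -print0 | xargs -0 -n 1 "$LFS" hsm_restore
}

# Dry run that only prints the commands it would issue:
# LFS=echo preload_dir /mnt/fsx
```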

Step 4: Write Data to the File System

With the file system mounted, users can now read and write data to the file system just like any other Linux file system. For example, creating a text file on the FSx for Lustre system:


echo "Hello, Amazon FSx for Lustre!" > /mnt/fsx/hello.txt
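
As a quick sanity check after mounting, the following sketch writes a small file, reads it back, and compares the contents. It works on any writable directory; the function name roundtrip_check is hypothetical, and passing this check says nothing about performance, only that reads and writes succeed.

```shell
# Write a small file into a directory, read it back, and compare.
# Returns 0 if the contents match, non-zero otherwise.
roundtrip_check() {
  local dir="$1"
  local file="$dir/roundtrip-$$.txt"   # $$ keeps the name unique per shell
  local expected="Hello, Amazon FSx for Lustre!"
  local actual
  echo "$expected" > "$file" || return 1
  actual="$(cat "$file")"
  rm -f "$file"                         # clean up the temporary file
  [ "$actual" = "$expected" ]
}

# Example: roundtrip_check /mnt/fsx && echo "read/write OK"
```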

Step 5: Unmount the File System

After finishing the tasks, unmount the file system from the EC2 instance:


sudo umount /mnt/fsx

Step 6: Delete the File System

Once the file system is no longer needed, it can be deleted to stop incurring costs. Use the following command to delete the FSx for Lustre file system:


aws fsx delete-file-system --file-system-id fs-xxxxxxxx

This command will remove the file system and associated resources.

For additional information, refer to the Amazon FSx for Lustre User Guide.

5. Use Cases

Amazon FSx for Lustre is ideal for a wide range of compute-intensive workloads that require high-speed, scalable, and low-latency file storage. It is particularly beneficial for applications that demand rapid access to large datasets and parallel processing across multiple compute instances. Here are a few common use cases of Amazon FSx for Lustre.

1. Machine Learning

Machine learning (ML) models rely on vast amounts of training data, which must be processed simultaneously by multiple compute instances. FSx for Lustre provides a shared, high-throughput, and low-latency file storage system that accelerates data access and speeds up training tasks. The service integrates seamlessly with Amazon SageMaker, enabling organizations to streamline ML workflows and reduce model training time.

2. High-Performance Computing (HPC)

High-Performance Computing (HPC) is used in scientific research and engineering to solve complex computational problems. Workloads such as genome sequencing, fluid dynamics, weather modeling, and oil and gas exploration require massive datasets and must be processed efficiently. FSx for Lustre optimizes cost and performance for HPC applications by providing parallel file access across thousands of Amazon EC2 instances. It integrates with AWS Batch and AWS ParallelCluster, simplifying deployment for large-scale scientific workloads.

3. Media Processing and Transcoding

Media workloads, including visual effects (VFX), video rendering, and media production, require storage solutions capable of handling large digital files with minimal latency. FSx for Lustre provides fast read/write speeds to accelerate rendering times, streamline production workflows, and enhance real-time video editing. The scalability of FSx for Lustre ensures that large media assets can be processed efficiently across multiple compute nodes.

4. Autonomous Vehicles

The development of autonomous vehicles relies on AI models trained on massive datasets collected from vehicle sensors and cameras. These workloads require large-scale simulations to ensure vehicle safety and performance. FSx for Lustre enables organizations to concurrently access terabytes of sensor data from thousands of high-performance compute nodes, significantly accelerating model development and real-time simulation testing.

5. Big Data and Financial Analytics

Big data applications, such as financial modeling, risk analysis, and fraud detection, require high-performance storage to process and analyze large datasets efficiently. FSx for Lustre provides cost-optimized, high-throughput data processing, making it well-suited for industries like banking, insurance, and investment firms that rely on real-time analytics to drive decision-making.

6. Electronic Design Automation

EDA workloads involve simulating chip performance and identifying potential failures before production. These simulations require low-latency access to large design files and computational resources to run complex modeling tests. FSx for Lustre delivers the scalability, flexibility, and performance needed to accelerate the design process, reduce time-to-market, and improve the efficiency of semiconductor manufacturing.

6. Pricing

Amazon FSx for Lustre offers flexible pricing based on several factors, including storage capacity, throughput, and data transfer. Users can choose between different deployment options, such as scratch and persistent storage, each designed for specific workloads and cost structures.

Scratch File Systems

Optimized for short-term, high-speed processing, scratch file systems do not offer automatic backups and are ideal for temporary workloads. Pricing is based on the amount of storage provisioned.

Persistent File Systems

Designed for long-term storage, persistent file systems provide high durability and automatic backups. Pricing depends on the selected storage capacity and performance tier.

Additional costs may apply for:

Data Transfer

Charges vary based on data movement between AWS services and regions.

Backups

Automatically generated backups are charged separately based on the amount of stored data.

Throughput Capacity

Higher throughput options incur additional costs, depending on the level of performance required.

Amazon FSx for Lustre pricing is designed to be cost-effective, allowing organizations to pay only for the resources they use while benefiting from the high performance and scalability of the Lustre file system. For the latest pricing details, refer to the official AWS FSx for Lustre pricing page.

7. Conclusion

Amazon FSx for Lustre is a fully managed, high-performance file system that brings the power of the open-source Lustre file system to the cloud, enabling scalable, low-latency, and high-throughput storage for compute-intensive workloads. With deployment options in scratch and persistent modes, seamless integration with AWS services, and optimized caching mechanisms, FSx for Lustre accelerates data processing for applications like machine learning, high-performance computing (HPC), media processing, financial analytics, and autonomous vehicle simulations. By leveraging its parallel file system architecture, scale-out design, and cost-efficient pricing, organizations can efficiently handle large-scale workloads while ensuring high availability and performance.