What Cloud Has to Offer to Life Science: Storage


Introduction

In a previous post, we discussed how serverless computing can help life science researchers analyze their data more efficiently and cost-effectively, and deploy their tools in a user-friendly way at lower cost.

In this post, we discuss a different cloud service that can be useful for life scientists: storage. Scientific research typically generates a considerable amount of experimental data, which contains valuable information. Storing this data in a maintainable and, more importantly, shareable manner can pose a significant infrastructure challenge. Public clouds like AWS provide exciting options for such use cases. One example is AWS S3 Glacier, which is optimized for data that is retrieved infrequently or can tolerate retrieval latency, as is the case for many life science datasets.

Overview of the Storage Problem in Life Sciences

Traditionally, researchers run their computational workloads on personal computers or High-Performance Computing (HPC) clusters. Personal computers are clearly limited in both storage capacity and shareability. HPC clusters, which are data centers designed specifically for scientific workloads, handle the storage capacity problem much better. However, they still fall short on shareability, especially those that do not provide a file-sharing service.

How Cloud can help: AWS S3 Glacier as a solution

Public clouds offer almost unlimited storage capacity and a very high level of shareability, since their data centers worldwide provide access to your files with millisecond latency. For example, AWS offers the Simple Storage Service (S3) for file storage and sharing. To use this service, you first create a bucket (think of it as a directory) and put your files (called objects in this context) in it. But here is the exciting part! AWS offers different storage classes that you can choose for your objects, each optimized for a different use case. The table below compares some of these classes:

| Storage Type | S3 Standard | S3 Glacier Instant Retrieval | S3 Glacier Deep Archive |
| --- | --- | --- | --- |
| Storage cost for 1 TB/month | 23.55 USD | 4.10 USD | 1.03 USD |
| First-byte latency | milliseconds | milliseconds | hours |
| Retrieval charge | N/A | 0.03 USD/GB | 0.02 USD/GB |
| Minimum storage duration charge | N/A | 90 days | 180 days |

Note: Data transfer (from S3 to the Internet) costs roughly 0.05 to 0.09 USD per GB!
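To make the table concrete, here is a minimal Python sketch that estimates a monthly bill from the figures above. The names (`StorageClass`, `CLASSES`, `monthly_cost`) are hypothetical, the transfer price is assumed to sit at the upper end of the range in the note (0.09 USD per GB), and every downloaded GB is assumed to leave AWS over the Internet. Real prices vary by region and change over time, so treat the output as a rough estimate only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    name: str
    storage_usd_per_tb_month: float  # from the comparison table
    retrieval_usd_per_gb: float      # "N/A" in the table is modeled as 0.0


# Figures copied from the table; keys are illustrative labels.
CLASSES = {
    "STANDARD": StorageClass("S3 Standard", 23.55, 0.0),
    "GLACIER_IR": StorageClass("S3 Glacier Instant Retrieval", 4.10, 0.03),
    "DEEP_ARCHIVE": StorageClass("S3 Glacier Deep Archive", 1.03, 0.02),
}

# Assumption: upper end of the 0.05 to 0.09 USD/GB transfer range.
TRANSFER_USD_PER_GB = 0.09


def monthly_cost(class_key: str, stored_tb: float, downloaded_gb: float) -> float:
    """Estimate one month of storage + retrieval + Internet transfer, in USD."""
    c = CLASSES[class_key]
    storage = c.storage_usd_per_tb_month * stored_tb
    retrieval = c.retrieval_usd_per_gb * downloaded_gb
    transfer = TRANSFER_USD_PER_GB * downloaded_gb
    return round(storage + retrieval + transfer, 2)


if __name__ == "__main__":
    for key in CLASSES:
        print(key, monthly_cost(key, stored_tb=10, downloaded_gb=100))
```

Under these assumptions, storing 10 TB and downloading 100 GB in a month comes to about 244.50 USD on S3 Standard, 53.00 USD on Glacier Instant Retrieval, and 21.30 USD on Glacier Deep Archive. The transfer charge is the same for all three classes, which is why it becomes the dominant term for download-heavy workloads.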

Here we describe two scenarios in which AWS S3 is a good fit for your research project:

1. When shareability matters!

Imagine you are collaborating on a project with researchers from other universities, often in different countries. In such cases, you need to share both the raw data and the results of your experiments with your colleagues, which can be very challenging if you are locked into an HPC cluster that can only be accessed via your university network. Here, AWS S3 Glacier Instant Retrieval can be a good option, because the files can be huge but are downloaded only a limited number of times.
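As a rough illustration of this scenario, the sketch below uploads a file with the Glacier Instant Retrieval storage class and creates a time-limited download link for a collaborator. It assumes the third-party `boto3` library is installed and AWS credentials are already configured. The helper names (`upload_args`, `upload_and_share`) and the bucket, key, and path values are hypothetical, and presigned URLs are just one possible sharing mechanism, not one prescribed by this post.

```python
# Storage class identifiers accepted by the S3 API for the classes in the table.
ALLOWED_CLASSES = {"STANDARD", "GLACIER_IR", "DEEP_ARCHIVE"}


def upload_args(storage_class: str = "GLACIER_IR") -> dict:
    """Build the extra upload arguments that select a storage class."""
    if storage_class not in ALLOWED_CLASSES:
        raise ValueError(f"unsupported storage class: {storage_class}")
    return {"StorageClass": storage_class}


def upload_and_share(path: str, bucket: str, key: str, expires_s: int = 3600) -> str:
    """Upload a local file, then return a temporary download URL for it."""
    import boto3  # imported lazily so upload_args() works without boto3

    s3 = boto3.client("s3")
    # Store the object directly in Glacier Instant Retrieval.
    s3.upload_file(path, bucket, key, ExtraArgs=upload_args("GLACIER_IR"))
    # Anyone holding this URL can download the object until it expires.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )


if __name__ == "__main__":
    # Hypothetical names; replace with your own bucket, key, and file.
    url = upload_and_share("results.tar.gz", "my-lab-bucket", "exp42/results.tar.gz")
    print(url)
```

A link like this lets a colleague at another institution download the data without an AWS account or access to your university network. Each download still incurs the retrieval and transfer charges listed in the table.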

2. Large files that are rarely retrieved but need to be stored for a long time.

It’s common to generate massive amounts of data from your experiments. You cannot simply delete this data to make space for other experiments, since it may be needed later, for example during your paper review. In this case, AWS S3 Glacier Deep Archive is usually the best choice, offering the lowest storage price when retrieval latency is not a concern!
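One caveat is the minimum storage duration charge from the table: data deleted early is still billed for the minimum period. The sketch below, with hypothetical names (`PRICES`, `archive_cost`, `cheapest`) and a simplifying assumption of 30-day months, compares storage-only costs over a retention period using the table's figures. It ignores retrieval and transfer charges.

```python
DAYS_PER_MONTH = 30  # simplifying assumption for this estimate

# name: (USD per TB per month, minimum billed storage days), from the table
PRICES = {
    "S3 Standard": (23.55, 0),
    "S3 Glacier Instant Retrieval": (4.10, 90),
    "S3 Glacier Deep Archive": (1.03, 180),
}


def archive_cost(name: str, stored_tb: float, kept_days: int) -> float:
    """Storage-only cost in USD, billing at least the minimum duration."""
    price, min_days = PRICES[name]
    billed_days = max(kept_days, min_days)
    return round(price * stored_tb * billed_days / DAYS_PER_MONTH, 2)


def cheapest(stored_tb: float, kept_days: int) -> str:
    """Return the class with the lowest storage-only cost for this period."""
    return min(PRICES, key=lambda n: archive_cost(n, stored_tb, kept_days))


if __name__ == "__main__":
    for days in (7, 30, 365):
        print(days, "days:", cheapest(stored_tb=5, kept_days=days))
```

Under these assumptions, 5 TB kept for only a week is cheapest on S3 Standard, because Deep Archive still bills the full 180 days. From about a month onward, Deep Archive becomes the cheapest option, which matches the long-term archiving scenario described here.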

Note: As we showed in the previous section, data transfer costs can be high in AWS S3, so for use cases with intense download requirements, you may want to consider free transfer options like LuxProvide.

Conclusion

In conclusion, the advent of cloud storage solutions such as AWS S3 Glacier has made it much easier for life scientists to store and share vast amounts of data generated from their research work. Depending on data shareability and retrieval frequency requirements, different AWS storage classes can be utilized to achieve cost efficiency. While it’s important to consider data transfer costs when choosing a cloud storage provider, the flexibility, global availability, and robustness of these services make them a compelling option for life science research data storage needs.

What is the benefit of using cloud storage for life science researchers?
Cloud storage, such as AWS S3 Glacier, provides almost unlimited storage capacity and high shareability for data, making it easier for researchers to collaborate and access their data globally.
How can AWS S3 Glacier be useful for life science datasets?
AWS S3 Glacier is optimized for data that is retrieved infrequently or can tolerate retrieval latency, which aligns well with many life science datasets that do not require frequent access.
Which AWS S3 storage class is suitable for collaborative projects involving large files?
For collaborative projects where shareability is crucial and large files need to be shared, AWS S3 Glacier Instant Retrieval is a good option, because the files can be huge but are downloaded only a limited number of times.
What storage class should be used for storing large files with minimal retrieval needs over an extended period?
For storing large files that rarely need to be retrieved but must be stored for a long time, AWS S3 Glacier Deep Archive is the most cost-effective choice, offering the lowest price when retrieval latency is not a concern.