Introduction
In a previous post, we discussed how using serverless computing can assist life science researchers in analyzing their data in a more efficient and cost-effective manner. It also aids them in deploying their tools in a user-friendly way at lower costs.
In this post, we want to discuss a different cloud service, storage, that can be useful for life scientists. Scientific research typically generates a considerable amount of experimental data, which contains valuable information. Storing this data in a maintainable and, more importantly, shareable manner can pose a significant infrastructure challenge. Public clouds like AWS provide exciting options for such use cases. One example is AWS S3 Glacier, which is optimized for data that is retrieved infrequently or that can tolerate retrieval latency, as is the case for many life science datasets.
Overview of the Storage Problem in Life Sciences
Traditionally, researchers run their computational workloads on personal computers or High-Performance Computing (HPC) clusters. Personal computers are clearly limited in both storage capacity and shareability. HPC clusters, which are data centers designed specifically for scientific workloads, handle the storage capacity problem much better. However, they still fall short on shareability, especially those that do not provide a file-sharing service.
How Cloud can help: AWS S3 Glacier as a solution
Public clouds offer almost unlimited storage capacity and a high level of shareability, since their data centers worldwide provide millisecond-latency access to your files. For example, AWS has the Simple Storage Service (S3) for file storage and sharing. To use this service, you first create a bucket (think of it as a directory) and put your files (called objects in this context) in it. But here is the exciting part! AWS offers different storage classes that you can choose for your objects, each optimized for different use cases. The table below compares some of these classes:
| Storage class | S3 Standard | S3 Glacier Instant Retrieval | S3 Glacier Deep Archive |
| --- | --- | --- | --- |
| Storage cost per TB per month | 23.55 USD | 4.10 USD | 1.03 USD |
| First-byte latency | milliseconds | milliseconds | hours |
| Retrieval charge | N/A | 0.03 USD / GB | 0.02 USD / GB |
| Minimum storage duration | N/A | 90 days | 180 days |
Note: Data transfer out of S3 to the Internet costs roughly 0.05 to 0.09 USD per GB, on top of the storage and retrieval charges!
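To get a feel for how these numbers combine, here is a minimal Python sketch that estimates the total bill for a dataset. It only uses the figures from the table and the note above. The function name `estimate_cost`, the 30-day month, and the simplification that data deleted early is billed for the full minimum storage duration are our own assumptions for illustration, so check the current AWS pricing pages before budgeting.

```python
from dataclasses import dataclass

GB_PER_TB = 1024  # assumption: 1 TB = 1024 GB, matching the table's prices


@dataclass(frozen=True)
class StorageClass:
    name: str
    storage_usd_per_tb_month: float
    retrieval_usd_per_gb: float
    min_storage_days: int


# Figures copied from the comparison table; they change over time.
CLASSES = {
    "standard": StorageClass("S3 Standard", 23.55, 0.0, 0),
    "glacier_ir": StorageClass("S3 Glacier Instant Retrieval", 4.10, 0.03, 90),
    "deep_archive": StorageClass("S3 Glacier Deep Archive", 1.03, 0.02, 180),
}


def estimate_cost(storage_class, size_tb, months, downloads,
                  transfer_usd_per_gb=0.09):
    """Rough cost estimate in USD for storing and downloading a dataset.

    downloads: how many times the full dataset is downloaded to the Internet.
    transfer_usd_per_gb: upper end of the range quoted in the note (assumption).
    """
    # Simplification: data kept for less than the minimum duration is billed
    # as if it had been stored for the whole minimum (30-day months).
    billed_months = max(months, storage_class.min_storage_days / 30)
    storage = storage_class.storage_usd_per_tb_month * size_tb * billed_months

    downloaded_gb = size_tb * GB_PER_TB * downloads
    retrieval = storage_class.retrieval_usd_per_gb * downloaded_gb
    transfer = transfer_usd_per_gb * downloaded_gb

    return {
        "storage": storage,
        "retrieval": retrieval,
        "transfer": transfer,
        "total": storage + retrieval + transfer,
    }


# Example: compare the three classes for 2 TB kept one year, downloaded once.
for key, storage_class in CLASSES.items():
    cost = estimate_cost(storage_class, size_tb=2, months=12, downloads=1)
    print(f"{storage_class.name}: {cost['total']:.2f} USD")
```

Playing with the `downloads` argument shows how quickly transfer charges can outweigh the storage savings when a dataset is downloaded often.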
Here we describe two scenarios in which AWS S3 is a good fit for your research project:
1. When shareability matters!
Imagine you are collaborating on a project with researchers from other universities, often in different countries. In such cases, you need to share both the raw data and the results of your experiments with your colleagues, which can be very challenging if you are locked into an HPC cluster that can only be accessed via your university network. Here, AWS S3 Glacier Instant Retrieval can be a good option: the files can be huge, so the low storage price matters, while the number of downloads is limited, so retrieval and transfer charges stay small.
2. Large files that are rarely retrieved but need to be stored for a long time.
It’s common to generate massive amounts of data from your experiments. You cannot simply delete these data to make space for other experiments, since they may be needed later, for example during your paper review. In this case, AWS S3 Glacier Deep Archive is a strong choice: of the classes compared above, it has the lowest storage price, and its retrieval latency of hours is not a concern for this kind of data.
Note: As mentioned in the note under the comparison table, data transfer costs can be high in AWS S3, so for use cases with intense download requirements, you may want to consider free transfer options like LuxProvide.
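The two scenarios above could be sketched with boto3, the AWS SDK for Python, roughly as follows. This is an illustration rather than a complete workflow: the helper names, the bucket and key names, and the expiry and restore settings are placeholders of our own, and the code assumes you already have a bucket and configured AWS credentials. The boto3 calls themselves (`upload_file`, `generate_presigned_url`, `restore_object`) and the storage class identifiers `GLACIER_IR` and `DEEP_ARCHIVE` are part of the S3 API.

```python
def upload_with_storage_class(s3_client, path, bucket, key, storage_class):
    """Upload a local file and choose its storage class at upload time."""
    s3_client.upload_file(
        path, bucket, key,
        ExtraArgs={"StorageClass": storage_class},
    )


def make_share_link(s3_client, bucket, key, expires_in=24 * 3600):
    """Scenario 1: create a time-limited download link for a collaborator.

    The link works without an AWS account until it expires
    (expires_in is in seconds; one day here is an arbitrary choice).
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


def request_restore(s3_client, bucket, key, days=7, tier="Bulk"):
    """Scenario 2: ask S3 to make a Deep Archive object readable again.

    The restore takes hours; afterwards a temporary copy stays available
    for `days` days. "Bulk" is the cheaper, slower tier; "Standard" is faster.
    """
    s3_client.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={
            "Days": days,
            "GlacierJobParameters": {"Tier": tier},
        },
    )


def demo():
    """Hypothetical usage; call demo() once AWS credentials are configured."""
    import boto3  # third-party package: pip install boto3

    s3 = boto3.client("s3")
    bucket = "my-lab-bucket"  # placeholder bucket name

    # Scenario 1: shareable results in Glacier Instant Retrieval.
    upload_with_storage_class(
        s3, "results.tar.gz", bucket, "project-x/results.tar.gz", "GLACIER_IR")
    print(make_share_link(s3, bucket, "project-x/results.tar.gz"))

    # Scenario 2: raw data kept for the long term in Deep Archive.
    upload_with_storage_class(
        s3, "raw_reads.tar.gz", bucket, "project-x/raw_reads.tar.gz",
        "DEEP_ARCHIVE")
    request_restore(s3, bucket, "project-x/raw_reads.tar.gz")
```

Objects in Glacier Instant Retrieval can be downloaded directly, so only the Deep Archive object needs the restore step before it can be read.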
Conclusion
In conclusion, the advent of cloud storage solutions such as AWS S3 Glacier has made it much easier for life scientists to store and share vast amounts of data generated from their research work. Depending on data shareability and retrieval frequency requirements, different AWS storage classes can be utilized to achieve cost efficiency. While it’s important to consider data transfer costs when choosing a cloud storage provider, the flexibility, global availability, and robustness of these services make them a compelling option for life science research data storage needs.