Creative Solutions for Solugen's HPC Problems
Biotech Startup's Computing Struggles Allayed With AWS Batch
Migration, Modernization, Serverless
Solugen is a biotech startup that produces industrial chemicals from sources other than petroleum. Solugen was running large high-performance computing (HPC) workloads on a single EC2 instance using Rosetta, the RosettaCommons software suite for computational modeling and analysis of protein structures. A single job would take weeks to complete. They needed a solution that was more time- and resource-efficient.
Solugen is a biotech startup that produces high-performance chemicals from plant-derived feedstocks instead of petroleum. They design and grow enzymes that turn sugar into chemicals used in a wide variety of products and industrial applications.
Solugen uses Rosetta, the software suite developed by RosettaCommons, to model protein folding behavior. This analysis is core to their research, and that research is core to their business. Because the software is complex and computationally demanding, scaling the business with traditional methods would have meant considerable investment in individual on-premises or cloud machines, and the result still would not have scaled: every additional job would have required an additional machine. They needed a creative way to bring more computational power to bear on larger jobs only when it was needed, in a reasonable amount of time and at an affordable cost.
Why Solugen Chose AWS
AWS was a great fit for Solugen because of one service in particular: AWS Batch. AWS Batch provisions servers on the fly, so you can submit workloads and pay for them as you go rather than pre-provisioning compute infrastructure. A solution built on AWS Batch would let Solugen provision only the resources they need, when they need them, which would be considerably more affordable than paying for always-on infrastructure.
Why Solugen Chose Cloud303
Cloud303 was recommended to Solugen by AWS. At the time, Cloud303 did not have experience creating a bioinformatics pipeline like this, but nothing had ever been created to solve this particular problem, so AWS did not have a clear path either. What Cloud303 did have, however, was a reputation for coming up with creative, outside-the-box solutions for their clients at affordable costs. Cloud303’s reputation preceded them and both AWS and Solugen trusted that they would rise to the challenge.
Phil Supinski, CEO/Solutions Architect
Sujaiy Shivakumar, CTO/Solutions Architect
AWS Batch, Amazon ECS, VPC
Cloud303 chose AWS Batch for Solugen's HPC workload because the customer would pay only for the resources used.
To start, a templatized Rosetta environment was created using Docker containers so the jobs could seamlessly scale. Then, multiple compute environments were deployed (for testing as well as production jobs). To optimize costs, S3 buckets were used to house data. To give faster access to storage and ensure data did not leave Solugen’s VPC, VPC endpoints were created.
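As a rough illustration of the compute environments described above, the sketch below builds the request that boto3's `create_compute_environment` call accepts, once for testing and once for production. The environment names, vCPU limits, subnet, security group and instance role are placeholders rather than Solugen's actual values. Setting `minvCpus` to 0 is one way to let an environment scale to zero so that nothing is billed while the queue is empty.

```python
from typing import Dict, List


def compute_environment_request(
    name: str,
    max_vcpus: int,
    subnets: List[str],
    security_group_ids: List[str],
    instance_role: str,
) -> Dict:
    """Build keyword arguments for boto3's batch.create_compute_environment.

    All values passed in are illustrative placeholders.
    """
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",  # AWS Batch launches and terminates the instances
        "state": "ENABLED",
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,  # scale to zero when no jobs are queued
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],
            "subnets": subnets,
            "securityGroupIds": security_group_ids,
            "instanceRole": instance_role,
        },
    }


# Hypothetical test and production environments that differ only in size.
environments = [
    compute_environment_request(
        name=f"rosetta-{stage}",
        max_vcpus=max_vcpus,
        subnets=["subnet-EXAMPLE"],
        security_group_ids=["sg-EXAMPLE"],
        instance_role="ecsInstanceRole",
    )
    for stage, max_vcpus in (("test", 64), ("prod", 1024))
]
```

Each dictionary could then be passed to `boto3.client("batch").create_compute_environment(**request)`. Keeping the request builder free of network calls makes it easy to review the configuration before anything is created.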
One goal was to simplify Solugen’s experience as much as possible, so the data pipeline starts with the upload of an input file to S3. That file contains all the relevant instructions. The runtime will download the file, read the instructions and start the job based on those instructions. It is also possible to use the environment to spin up a single server (which picks up the job from S3, runs the job, then uploads the output artifact back to S3).
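The download, read and start steps can be sketched as follows. The case study does not describe the format of the input file, so this example assumes a JSON document with hypothetical keys (`executable`, `arguments`, `output_prefix`). The `fetch` callable stands in for the S3 download so the logic can be exercised without AWS access.

```python
import json
from typing import Callable, Dict, List, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs a bucket and a key: {uri}")
    return bucket, key


def plan_job(input_uri: str, fetch: Callable[[str, str], str]) -> Dict:
    """Download the instruction file and turn it into a command plus an output location.

    The JSON keys used here are assumptions made for this sketch.
    """
    bucket, key = split_s3_uri(input_uri)
    instructions = json.loads(fetch(bucket, key))
    command: List[str] = [instructions["executable"], *instructions.get("arguments", [])]
    return {
        "command": command,
        "output_uri": f"s3://{bucket}/{instructions['output_prefix']}",
    }


def fake_fetch(bucket: str, key: str) -> str:
    """Stand-in for the S3 download; returns a made-up instruction file."""
    return json.dumps(
        {
            "executable": "rosetta_scripts",
            "arguments": ["-nstruct", "10"],
            "output_prefix": "results/job-001/",
        }
    )


plan = plan_job("s3://example-bucket/inputs/job-001.json", fake_fetch)
```

In a real container, `fetch` might wrap `boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read().decode()`, and the resulting command could be handed to `subprocess.run`.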
The more elegant solution, however, and the one that truly changed Solugen’s workflow, was the parallel computing design. By using the OpenMPI framework in the runtime environment, AWS Batch could spin up multiple nodes to process a single job. Of the instances launched, one was assigned as the master and the rest acted as worker nodes. Each worker node would report its unique ID to the master, and once the master had enough nodes to run the submitted job, it would run the Rosetta script while OpenMPI managed the distribution of computation across the worker nodes.
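A minimal sketch of the master's bookkeeping is shown below. It assumes the cluster runs as an AWS Batch multi-node parallel job, which exposes the `AWS_BATCH_JOB_NODE_INDEX`, `AWS_BATCH_JOB_MAIN_NODE_INDEX` and `AWS_BATCH_JOB_NUM_NODES` environment variables inside each container. The case study does not say how workers registered, so the list of hosts, the slot count and the program name are all hypothetical.

```python
from typing import Iterable, List, Mapping


def node_role(env: Mapping[str, str]) -> str:
    """Return 'main' or 'worker' based on AWS Batch multi-node variables."""
    if env["AWS_BATCH_JOB_NODE_INDEX"] == env["AWS_BATCH_JOB_MAIN_NODE_INDEX"]:
        return "main"
    return "worker"


def cluster_ready(registered_hosts: Iterable[str], expected_nodes: int) -> bool:
    """True once enough distinct hosts have reported in."""
    return len(set(registered_hosts)) >= expected_nodes


def hostfile_text(hosts: Iterable[str], slots_per_host: int) -> str:
    """Render an OpenMPI hostfile with one 'host slots=N' line per distinct host."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in sorted(set(hosts)))


def mpirun_command(
    hostfile_path: str, hosts: Iterable[str], slots_per_host: int, program: List[str]
) -> List[str]:
    """Build the mpirun invocation that spreads `program` over every slot."""
    total_slots = len(set(hosts)) * slots_per_host
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(total_slots), *program]


# Hypothetical three-node cluster with four slots per node.
env = {
    "AWS_BATCH_JOB_NODE_INDEX": "0",
    "AWS_BATCH_JOB_MAIN_NODE_INDEX": "0",
    "AWS_BATCH_JOB_NUM_NODES": "3",
}
registered = ["10.0.1.12", "10.0.1.11", "10.0.1.10"]
ready = node_role(env) == "main" and cluster_ready(
    registered, int(env["AWS_BATCH_JOB_NUM_NODES"])
)
hostfile = hostfile_text(registered, slots_per_host=4)
command = mpirun_command("/tmp/hostfile", registered, 4, ["rosetta_scripts", "@flags"])
```

Deduplicating hosts before counting them guards against a worker that reports twice, which would otherwise let the job start before every node has joined.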
Building an ephemeral, distributed workload like this does pose one significant challenge compared to a single server: storage. To solve that, Cloud303 incorporated an EFS file system to serve as common storage. The EFS share was mounted on every worker node as if it were a local drive, so all artifacts produced by the cluster ended up in the same place when the nodes finished processing. The master node would then compile the artifacts into a deliverable and upload it to an S3 bucket.
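The final compile step could look something like the sketch below, which packs everything under the shared mount into one archive. The case study does not specify the deliverable's format, so the gzip tarball and the directory layout are assumptions made for illustration.

```python
import tarfile
from pathlib import Path
from typing import List


def bundle_artifacts(shared_dir: Path, archive_path: Path) -> List[str]:
    """Pack every file under shared_dir into a gzip tarball.

    Returns the archive member names, relative to shared_dir, in sorted order.
    """
    # List the files before creating the archive so the archive never includes itself.
    members = sorted(p for p in shared_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in members:
            archive.add(path, arcname=path.relative_to(shared_dir).as_posix())
    return [p.relative_to(shared_dir).as_posix() for p in members]
```

On the master node this might be called with the EFS mount point as `shared_dir`, followed by `boto3.client("s3").upload_file(str(archive_path), bucket, key)` to deliver the result.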
Prior to this solution, Solugen was running Rosetta on a single EC2 instance and jobs would take about two weeks to complete. With the massive parallelization the new solution enables, a job that once took two weeks now takes about two hours, roughly a 170-fold reduction in turnaround time. This has been an enormous benefit to their business. In September 2021, Solugen completed a US$357 million Series C round.
Partner Opportunity Acceleration Funding, Migration Acceleration Program (MAP)