AWS Data Lake, Machine Learning, and Infrastructure Optimization
About Phase Genomics
Phase Genomics is a life sciences innovation and biotechnology company specializing in genome assembly and analysis. They offer a comprehensive portfolio of laboratory and computational services and products based on the proximity ligation technique Hi-C. Their products include biochemistry kits for plant, animal, microbial, and human samples, as well as industry-leading genome and metagenome assembly and analysis SaaS. Based in Seattle, Washington, the company was founded in 2015 by a team of genome scientists, software engineers, and entrepreneurs. The company’s mission is to empower scientists and clinicians with state-of-the-art genomic tools that accelerate breakthroughs.
Phase Genomics has several products in different market niches, some already in market and others in development. One of the products in development applies their technology to cancer and other diseases in which structural rearrangement of a patient’s genome is implicated, with the eventual aim of developing it into a clinical diagnostic. In this process, Phase Genomics’ technology uses a biochemical reaction to fix genetic fragments that were physically proximate within the intact cell, essentially gluing together pieces of DNA that were touching each other inside the cell. These fixed DNA segments are read by a DNA sequencing machine and then compared to the human reference genome in a process called alignment.
The output of this process is data that reveals three-dimensional spatial relationships between chromosomes, including abnormal junctions where pieces of chromosomes unexpectedly touch. These abnormal junctions are medically significant because they show pathologists and other professionals where a patient’s genome might contain mutations or other aberrations that may account for their disease. For example, if two pieces touch frequently in a patient but not in the human reference genome, scientists may be able to identify specific genes in those regions that are incorrectly “on” or “off,” which in turn may support a clinically actionable diagnosis. Phase Genomics’ method is intended to replace existing, lower-capability processes for identifying genomic abnormalities, such as karyotyping, which is labor-intensive and antiquated, saving scientists and clinicians significant time and money.
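The contact data described above can be pictured as a binned count matrix, which is what a Hi-C heatmap displays. The sketch below is only an illustration of that idea, not Phase Genomics’ pipeline: the function names, the single genome-wide coordinate, the baseline matrix, and the fold-change threshold are all assumptions made for the example, and real analyses also normalize for sequencing depth and genomic distance.

```python
from typing import Iterable, List, Tuple


def contact_matrix(pairs: Iterable[Tuple[int, int]], bin_size: int, n_bins: int) -> List[List[int]]:
    """Count read pairs per (bin, bin) cell.

    Each pair holds the aligned positions of two DNA fragments that were
    fixed together; positions use a hypothetical genome-wide coordinate.
    """
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        if not (0 <= i < n_bins and 0 <= j < n_bins):
            continue  # ignore pairs that fall outside the binned range
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


def enriched_cells(sample, baseline, min_fold=3.0, pseudocount=1.0):
    """Return (i, j, fold) for off-diagonal cells where the sample has far
    more contacts than the baseline. The threshold is an arbitrary example."""
    hits = []
    n = len(sample)
    for i in range(n):
        for j in range(i + 1, n):
            fold = (sample[i][j] + pseudocount) / (baseline[i][j] + pseudocount)
            if fold >= min_fold:
                hits.append((i, j, fold))
    return hits
```

Cells returned by `enriched_cells` correspond to the kind of unexpected junctions that would be boxed on a heatmap for closer review.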
Phase Genomics’ process can be illustrated in figures called “heatmaps.” Shown below are two such heatmaps, which contrast a normal genome with an abnormal one. Abnormalities are identified by bounding boxes in the heatmap on the right (a leukemia sample). Phase Genomics data scientists are working to develop deep learning and other machine learning models that use these heatmaps to identify and predict abnormalities in future samples, including in clinical settings. To support the machine learning efforts, Phase Genomics needed assistance architecting and building a data lake and machine learning infrastructure on Amazon Web Services (AWS).
Why Amazon Web Services
Phase Genomics had previously built other cloud-based products on AWS, and this project began with a focus on infrastructure optimization and services evaluation for those products. This gave Phase Genomics the opportunity to do a deep dive into their existing architecture, evaluating what was working well and, in some places, re-architecting for future growth. 1Strategy’s guidance also helped Phase Genomics decide whether to commit to AWS or pursue a multi-cloud approach. Through the education that 1Strategy provided, the Phase Genomics team realized how many services were available to them on AWS and was impressed with the demos and POCs presented by 1Strategy.
“We had always been a little hesitant to commit to one Cloud provider, but after 1Strategy showed us the breadth and depth of services that have come online in the last few years, we decided it was finally worth it to go all-in on AWS,” says Shawn Sullivan, Chief Technology Officer, Phase Genomics. 1Strategy’s expertise in the AWS Well-Architected Framework made them well-suited to follow up on action items identified in Phase Genomics’ previous Well-Architected Review. The infrastructure optimization stage of the project included building an AWS Organization, re-evaluating IAM best practices, and setting up Service Control Policies, all of which improved Phase Genomics’ security and operational posture.
Phase Genomics was already using Amazon EC2 as their primary source of compute power for analysis jobs. Although Phase Genomics had built tools to automatically provision and decommission EC2 resources for these jobs, engineers would occasionally forget to terminate resources after a job was complete. To eliminate these unnecessary costs, 1Strategy built solutions that clean up leftover EC2 instances and EBS volumes using AWS Config, AWS Lambda, and Amazon CloudWatch. These solutions were deployed with AWS CloudFormation, enabling versioned infrastructure and easier deployments. They have led to noticeable cost savings which, for a startup, ultimately translate into more runway.
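A cleanup function of the kind described above might look roughly like the following Lambda handler. This is a minimal sketch, not 1Strategy’s implementation: the `analysis-job` tag key, the 24-hour cutoff, and the idea of running it on a schedule are assumptions for illustration, and pagination, dry runs, and the AWS Config integration are left out.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # hypothetical cutoff for a finished job
JOB_TAG = "analysis-job"       # hypothetical tag key marking job instances


def stale_instance_ids(reservations, now, max_age=MAX_AGE, tag_key=JOB_TAG):
    """Pick running, job-tagged instances older than max_age.

    `reservations` has the shape of the "Reservations" list returned by
    the EC2 DescribeInstances API.
    """
    ids = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            tag_keys = {tag["Key"] for tag in inst.get("Tags", [])}
            running = inst.get("State", {}).get("Name") == "running"
            if running and tag_key in tag_keys and now - inst["LaunchTime"] > max_age:
                ids.append(inst["InstanceId"])
    return ids


def unattached_volume_ids(volumes):
    """Pick EBS volumes in the 'available' state, i.e. attached to nothing."""
    return [vol["VolumeId"] for vol in volumes if vol.get("State") == "available"]


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    stale = stale_instance_ids(reservations, datetime.now(timezone.utc))
    if stale:
        ec2.terminate_instances(InstanceIds=stale)

    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    orphaned = unattached_volume_ids(volumes)
    for volume_id in orphaned:
        ec2.delete_volume(VolumeId=volume_id)

    return {"terminated": stale, "deleted_volumes": orphaned}
```

Keeping the selection logic in plain functions separates the decision about what counts as stale from the calls that actually delete resources, which makes the policy easy to review and test before it is allowed to terminate anything.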
Phase Genomics saw potential value in using CloudFormation to source-control and build their development environment, specifically a webserver architecture including an Application Load Balancer, Auto Scaling Group, and EC2 Launch Template. 1Strategy built CloudFormation templates for both the webserver architecture and a best-practices VPC configuration. These templates are expected to simplify some operational tasks and make the web platform more robust.
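To give a sense of what such a webserver template contains, the sketch below builds a small CloudFormation template as a Python dictionary and renders it to JSON. It is an assumption-laden outline rather than the template 1Strategy delivered: the logical names, instance type, port, and group sizes are placeholders, and security groups, HTTPS, and health checks are omitted.

```python
import json


def webserver_template():
    """Return a minimal ALB + Auto Scaling Group + Launch Template stack."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "VpcId": {"Type": "AWS::EC2::VPC::Id"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            "WebLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateData": {
                        "ImageId": {"Ref": "ImageId"},
                        "InstanceType": "t3.micro",  # placeholder size
                    }
                },
            },
            "WebTargetGroup": {
                "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
                "Properties": {"VpcId": {"Ref": "VpcId"}, "Port": 80, "Protocol": "HTTP"},
            },
            "WebLoadBalancer": {
                "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
                "Properties": {"Type": "application", "Subnets": {"Ref": "SubnetIds"}},
            },
            "WebListener": {
                "Type": "AWS::ElasticLoadBalancingV2::Listener",
                "Properties": {
                    "LoadBalancerArn": {"Ref": "WebLoadBalancer"},
                    "Port": 80,
                    "Protocol": "HTTP",
                    "DefaultActions": [
                        {"Type": "forward", "TargetGroupArn": {"Ref": "WebTargetGroup"}}
                    ],
                },
            },
            "WebAutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": "1",
                    "MaxSize": "3",  # placeholder bounds
                    "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                    "TargetGroupARNs": [{"Ref": "WebTargetGroup"}],
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "WebLaunchTemplate"},
                        "Version": {"Fn::GetAtt": ["WebLaunchTemplate", "LatestVersionNumber"]},
                    },
                },
            },
        },
    }


def render_template() -> str:
    """Serialize the template so it can be committed to source control."""
    return json.dumps(webserver_template(), indent=2)
```

The string returned by `render_template()` can be saved to a file, reviewed in a pull request like any other code change, and then deployed as a stack, which is the source-control benefit the paragraph above refers to.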
1Strategy created a cost analysis for the Phase Genomics data lake that examined object access patterns and several combinations of storage classes in Amazon S3: S3 Standard, S3 Standard-Infrequent Access, and S3 Glacier with Standard, Expedited, or Bulk retrieval. The evaluation indicated that Phase Genomics could save over $170,000 on annual storage costs. Additionally, 1Strategy provided Phase Genomics with a partitioning strategy in S3 that would support HIPAA compliance and allow them to query their data in real time using Amazon Athena.
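The arithmetic behind a storage-class comparison like this one can be sketched in a few lines. Everything numeric below is a placeholder: the per-GB rates are illustrative values that should be replaced with current AWS pricing for the relevant region, the data volumes are invented, and retrieval, request, and minimum-duration charges are ignored, so the result is unrelated to the $170,000 figure above.

```python
GB_PER_TB = 1024

# Placeholder monthly rates in USD per GB; NOT quoted from AWS pricing.
PRICE_PER_GB_MONTH = {
    "standard": 0.023,
    "infrequent_access": 0.0125,
    "glacier": 0.004,
}


def annual_storage_cost(tb_by_class, prices=PRICE_PER_GB_MONTH):
    """Yearly storage cost for a mapping of storage class -> terabytes."""
    return sum(
        terabytes * GB_PER_TB * prices[storage_class] * 12
        for storage_class, terabytes in tb_by_class.items()
    )


def annual_savings(current, proposed, prices=PRICE_PER_GB_MONTH):
    """Difference in yearly storage cost between two layouts of the same data."""
    return annual_storage_cost(current, prices) - annual_storage_cost(proposed, prices)


# Hypothetical layouts: everything in one class versus a tiered layout.
current_layout = {"standard": 100}
tiered_layout = {"standard": 10, "infrequent_access": 20, "glacier": 70}
estimated_savings = annual_savings(current_layout, tiered_layout)
```

In practice the split between classes would come from the object access patterns mentioned above: data read often stays in the most expensive class, and data that is rarely or never read moves to the cheaper ones.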
To support deep learning analysis of genomic data for their cancer diagnostic product, 1Strategy built a proof of concept showing how Phase Genomics could use Amazon SageMaker for analysis and machine learning. Phase Genomics data scientists are now using Amazon ECR and SageMaker notebooks to train and deploy models. According to Sullivan, “Building our machine learning products requires a lot of R&D computational time, so getting that done efficiently directly impacts how quickly we will be able to build and ship them. Our new AWS capabilities, particularly SageMaker, will be critical tools for us to get our products built, tested, and shipped at the quality level needed for human health.”
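One common way to combine ECR and SageMaker is to package training code as a container image in ECR and start a SageMaker training job that runs it. The sketch below assembles the request for the SageMaker `CreateTrainingJob` API under that assumption. The repository name, role ARN, S3 locations, instance type, and runtime limit are all supplied by the caller or are placeholders, and nothing here reflects Phase Genomics’ actual configuration.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of a container image stored in Amazon ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def training_job_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri,
                         instance_type="ml.p3.2xlarge", max_runtime_s=3600):
    """Assemble keyword arguments for SageMaker's CreateTrainingJob call.

    The instance type and runtime limit are example defaults only.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "training",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }


def submit(request):
    """Send the request to SageMaker; requires boto3 and AWS credentials."""
    import boto3

    return boto3.client("sagemaker").create_training_job(**request)
```

Building the request separately from submitting it lets the configuration be inspected or logged first, and the stopping condition caps how long a forgotten job can keep accruing compute cost.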
1Strategy is an AWS Partner Network (APN) Premier Consulting Partner, focusing exclusively on Amazon Web Services (AWS). 1Strategy helps businesses architect, migrate, and optimize their workloads on AWS, creating scalable, cost-effective, secure, and reliable solutions. 1Strategy also helps customers get real value from their data using comprehensive machine learning models and artificial intelligence. 1Strategy holds the AWS DevOps, Migration, Data & Analytics, Well-Architected, and Machine Learning Competencies, and is a partner of the AWS Public Sector Program. 1Strategy was one of the initial ten AWS Partners globally that were qualified and authorized by AWS to conduct a Well-Architected Review and is among the top Well-Architected partners in the AWS ecosystem. With experts who have deployed AWS solutions since 2007, 1Strategy is a leader in custom training, providing customers with the knowledge, tools, and best practices to manage those solutions over time. 1Strategy is a TEKsystems Global Services company with teams in Seattle and Salt Lake City, supporting customers throughout the US and across every vertical.