AWS Big Data–Amazon S3 Datalake
FamilySearch International is the largest genealogy organization in the world. FamilySearch is a nonprofit, volunteer-driven organization sponsored by The Church of Jesus Christ of Latter-day Saints. Millions of people use FamilySearch records, resources, and free online services to learn more about their family history.
FamilySearch adds 400 million new historical records every year. It makes these records, resources, and services available online at FamilySearch.org and at more than 5,000 family research centers around the world, including the renowned Family History Library in Salt Lake City, Utah. FamilySearch faces the task of collecting raw data from multiple sources and then mining and analyzing the collected data that feeds into its applications and services. This undertaking is critical to meeting its objective of helping the world’s people connect with their families’ histories and stories.
FamilySearch’s challenge was to find a data storage and processing solution that offers more efficiency and security in analyzing their back-end data than their current infrastructure.
Why Amazon Web Services
FamilySearch was already benefiting from Amazon Web Services (AWS), as most of their infrastructure runs on AWS. However, the nonprofit family history organization was seeking a solution that would let it manage and govern its back-end data applications, advance and secure its big data analytics, and incorporate machine learning into its architecture for long-term development.
Having attended 1Strategy’s AWS events and aware of 1Strategy’s capabilities and reputation, FamilySearch partnered with 1Strategy, an AWS Premier Consulting Partner. The organization turned to 1Strategy to benefit from their AWS expertise in big data, AI/ML solutions, and know-how in solving this type of challenge in the most cost-effective way.
1Strategy developed a prototype data lake solution on AWS, together with an estimate of its running costs, giving FamilySearch defined mechanisms to catalog and secure their data.
FamilySearch collects data in many forms and formats, both in batch processing and through real-time streaming processes. For FamilySearch, when user data is collected in real-time from each application database, it can be encoded in any type or format. FamilySearch needed to have an overall view of users without centralizing the data storage layer. Creating such a picture of users requires searching across all data stores, regardless of their location or language, while still matching data to the appropriate user.
1Strategy recommended building a data lake solution on AWS because it would allow FamilySearch to automatically organize, catalog, and map data across their applications at petabyte scale, regardless of where the data comes from or how it is encoded.
“A data lake solution on AWS meets FamilySearch’s challenge of accumulating data from hundreds of its applications,” said Rich Uhl, Founder & CTO of 1Strategy. “That data can then be used for analytics and applied Machine Learning applications for its highly data-driven development, well into the future.”
1Strategy conducted a Proof of Concept project and created a prototype Amazon Simple Storage Service (Amazon S3)-based data lake solution that fits the needs and challenges of FamilySearch. The centralized repository of the data lake enables FamilySearch to build high performing data analytics and business identification tools by:
- Storing all kinds of raw data—structured and unstructured—at any scale
- Supporting different types of analytics on the data such as dashboards, visualizations, big data processing, and machine learning based on their needs.
To summarize the prototype diagram, using an Amazon S3-based data lake architecture will provide FamilySearch benefits in the following areas:
1Strategy recommended using Amazon Kinesis Firehose to manage real-time streaming data from various application sources. Firehose is a fully managed, scalable service that ingests multiple types of data from various sources in near real time and delivers them reliably to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Other benefits include:
- Firehose can invoke AWS Lambda functions, giving developers at FamilySearch the ability to create event-driven transformations of their real-time data.
- Firehose is cost-effective: there are no setup fees or upfront commitments, and you pay only for the volume of data you send through Firehose.
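As an illustration of the event-driven transformations mentioned above, the following is a minimal sketch of a Lambda function attached to a Firehose delivery stream. It relies on the documented contract for Firehose data transformation (base64-encoded `data`, a `recordId`, and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`). The `source_app` field and the normalization applied to it are assumptions made up for this example and are not taken from FamilySearch's implementation.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the expected response shape."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Unparseable input: return the original data marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        # Hypothetical normalization: lower-case the (assumed) source_app field.
        payload["source_app"] = str(payload.get("source_app", "unknown")).lower()

        # Newline-delimit so each line written to S3 is one JSON document.
        transformed = (json.dumps(payload, sort_keys=True) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```

Returning failed records with their original data, rather than raising an exception, lets the remaining records in the batch proceed to delivery.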
The data catalog is a vital component of an Amazon S3-based data lake and provides a queryable interface to all assets stored in the data lake’s Amazon S3 buckets. The data catalog is designed to provide a single source of truth about the contents of the data lake. Based on the nature of FamilySearch’s business needs, 1Strategy advised FamilySearch to use the Data Catalog within AWS Glue.
- The Data Catalog provided by AWS Glue is able to organize and track data for cross-application queries and future analytics tools.
- The AWS Glue-generated catalog will contain information about data assets that will be transformed into various formats and table definitions and can be used by Amazon Athena, Amazon Redshift, Amazon Redshift Spectrum, and Amazon EMR, as well as third-party analytics tools that use a standard Hive Metastore Catalog.
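To make the idea of a catalog table definition concrete, here is a small sketch that builds a `TableInput` dictionary in the shape accepted by the Glue `CreateTable` API for newline-delimited JSON stored in Amazon S3. The bucket, database, table, and column names are hypothetical placeholders, not details of the FamilySearch prototype. In practice a Glue crawler could generate an equivalent definition automatically.

```python
def build_table_input(name, s3_location, columns, partition_keys=()):
    """Return a Glue TableInput dict describing JSON data at an S3 location."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


# Hypothetical table for user events landed in the lake by the streaming layer.
table_input = build_table_input(
    name="user_events",
    s3_location="s3://example-data-lake/raw/user_events/",
    columns=[("user_id", "bigint"), ("source_app", "string"),
             ("event_time", "timestamp")],
    partition_keys=[("dt", "string")],
)

# With AWS credentials configured, the definition could be registered with:
#   import boto3
#   boto3.client("glue").create_table(
#       DatabaseName="example_lake", TableInput=table_input)
```

Once registered, a table like this can be queried by any engine that reads the Glue Data Catalog, which is what allows a single definition to serve several analytics services.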
An Amazon S3-based data lake solution has features, tools, and policies to store and protect data as well as to monitor, analyze, and govern that data. Alerting and monitoring for events in FamilySearch’s data lake are provided by Amazon CloudWatch. FamilySearch will be able to use CloudWatch to collect and log API calls and to trigger alarms.
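One way such an alarm could look is sketched below: a helper that assembles the parameters for the CloudWatch `PutMetricAlarm` API so that an alert fires when a Firehose delivery stream falls behind in writing to Amazon S3. The metric `DeliveryToS3.DataFreshness` in the `AWS/Firehose` namespace is a standard Firehose metric. The stream name, threshold, and SNS topic ARN are hypothetical values chosen for illustration only.

```python
def freshness_alarm_params(stream_name, max_age_seconds, sns_topic_arn):
    """Build PutMetricAlarm parameters for stale Firehose-to-S3 delivery."""
    return {
        "AlarmName": f"{stream_name}-s3-delivery-stale",
        "Namespace": "AWS/Firehose",
        "MetricName": "DeliveryToS3.DataFreshness",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 300,               # evaluate in 5-minute windows
        "EvaluationPeriods": 3,      # require 3 consecutive breaching windows
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical stream and notification topic.
params = freshness_alarm_params(
    stream_name="example-user-events",
    max_age_seconds=900,
    sns_topic_arn="arn:aws:sns:us-west-2:123456789012:example-data-lake-alerts",
)

# With AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```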
Collection, storage, and analysis of data are critical for FamilySearch in helping millions of users research and access their genealogy. An Amazon S3-based data lake is the solution that meets FamilySearch’s data needs for applications and services in a highly performant and secure way.
About the Partner
1Strategy is an Amazon Partner Network (APN) Premier Consulting Partner, focusing exclusively on Amazon Web Services (AWS). 1Strategy helps businesses architect, migrate, and optimize their workloads on AWS, creating scalable, cost-effective, secure, and reliable solutions. 1Strategy also helps customers get real value from their data using comprehensive machine learning models and artificial intelligence. 1Strategy holds the AWS DevOps, Migration, Data & Analytics, Well Architected, and Machine Learning Competencies, and is a partner of the AWS Public Sector Program. 1Strategy was one of the first ten AWS Partners globally to be qualified and authorized by AWS to conduct a Well-Architected Review and is among the top Well Architected partners in the AWS ecosystem. With experts who have deployed AWS solutions since 2007, 1Strategy is a leader in custom training, providing customers with the knowledge, tools, and best practices to manage those solutions over time. 1Strategy is a TEKsystems Global Services company with teams in Seattle and Salt Lake City, supporting customers throughout the US and across every vertical.
For more information about how 1Strategy can assist your company in migrating to AWS, building scalable and secure Big Data analytics, or optimizing AWS solutions, visit 1Strategy.com or contact us at info@1Strategy.com.