In the healthcare industry, HIPAA compliance is an essential requirement for any process that handles health records. Title II of HIPAA defines the policies and guidelines for compliance, and its Security Rule lays out the requirements for securing Protected Health Information (PHI). The Security Rule's guidelines center on ensuring that only authorized people have access to PHI and that PHI data maintains its integrity and security. Data encryption is a vital tool for meeting these guidelines.
Medical claims data is ripe for analysis: it contains trends and patterns that can inform decision making and help optimize the healthcare delivery process.
Big Data tools are essential for extracting that information from the large volumes of claims data available. These tools and methodologies have been providing business value in many domains, and healthcare is no different; there is a great deal of value to be found in medical claims data.
Before any of the Big Data tools and techniques can be used with claims data, the data must adhere to the HIPAA Title II Security Rule guidelines. This can be a daunting task, especially because most Big Data tools are open frameworks designed for ease of analysis rather than for security.
To leverage the Big Data tools and techniques, and gain the value and benefits they provide, we implemented a solution that covers both data access management and data encryption, built with some scripting and automation. Below is a high-level diagram of the architecture.
As with any complete encryption solution, we needed to address both data at rest and data in transit. To encrypt data at rest, you need to ensure that data is encrypted in the following places:
- EMRFS (for data in S3) – achieved via S3 client-side encryption with AWS KMS.
- HDFS – via HDFS transparent data encryption, which is described in the Apache documentation.
- Temporary space – using volume encryption for the directories that contain temporary data.
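As a sketch, the first two at-rest settings above can be expressed with EMR configuration classifications. The key ARN and KMS endpoint below are placeholders, and the property names assume the EMR 4.x `emrfs-site`/`hdfs-site` classifications and the Hadoop 2.6 KMS default port:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider",
      "fs.s3.cse.kms.keyId": "arn:aws:kms:us-east-1:111122223333:key/your-key-id"
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.encryption.key.provider.uri": "kms://http@localhost:16000/kms"
    }
  }
]
```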
To encrypt data in transit, you need to ensure data encryption in the following scenarios:
- S3 to an EMR cluster node – Traffic from S3 to an EC2 instance that is part of an EMR cluster is transported over HTTPS. In addition, if you are using EMRFS support for S3 client-side encryption, the object remains encrypted over the wire (decryption happens in the EMRFS client).
- Hadoop RPC – Hadoop RPC is used by API clients of MapReduce and by the JobTracker, TaskTracker, NameNode, and DataNodes; you can read more about it at https://wiki.apache.org/hadoop/HadoopRpc. Hadoop's RPC implementation supports SASL, and it is recommended to set hadoop.rpc.protection to privacy in core-site.xml.
- HDFS data transfer protocol (DTP) – When using HDFS transparent data encryption, this traffic is automatically encrypted, because blocks are encrypted before they leave the client.
- Hadoop MapReduce shuffle – In the shuffle phase, Hadoop MapReduce (MRv2) transfers the output of each map task to reducers on other nodes over HTTP by default. You can configure Hadoop MapReduce to use HTTPS instead by enabling "encrypted shuffle". This is enabled in this script.
- Spark block transfer service – This can be encrypted using SASL encryption as of Spark 1.5.1.
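For reference, the in-transit settings above map roughly to the following EMR configuration classifications. This is a sketch using the standard Hadoop and Spark property names; the SSL keystore/truststore configuration (ssl-server.xml and ssl-client.xml) that encrypted shuffle also requires is omitted:

```json
[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.rpc.protection": "privacy"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.shuffle.ssl.enabled": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.authenticate": "true",
      "spark.authenticate.enableSaslEncryption": "true"
    }
  }
]
```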
As of EMR release 4.1.0, you can enable HDFS Transparent Data Encryption (TDE) for HDFS encryption at rest, using the Hadoop KMS component included in Hadoop 2.6 to supply keys to HDFS. With TDE, HDFS itself never handles unencrypted data, because encryption and decryption happen in the client.
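Once HDFS is backed by the Hadoop KMS, setting up an encryption zone is a two-step process. The following is a sketch of the standard commands, run on the master node; the key name and path are illustrative:

```sh
# Create a key in the Hadoop KMS
hadoop key create mykey

# Create an empty directory and mark it as an encryption zone
# (the -createZone command must be run as the HDFS superuser)
hdfs dfs -mkdir /encrypted
hdfs crypto -createZone -keyName mykey -path /encrypted
```

Any file subsequently written under the encryption zone is transparently encrypted with a per-file key derived from the zone key.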
Configuration of EMR
This section details how to create an EMR cluster that meets the requirements detailed above.
The steps are broken into five high-level areas:
- Upload the required configuration files and scripts to an S3 bucket.
- Create a KMS key for EMR.
- Create a cluster from the console.
- Run a sample job to confirm successful cluster creation.
- Validate that encryption is working correctly.
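The first three steps can be sketched with the AWS CLI. The bucket name, key alias, and cluster parameters below are placeholders, and the create-cluster arguments are abbreviated (you would also pass the configuration classifications and any bootstrap actions described above):

```sh
# 1. Upload the configuration files and scripts to S3
aws s3 cp ./config/ s3://my-emr-config-bucket/config/ --recursive

# 2. Create a KMS key for EMR and give it a friendly alias
aws kms create-key --description "EMR encryption key"
aws kms create-alias --alias-name alias/emr-key \
  --target-key-id <key-id-from-previous-step>

# 3. Create the cluster (abbreviated)
aws emr create-cluster --release-label emr-4.1.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m3.xlarge --instance-count 3 \
  --configurations https://s3.amazonaws.com/my-emr-config-bucket/config/configurations.json
```

Steps 4 and 5 are then a matter of submitting a small sample job (for example, a trivial Spark or MapReduce step) and inspecting the cluster, for instance by confirming that raw blocks on the HDFS data volumes are not readable plaintext.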