Amazon Textract-based Document Redaction Proof of Concept

About King County of Washington

King County is home to Seattle, Washington, and is the most populous county in the state and the 12th most populous county in the United States. King County delivers vital services for more than 2 million residents and is one of the state’s largest employers, with 16,000 dedicated employees.

The Challenge

When senior citizens in King County seek property tax relief, they can submit forms electronically or on paper. Paper submissions are scanned, manually reviewed for Personally Identifiable Information (PII), and then redacted by hand. A similar challenge exists when unredacted documents are submitted online. Currently, a Senior Exemption Specialist assesses those forms to determine eligibility by age, income, and health care expenditures. King County wanted to automate, and therefore speed up, the process to reduce the manual labor involved and to increase the security and protection of seniors’ information.

Why Amazon Web Services

King County was already running workloads on AWS. Based on positive experiences with AWS, King County wanted to explore the additional possibilities for machine learning within AWS. On AWS’s recommendation, King County reached out to 1Strategy for support on this project. In the introductory meeting, 1Strategy listened to understand the project’s objective and demonstrated its extensive experience and expertise with similar ML projects, and King County came away confident in moving forward with 1Strategy.

“AWS has leading-edge technology that allows King County to innovate and build solutions that solve complex business problems,” says Tanya Hannah, King County Chief Information Officer. “With 1Strategy’s partnership and technical guidance, the team is well-prepared to extend intelligent processing to similar use cases and beyond.”

The Solution

1Strategy worked with King County to create a document redaction prototype in four weeks. The project used two managed machine learning services for text and image recognition: Amazon Textract and Amazon Rekognition. Custom code in an Amazon SageMaker Jupyter notebook called these services through the AWS Software Development Kit (SDK). This allowed the team to build only the customized pieces of the application, rather than implementing machine learning algorithms from scratch. The project produced a proof-of-concept data pipeline that automates and speeds up the redaction of sensitive PII in incoming documents, and it established a baseline for using Amazon’s AI services. Working with 1Strategy gave King County the education needed to understand the product and how to iterate on it. “The capability of Textract to read these documents and identify the different fields within them is pretty impressive,” said Eric Maia, Solution Architect, King County.

One of King County’s primary goals is to speed up the redaction of PII from seniors’ document submissions. To do so, the County worked with 1Strategy to design and build an AWS data pipeline (see Figure 1) that uses machine learning to identify the type of document, read it, and identify where PII should be redacted.

Documents enter the pipeline and are stored in Amazon Simple Storage Service (Amazon S3). From there, documents in three formats (JPG, PNG, and PDF) are imported into an Amazon SageMaker Jupyter notebook, where custom code standardizes them for further processing. This custom code deskews documents, separates multi-page documents into individual pages, and converts all pages into a standardized image format.

Once each page is in a standard format, the application must determine what type of document it is. As in many machine learning applications, the objective is to train a machine to do what people currently do. Just as a novice Exemption Specialist would first need to learn the types of forms they work with, the machine needs to be trained for this task. For this purpose, the team chose Amazon Rekognition’s custom image classifier. The team supplied samples of various document types to Rekognition, each labeled by an expert with its document type. From this training, Rekognition learns to perform the task on its own. Because Rekognition is an easy-to-implement managed service, designing and training an initial prototype took only a few hours, after which the classifier was ready for use by the data processing pipeline.
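
The classification step can be illustrated with a minimal Python sketch. It assumes the classifier is served through Rekognition Custom Labels, whose detect_custom_labels call returns a list of labels with confidence scores. The function name, the 80% threshold, and the label names below are hypothetical and do not come from the project; the sample response is hand-written rather than real service output.

```python
from typing import Optional


def pick_document_type(response: dict, min_confidence: float = 80.0) -> Optional[str]:
    """Return the highest-confidence label, or None if nothing clears the threshold.

    `response` is assumed to have the shape returned by the boto3 Rekognition
    client's detect_custom_labels call: {"CustomLabels": [{"Name", "Confidence"}]}.
    """
    labels = response.get("CustomLabels", [])
    if not labels:
        return None
    best = max(labels, key=lambda label: label["Confidence"])
    return best["Name"] if best["Confidence"] >= min_confidence else None


# Hand-written example response; the label names are hypothetical.
sample_response = {
    "CustomLabels": [
        {"Name": "tax_form", "Confidence": 97.2},
        {"Name": "medical_receipt", "Confidence": 41.5},
    ]
}

print(pick_document_type(sample_response))
```

Returning None for a low-confidence page gives the pipeline a natural point at which to route an unrecognized document to a person instead of guessing its type.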

After Rekognition has identified a document’s type, the document is read by Amazon Textract, Amazon’s machine learning text recognition service. Textract can find places on a form that pair a prompt with a response, such as “Social Security Number” and “123-45-6789.” Only a subset of these responses should be redacted. Figure 2 illustrates the results of redacting only certain responses on a simulated sample tax document. Most of the responses are left alone, but seven responses have either both a red and a blue box or only a blue box superimposed over their information. The red boxes are successful redactions in which Textract found a desired prompt and the pipeline then covered its response. The blue boxes use locational data to cover areas of the page where a redaction is expected to be needed.
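
As an illustration of the prompt/response lookup, the Python sketch below walks a response in the shape Textract returns for a forms analysis, where KEY blocks link to their VALUE blocks and both link to WORD children. It is a sketch under stated assumptions, not the project's code: the helper names are hypothetical, the sample response is hand-written, and in a real pipeline the response would come from a Textract call made through the AWS SDK.

```python
def block_text(block, blocks_by_id):
    """Join the WORD children of a KEY or VALUE block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_pairs(response):
    """Return one dict per prompt/response pair found in the response."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] != "VALUE":
                continue
            for value_id in rel["Ids"]:
                value = blocks_by_id[value_id]
                pairs.append({
                    "prompt": block_text(block, blocks_by_id),
                    "response": block_text(value, blocks_by_id),
                    "response_box": value["Geometry"]["BoundingBox"],
                })
    return pairs


def boxes_to_redact(pairs, prompts_to_redact):
    """Keep only the response boxes whose prompt was marked for redaction."""
    wanted = {prompt.lower() for prompt in prompts_to_redact}
    return [pair["response_box"] for pair in pairs
            if pair["prompt"].strip().rstrip(":").lower() in wanted]


def _kv(block_id, entity, box, child_ids, value_ids=()):
    """Build a minimal KEY_VALUE_SET block for the hand-written sample."""
    relationships = [{"Type": "CHILD", "Ids": list(child_ids)}]
    if value_ids:
        relationships.append({"Type": "VALUE", "Ids": list(value_ids)})
    left, top, width, height = box
    return {"Id": block_id, "BlockType": "KEY_VALUE_SET", "EntityTypes": [entity],
            "Geometry": {"BoundingBox": {"Left": left, "Top": top,
                                         "Width": width, "Height": height}},
            "Relationships": relationships}


def _word(block_id, text):
    """Build a minimal WORD block for the hand-written sample."""
    return {"Id": block_id, "BlockType": "WORD", "Text": text}


# Hand-written sample: two prompts, only one of which should be redacted.
sample_response = {"Blocks": [
    _kv("k1", "KEY", (0.10, 0.20, 0.25, 0.03), ["w1", "w2", "w3"], ["v1"]),
    _kv("v1", "VALUE", (0.40, 0.20, 0.20, 0.03), ["w4"]),
    _kv("k2", "KEY", (0.10, 0.30, 0.10, 0.03), ["w5", "w6"], ["v2"]),
    _kv("v2", "VALUE", (0.40, 0.30, 0.08, 0.03), ["w7"]),
    _word("w1", "Social"), _word("w2", "Security"), _word("w3", "Number:"),
    _word("w4", "123-45-6789"),
    _word("w5", "Tax"), _word("w6", "Year:"), _word("w7", "2021"),
]}

pairs = extract_pairs(sample_response)
redaction_boxes = boxes_to_redact(pairs, ["Social Security Number"])
print(redaction_boxes)
```

The bounding boxes are ratios of the page width and height, so drawing a redaction box on the page image means scaling each one by the image's pixel dimensions.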

To set up this red/blue redaction process, the team used a separate custom application, implemented in an Amazon SageMaker notebook, to record all the prompts and their positions on an exemplar document of a specific type, along with an expert’s specification of whether to redact each corresponding response. For a new document of the same type, the pipeline uses this expert and locational data to decide which of the responses Textract found should be redacted (red box) and where redaction boxes are expected to fall on the new document (blue box). The locational data uses a linear regression to map from the coordinate system of the exemplar document to that of the new document, fitted on the positional data Textract returns for the prompt/response pairs it finds on the new document. Once a document is successfully redacted, it is stored in Amazon S3.
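
The coordinate mapping can be sketched in Python as follows. The text does not specify the form of the regression, so this sketch assumes the simplest one: an independent least-squares line for each axis (a scale and a shift, with no rotation or shear), fitted on prompts found on both pages. All function names and sample coordinates are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two matched prompts")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("matched prompts must not share one coordinate")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_page_mapping(exemplar_points, new_points):
    """Fit one line per axis from prompts present on both pages.

    Each argument maps a prompt string to its (left, top) position.
    """
    shared = sorted(exemplar_points.keys() & new_points.keys())
    x_fit = fit_line([exemplar_points[k][0] for k in shared],
                     [new_points[k][0] for k in shared])
    y_fit = fit_line([exemplar_points[k][1] for k in shared],
                     [new_points[k][1] for k in shared])
    return x_fit, y_fit


def map_box(box, mapping):
    """Carry an exemplar bounding box into the new page's coordinates."""
    (x_slope, x_shift), (y_slope, y_shift) = mapping
    return {
        "Left": x_slope * box["Left"] + x_shift,
        "Top": y_slope * box["Top"] + y_shift,
        "Width": abs(x_slope) * box["Width"],
        "Height": abs(y_slope) * box["Height"],
    }


# Hypothetical prompt positions on the exemplar page.
exemplar = {
    "Name": (0.10, 0.10),
    "Social Security Number": (0.10, 0.30),
    "Date of Birth": (0.60, 0.50),
}
# Simulated new scan: slightly shrunk, stretched, and shifted.
new_page = {k: (0.9 * x + 0.05, 1.1 * y - 0.02) for k, (x, y) in exemplar.items()}

mapping = fit_page_mapping(exemplar, new_page)
exemplar_response_box = {"Left": 0.30, "Top": 0.30, "Width": 0.20, "Height": 0.03}
blue_box = map_box(exemplar_response_box, mapping)
print(blue_box)
```

Because the fit uses only prompt positions, a box can be placed for a response even when Textract fails to pair that response with its prompt, which is the situation the blue boxes are meant to cover.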

The algorithm rejects documents that are difficult for Textract to read. It does so by requiring a minimum number of detected prompt/response pairs and by checking for rotation angles too large for Textract to read reliably. Rejected documents are moved to a different S3 location and made available for human review.
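
A rejection rule of this kind might look like the Python sketch below. The thresholds (five pairs, ten degrees) are placeholders, since the text does not give the values used, and the rotation estimate assumes that each text polygon lists its top-left corner first and its top-right corner second. The function names are hypothetical.

```python
import math
import statistics


def rotation_degrees(polygon, page_width=1.0, page_height=1.0):
    """Angle of a polygon's top edge, in degrees, relative to horizontal.

    Assumes the first two points are the top-left and top-right corners of
    the text. Coordinates are page-relative ratios, so pass the page size in
    pixels to get a true angle on non-square pages.
    """
    dx = (polygon[1]["X"] - polygon[0]["X"]) * page_width
    dy = (polygon[1]["Y"] - polygon[0]["Y"]) * page_height
    return math.degrees(math.atan2(dy, dx))


def review_reason(pair_count, line_polygons, min_pairs=5, max_rotation_deg=10.0):
    """Return why a page needs human review, or None if it can proceed."""
    if pair_count < min_pairs:
        return "too few prompt/response pairs"
    if not line_polygons:
        return "no text lines detected"
    angles = [rotation_degrees(polygon) for polygon in line_polygons]
    # The median resists a few badly detected lines better than the mean.
    if abs(statistics.median(angles)) > max_rotation_deg:
        return "page rotated too far"
    return None


# Hand-written polygons: one level line and one tilted by 45 degrees.
level_line = [{"X": 0.1, "Y": 0.10}, {"X": 0.5, "Y": 0.10},
              {"X": 0.5, "Y": 0.15}, {"X": 0.1, "Y": 0.15}]
tilted_line = [{"X": 0.1, "Y": 0.1}, {"X": 0.4, "Y": 0.4},
               {"X": 0.35, "Y": 0.45}, {"X": 0.05, "Y": 0.15}]

print(review_reason(12, [level_line]))
print(review_reason(12, [tilted_line]))
print(review_reason(2, [level_line]))
```

Returning a reason rather than a bare flag lets the reviewer see why a document was set aside.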

For this proof of concept, we used a simulated sample of a single document type. The pipeline marked 84% of the documents in that sample as successfully redacted, and 95% of those were indeed redacted correctly. Of the remaining 16% of documents, which were marked for human review, 87% indeed needed it. In the future, the team will extend the project to include additional document types, using additional exemplar documents and more extensive training of the Rekognition classifier to broaden the types of documents the pipeline can handle.
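
These percentages are conditional rates, so it can help to combine them into shares of the whole sample. The short Python sketch below does only that arithmetic; the sample size is not stated, so it works with fractions rather than document counts, and the dictionary keys are descriptive names chosen here.

```python
from fractions import Fraction

marked_success = Fraction(84, 100)         # marked as successfully redacted
marked_review = 1 - marked_success         # marked for human review (16%)
success_precision = Fraction(95, 100)      # marked successes that were correct
review_precision = Fraction(87, 100)       # review flags that were warranted

shares = {
    "redacted automatically and correct": marked_success * success_precision,
    "marked successful but not correct": marked_success * (1 - success_precision),
    "sent to review and needed it": marked_review * review_precision,
    "sent to review unnecessarily": marked_review * (1 - review_precision),
}

for outcome, share in shares.items():
    print(f"{outcome}: {float(share):.2%}")
```

Read this way, roughly 80% of all documents were redacted correctly with no human involvement, while about 4% were marked successful despite an incorrect redaction; that second group is the one a production rollout would most need to watch.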

The ease of implementation of AWS managed machine learning services provided an opportunity to create a proof of concept in four weeks and establish a baseline for King County’s prospects for using machines rather than people to read and redact documents. “The 1Strategy team really focused on making sure that we understood the product we were putting together, how it works, and how to extend and apply it,” said Maia.

“The prototype is delivering business value to the County. With 1Strategy’s help creating a prototype in just four weeks, the data pipeline has reduced the time it takes to search for and redact PII from 30 minutes to just 5 seconds per application,” said Hannah. “This automation is helping the County’s Assessors staff keep up with the 8,000 new applications received annually and clear the backlog of 4,000 unprocessed applications with 100,000 pages of accompanying documents from the 2021 tax year.”

About 1Strategy

1Strategy is an Amazon Web Services (AWS) Partner Network (APN) Premier Consulting Partner, focusing exclusively on AWS. 1Strategy helps businesses architect, migrate, and optimize their workloads on AWS, creating scalable, cost-effective, secure, and reliable solutions. 1Strategy also helps customers get real value from their data using comprehensive machine learning models and artificial intelligence. 1Strategy holds the AWS DevOps, Migration, Data & Analytics, and Machine Learning Competencies, and is a partner of the AWS Well-Architected and Public Sector programs. 1Strategy was one of the initial ten AWS Partners globally qualified and authorized by AWS to conduct a Well-Architected Review and is among the top Well-Architected partners in the AWS ecosystem. With experts who have deployed AWS solutions since 2007, 1Strategy is a leader in custom training, providing customers with the knowledge, tools, and best practices to manage those solutions over time. 1Strategy is a TEKsystems Global Services company with teams in Seattle and Salt Lake City, supporting customers throughout the US and across every vertical.