Inferring fields (using AWS Textract)¶
The Impira CLI allows you to use AWS Textract to detect key:value pairs in a document. This can be useful to avoid
having to create a schema, field-by-field, in Impira. Currently, impira infer-fields
is capable of inferring
text and selection (checkbox) fields. Although Textract does not distinguish between text, number, and timestamp
types, impira bootstrap
can automatically determine the correct field types within Impira.
Requirements¶
The infer-fields
command requires that your AWS credentials are setup for boto3
to run. The
AWS documentation
describes several ways to do so. The command does not allow you to specify your credentials explicitly.
The credentials must have access to Textract and to read/write files from Amazon S3. It will try to
create an impira-cli
bucket in your account automatically if one does not already exist, so it must have
permissions to do that, unless you specify one explicitly on the command line.
Inferring fields¶
You can run a command like the following:
$ impira infer-fields /path/to/document.ext
By default, the infer-fields
command will save the captured fields and data in your system’s temporary directory and
output its full path at the end of execution. You can specify your own data directory optionally with the --data
argument.
You can use the directory that infer-fields
outputs in the bootstrap
command to automatically setup
an Impira collection with the inferred fields and training data. Although you can bootstrap an Impira collection with as
little as one file, you can run impira infer-fields
across multiple files of the same document type to generate additional
training data.
For a full list of options, run impira infer-fields --help
.