Kendra Doc Count by Datasource

Problem Statement

Amazon Kendra is a managed service from AWS that provides an easy way to set up a search solution your users can interact with. You start with an index and set up data sources, which are primarily connectors for ingesting the documents and data you want to search against.

Even though the service wraps a lot of functionality in the connectors and the managed index, and offers a good set of APIs to interact with and query the index, I have one pet peeve.

“Document count by data source”: it is a little annoying that there is no easy way to find the total number of documents included in a specific data source. The columns available for data sources in the Amazon Kendra UI do not include a “Document Count”. The closest thing to a document count on the Kendra side is:

  • the total document count for the index on the home page of Kendra.
  • the total document count for the index on the Search Analytics Dashboard.

Solution

So far I have listed the problem and the places where the Kendra service could have included this information but doesn’t. So what can we do in this case?

Use the available “Search indexed content” page: enter the _data_source_id of your data source in quotes and search for it. Voila. So why does this work?

The _data_source_id is one of the document metadata fields, and the search query always returns it as a string. The search therefore finds every document carrying the matching string and gives you the count alongside the results. The Query API can do the same thing if you want to run this from a Lambda function.
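
As a rough sketch of the Lambda route, the snippet below sends the quoted data source ID as the query text, mirroring the console trick above, and reads the count from the `TotalNumberOfResults` field of the boto3 Kendra `query` response. The function names and the `index_id` / `data_source_id` event keys are my own placeholders, not something Kendra prescribes, and I am assuming the API-side count matches what the console shows.

```python
def count_documents(kendra_client, index_id, data_source_id):
    """Return the number of results Kendra reports for a data source ID.

    kendra_client is expected to behave like boto3.client("kendra").
    """
    response = kendra_client.query(
        IndexId=index_id,
        # Quote the ID so it is searched as an exact string,
        # the same way it is typed into "Search indexed content".
        QueryText=f'"{data_source_id}"',
    )
    return response["TotalNumberOfResults"]


def lambda_handler(event, context):
    """Hypothetical Lambda entry point; the event keys are assumptions."""
    import boto3  # imported here so count_documents stays testable without AWS

    client = boto3.client("kendra")
    count = count_documents(client, event["index_id"], event["data_source_id"])
    return {"data_source_id": event["data_source_id"], "document_count": count}
```

Outside Lambda, the same helper can be called directly, for example `count_documents(boto3.client("kendra"), "<your-index-id>", "<your-data-source-id>")`. Passing the client in as an argument keeps the function easy to exercise with a stub.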