How to limit search results when combining Azure Cognitive Search & Azure OpenAI using Azure AD groups

The Azure OpenAI service is being used to create more interactive & intelligent chatbots. A key use case is having the OpenAI service respond to user requests using your own data. My GitHub fork contains an example implementation (with security filtering using Azure AD groups).

Why combine Azure Cognitive Search & Azure OpenAI?

A web app that calls the generic Azure OpenAI service on its own cannot answer user requests that are specific to your data, because the OpenAI model has not been trained on that data.

A different approach would be to inject all the data the OpenAI service needs into each user request. This doesn’t work either, because each request is limited in how much data it can contain (token limits). Even though these limits are increasing, they are unlikely to ever accommodate all your data, and sending everything on every request would be inefficient & impractical.

Instead, the recommended approach is the Retrieval-Augmented Generation (RAG) pattern. You take the user’s initial query, pass it to Azure Cognitive Search, and let Cognitive Search retrieve relevant snippets of information using standard search engine technology.

Finally, pass those snippets along with the user’s initial query to the OpenAI service to generate a response. Because the response is grounded in the retrieved snippets, the app can also show how the response was produced & provide reference links to additional information.
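
The retrieve, augment & generate flow can be sketched in a few lines of Python, used here only as a compact, language-neutral illustration. The `retrieve` and `generate` callables are hypothetical stand-ins for the Cognitive Search query and the Azure OpenAI call (neither is a real SDK function), and the prompt layout is an assumption rather than what the sample app uses.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, snippets: List[Dict[str, str]]) -> str:
    """Augment step: combine the retrieved snippets with the user's question."""
    # Label each snippet with its source so the response can cite it.
    sources = "\n".join(f"[{s['source']}] {s['content']}" for s in snippets)
    return (
        "Answer using only the sources below and cite the source name.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str], List[Dict[str, str]]],
    generate: Callable[[str], str],
) -> str:
    snippets = retrieve(question)              # 1. retrieve
    prompt = build_prompt(question, snippets)  # 2. augment
    return generate(prompt)                    # 3. generate


# Stand-in implementations so the sketch runs without any Azure resources.
def fake_retrieve(question: str) -> List[Dict[str, str]]:
    return [{"source": "benefits.pdf", "content": "Employees get 20 vacation days."}]


def fake_generate(prompt: str) -> str:
    return f"(model response to a {len(prompt)}-character prompt)"


print(answer("How many vacation days do I get?", fake_retrieve, fake_generate))
```

Keeping the source name next to each snippet is what makes the reference links possible: the model only sees, and can only cite, what the search step returned.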

Why filter search results from Azure Cognitive Search?

Cognitive Search is a search engine that indexes all the documents, databases, etc. you give it. However, there may be cases where you want an index of lots of data, but you don’t necessarily want all users to have access to all of it. Examples include:

  • Financial data
  • HR data
  • Classified data

In cases such as these, you need to filter search results depending on who the user is (the CFO should be able to see all financial data, even though other employees may not be allowed to see it).

Azure Cognitive Search supports this use case with security filters. Security filters allow you to pass in additional information when retrieving search results to limit results to only data the user has access to.

There are three steps required to implement security filtering:

  • Create an index that includes a field for security filtering (such as Azure AD group IDs)
  • Include which Azure AD group IDs are allowed to see the data on initial index of each document
  • Include the list of Azure AD group IDs that the user is a part of so the security filtering can be applied on each query
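
Conceptually, the filter behaves like the set intersection below. The real filtering happens inside Cognitive Search; this is only an in-memory model of the idea, and `trim_results` and the sample documents are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List


def trim_results(docs: List[Dict], user_group_ids: Iterable[str]) -> List[Dict]:
    """Keep only documents that share at least one group ID with the user."""
    user_groups = set(user_group_ids)
    return [d for d in docs if user_groups.intersection(d["group_ids"])]


# Step 2 of the list: every document carries the groups allowed to see it.
docs = [
    {"file_name": "secured_file_a", "group_ids": ["group_id1"]},
    {"file_name": "secured_file_b", "group_ids": ["group_id1", "group_id2"]},
    {"file_name": "secured_file_c", "group_ids": ["group_id5", "group_id6"]},
]

# Step 3 of the list: the user's groups are supplied with the query.
visible = trim_results(docs, ["group_id2"])
print([d["file_name"] for d in visible])  # ['secured_file_b']
```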

Create an index that includes a field for security filtering

When you set up a Cognitive Search index, include a field for security filtering. Make this field filterable, so it can be used in query filters, & non-retrievable, so the group IDs are never returned in search results.

Example REST API call

POST https://[search service].search.windows.net/indexes?api-version=2020-06-30
{
     "name": "securedfiles",  
     "fields": [
         {"name": "file_id", "type": "Edm.String", "key": true, "searchable": false },
         {"name": "file_name", "type": "Edm.String", "searchable": true },
         ...
         {"name": "group_ids", "type": "Collection(Edm.String)", "filterable": true, "retrievable": false }
     ]
 }

Example C#

var index = new SearchIndex(options.SearchIndexName)
{
    Fields =
    {
        new SimpleField("file_id", SearchFieldDataType.String) { IsKey = true, ... },
        new SimpleField("file_name", SearchFieldDataType.String) { ... },
        ...
        // Security field: filterable so queries can filter on it,
        // hidden (non-retrievable) so it is never returned in results.
        new SimpleField("group_ids", SearchFieldDataType.Collection(SearchFieldDataType.String))
            { IsFilterable = true, IsHidden = true },
    },
    ...
};

await indexClient.CreateIndexAsync(index);
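
As a rough sketch of what goes over the wire, the Python below assembles the URL and JSON body for a Create Index call without sending it. The function name `create_index_request` and the service name `my-search-service` are hypothetical, and only the three fields shown in the examples above are included.

```python
import json

API_VERSION = "2020-06-30"


def create_index_request(service_name: str, index_name: str) -> tuple:
    """Return the (url, body) pair for a Create Index call; nothing is sent."""
    url = f"https://{service_name}.search.windows.net/indexes?api-version={API_VERSION}"
    body = {
        "name": index_name,
        "fields": [
            {"name": "file_id", "type": "Edm.String", "key": True, "searchable": False},
            {"name": "file_name", "type": "Edm.String", "searchable": True},
            # Security field: usable in filters, never returned in results.
            {
                "name": "group_ids",
                "type": "Collection(Edm.String)",
                "filterable": True,
                "retrievable": False,
            },
        ],
    }
    return url, body


url, body = create_index_request("my-search-service", "securedfiles")
print(url)
print(json.dumps(body, indent=2))
```

Actually sending the request would also need an `api-key` header carrying an admin key, which is left out here.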

Include which Azure AD group IDs are allowed to see the data on initial index of each document

Each time a new document is uploaded & indexed, you need to include the list of Azure AD group IDs that are allowed to have this document in their search results. These Azure AD group IDs are GUIDs; the examples below use placeholders such as group_id1 for readability.

Example REST API call

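POST https://[search service].search.windows.net/indexes/securedfiles/docs/index?api-version=2020-06-30
{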
    "value": [
        {
            "@search.action": "upload",
            "file_id": "1",
            "file_name": "secured_file_a",
            "file_description": "File access is restricted to the Human Resources.",
            "group_ids": ["group_id1"]
        },
        {
            "@search.action": "upload",
            "file_id": "2",
            "file_name": "secured_file_b",
            "file_description": "File access is restricted to Human Resources and Recruiting.",
            "group_ids": ["group_id1", "group_id2"]
        },
        {
            "@search.action": "upload",
            "file_id": "3",
            "file_name": "secured_file_c",
            "file_description": "File access is restricted to Operations and Logistics.",
            "group_ids": ["group_id5", "group_id6"]
        }
    ]
}

Example C#

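var searchClient = await GetSearchClientAsync(options);
var batch = new IndexDocumentsBatch<SearchDocument>();
foreach (var section in sections)
{
    // Queue one action per section; group_ids lists the Azure AD groups allowed to see it.
    batch.Actions.Add(new IndexDocumentsAction<SearchDocument>(
        IndexActionType.MergeOrUpload,
        new SearchDocument
        {
            ["file_id"] = section.Id,
            ["file_name"] = section.SourceFile,
            ["group_ids"] = section.GroupIds
        }
    ));
}

// Send the queued actions once, after the loop, so earlier documents are not re-sent
// on every iteration. A single request accepts at most 1,000 actions, so larger sets
// need to be split into chunks.
IndexDocumentsResult result = await searchClient.IndexDocumentsAsync(batch);
...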
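
The upload payload itself is easy to assemble by hand. The sketch below builds the same `value` array as the REST example from a list of plain dictionaries; `build_upload_batch` is a hypothetical helper. Raising on a missing group list is a design choice made for this sketch (an assumption, not something the service requires): a document indexed with no group IDs would simply never match the security filter, and failing early makes that mistake visible.

```python
from typing import Dict, List


def build_upload_batch(files: List[Dict]) -> Dict:
    """Build the JSON body for an index-documents request."""
    actions = []
    for f in files:
        group_ids = list(f.get("group_ids", []))
        if not group_ids:
            # Sketch-level safeguard: without group IDs nobody could find the file.
            raise ValueError(f"file {f['file_id']} has no group_ids")
        actions.append(
            {
                "@search.action": "mergeOrUpload",
                "file_id": f["file_id"],
                "file_name": f["file_name"],
                "group_ids": group_ids,
            }
        )
    return {"value": actions}


batch = build_upload_batch(
    [
        {"file_id": "1", "file_name": "secured_file_a", "group_ids": ["group_id1"]},
        {"file_id": "2", "file_name": "secured_file_b", "group_ids": ["group_id1", "group_id2"]},
    ]
)
print(len(batch["value"]), "actions queued")
```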

Include the list of Azure AD group IDs that the user is a part of so the security filtering can be applied on each query

On each query, include the list of Azure AD group IDs the user is a part of (that are specific to this application). The list is passed as an OData filter expression.

Example REST API call

POST https://[search service].search.windows.net/indexes/securedfiles/docs/search?api-version=2020-06-30
Content-Type: application/json
api-key: [admin or query key]

{
    "filter": "group_ids/any(g:search.in(g, 'group_id1, group_id2'))"
}

Example C#

...
// Collect the user's Azure AD group IDs from the "groups" claims
// and build the OData security filter from them.
var groupIds = string.Join(", ", user.Claims.Where(x => x.Type == "groups").Select(x => x.Value));
var filter = $"group_ids/any(g:search.in(g, '{groupIds}'))";

SearchOptions searchOption = new SearchOptions
{
    Filter = filter,
    QueryType = SearchQueryType.Semantic,
    QueryLanguage = "en-us",
    QuerySpeller = "lexicon",
    SemanticConfigurationName = "default",
    Size = top,
    QueryCaption = useSemanticCaptions ? QueryCaptionType.Extractive : QueryCaptionType.None,
};

var searchResultResponse = await searchClient.SearchAsync<SearchDocument>(query, searchOption, cancellationToken);
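
Because the filter is built by string concatenation, it is worth making sure every value really is a GUID before it goes into the expression. The sketch below does that with Python’s standard `uuid` module. `build_group_filter` is a hypothetical helper, and it assumes the index stores group IDs in lowercase, hyphenated form, since parsing normalizes the values to that shape.

```python
import uuid
from typing import Iterable


def build_group_filter(group_ids: Iterable[str], field: str = "group_ids") -> str:
    """Build the OData security filter for a user's Azure AD group IDs."""
    # uuid.UUID raises ValueError for anything that is not a GUID, so stray
    # quotes or commas in a claim value cannot change the filter expression.
    safe_ids = [str(uuid.UUID(g)) for g in group_ids]
    joined = ",".join(safe_ids)
    return f"{field}/any(g:search.in(g, '{joined}'))"


user_groups = [
    "6b1a9b9e-2f0c-4c7e-9d3a-0f5e8c1d2a34",
    "D2C1E0F9-8A7B-4C6D-9E5F-0A1B2C3D4E5F",
]
print(build_group_filter(user_groups))
```

How to handle a user with no groups at all (for example, skipping the search entirely) is left to the caller in this sketch.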
