Feature/adi skillset #11

Merged
merged 36 commits into from
Sep 11, 2024
7029a18
changes for adi based skillset
priyal1508 Sep 5, 2024
8d177a0
changes for common scripts
priyal1508 Sep 5, 2024
4617028
fixing bugs
priyal1508 Sep 5, 2024
252c31e
Merge branch 'main' of https://github.com/microsoft/dstoolkit-text2sq…
priyal1508 Sep 5, 2024
de48566
changes in fodler structure
priyal1508 Sep 5, 2024
b4b1409
Merge branch 'main' of https://github.com/microsoft/dstoolkit-text2sq…
priyal1508 Sep 9, 2024
e9a0b8e
adi and indexer changes
priyal1508 Sep 9, 2024
668d9eb
Merge branch 'main' of https://github.com/microsoft/dstoolkit-text2sq…
priyal1508 Sep 9, 2024
ad8684f
Update some of the deployment scripts
BenConstable9 Sep 10, 2024
42adc2a
Update the deployment script
BenConstable9 Sep 10, 2024
3238e64
Temp update of code
BenConstable9 Sep 10, 2024
666203e
Temp update of code
BenConstable9 Sep 10, 2024
5c74a89
Temp update of code
BenConstable9 Sep 10, 2024
a06909d
Refactor envs
BenConstable9 Sep 10, 2024
97f32a6
Add openai setting
BenConstable9 Sep 10, 2024
424e090
Update route
BenConstable9 Sep 10, 2024
fc08689
Fix adi bugs
BenConstable9 Sep 10, 2024
6168995
Remove uneeded code
BenConstable9 Sep 10, 2024
7f27d93
Update the function app code
BenConstable9 Sep 10, 2024
ecabdb2
Update readmes
BenConstable9 Sep 10, 2024
d433851
Restructure
BenConstable9 Sep 10, 2024
cdab428
Update changes
BenConstable9 Sep 10, 2024
e40054f
Fix deployment bugs in indexer
BenConstable9 Sep 10, 2024
a9386a9
Storage account code bug fix
BenConstable9 Sep 10, 2024
0634e24
Further bug fixes
BenConstable9 Sep 10, 2024
c94ef08
Bug fix adi code
BenConstable9 Sep 10, 2024
23ec029
Update prompt
BenConstable9 Sep 10, 2024
bde3d6d
Fix section bugs
BenConstable9 Sep 10, 2024
4ecf2c8
More bug fixes
BenConstable9 Sep 10, 2024
3928edb
Handle new way of finding figures
BenConstable9 Sep 10, 2024
6d7ca86
return offsets
BenConstable9 Sep 10, 2024
41a4237
Update code
BenConstable9 Sep 10, 2024
b8f7984
Fix figure detection
BenConstable9 Sep 10, 2024
0e6c58f
Refactor location
BenConstable9 Sep 10, 2024
3191419
Further refactor
BenConstable9 Sep 10, 2024
a1e8899
Update deployment logic
BenConstable9 Sep 11, 2024
6 changes: 6 additions & 0 deletions .vscode/extensions.json
@@ -0,0 +1,6 @@
{
    "recommendations": [
        "ms-azuretools.vscode-azurefunctions",
        "ms-python.python"
    ]
}
15 changes: 15 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,15 @@
{
    "configurations": [
        {
            "connect": {
                "host": "localhost",
                "port": 9091
            },
            "name": "Attach to Python Functions",
            "preLaunchTask": "func: host start",
            "request": "attach",
            "type": "debugpy"
        }
    ],
    "version": "0.2.0"
}
7 changes: 7 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,7 @@
{
    "azureFunctions.projectLanguage": "Python",
    "azureFunctions.projectLanguageModel": 2,
    "azureFunctions.projectRuntime": "~4",
    "azureFunctions.scmDoBuildDuringDeployment": true,
    "debug.internalConsoleOptions": "neverOpen"
}
15 changes: 15 additions & 0 deletions .vscode/tasks.json
@@ -0,0 +1,15 @@
{
    "tasks": [
        {
            "command": "host start",
            "isBackground": true,
            "label": "func: host start",
            "options": {
                "cwd": "${workspaceFolder}/ai_search_with_adi/function_app"
            },
            "problemMatcher": "$func-python-watch",
            "type": "func"
        }
    ],
    "version": "2.0.0"
}
140 changes: 6 additions & 134 deletions ai_search_with_adi/README.md
@@ -38,149 +38,21 @@ The properties returned from the ADI Custom Skill are then used to perform the f

## Provided Notebooks \& Utilities

- `./ai_search.py`, `./deployment.py` provide an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search.
- `./ai_search.py`, `./deploy.py` provide an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search.
- `./function_apps/indexer` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.

## Deploying AI Search Setup

To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./ai_search/README.md`.

## ADI Custom Skill

Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.

To use the skill with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your own skillset pipeline.

### function_app.py

`./function_apps/indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.

### adi_2_aisearch

`./function_apps/indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:

#### analyse_document

This method takes the passed file, uploads it to ADI and retrieves the Markdown format.

#### process_figures_from_extracted_content

This method takes the detected figures and crops them out of the page to save them as images. It uses `understand_image_with_vlm` to communicate with Azure OpenAI to understand the meaning of each extracted figure.

`update_figure_description` is used to update the original Markdown content with the description and meaning of the figure.

#### clean_adi_markdown

This method performs the final cleaning of the Markdown contents. In this method, the section headings and page numbers are extracted for the content to be returned to the indexer.
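
The heading extraction described above can be illustrated with a minimal sketch. This is not the repository's `clean_adi_markdown` implementation; the function name `extract_section_headings` and the regular expression are assumptions made for illustration, and the sketch only handles ATX-style (`#`) Markdown headings.

```python
import re

# Hypothetical pattern: matches ATX-style Markdown headings such as
# "# Title" or "### Sub-section" at the start of a line.
HEADING_PATTERN = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_section_headings(markdown: str) -> list[str]:
    """Return the text of every ATX heading, in document order."""
    return [match.group(2) for match in HEADING_PATTERN.finditer(markdown)]


sample = "# Annual Report\n\nIntro text.\n\n## Revenue\n\nNumbers.\n"
headings = extract_section_headings(sample)  # ['Annual Report', 'Revenue']
```

A real implementation would also need to associate each heading with its page number, which depends on how ADI marks page boundaries in its Markdown output.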

### Input Format

The ADI Skill conforms to the [Azure AI Search Custom Skill Input Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-input-json-structure). AI Search will automatically build this format if you use the utility file provided in this repo to build your indexer and skillset.

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "source": "<FULL URI TO BLOB>"
            }
        },
        {
            "recordId": "1",
            "data": {
                "source": "<FULL URI TO BLOB>"
            }
        }
    ]
}
```
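
For experimenting with the skill outside of an indexer, a request in this shape could be assembled by hand. The sketch below only prepares the request and does not send it. The helper names, the example endpoint URL (including the `/api` route prefix) and the string format of the `chunk_by_page` header value are assumptions; `x-functions-key` is the standard Azure Functions key header.

```python
import json
import urllib.request


def build_skill_payload(blob_uris: list[str]) -> dict:
    """Build a custom skill request body with one record per blob URI."""
    return {
        "values": [
            {"recordId": str(index), "data": {"source": uri}}
            for index, uri in enumerate(blob_uris)
        ]
    }


def build_skill_request(endpoint, function_key, blob_uris, chunk_by_page=True):
    """Prepare (but do not send) a POST request for the ADI skill endpoint."""
    body = json.dumps(build_skill_payload(blob_uris)).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "x-functions-key": function_key,
        # Header name taken from this README; the value format is assumed.
        "chunk_by_page": str(chunk_by_page),
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical endpoint and blob URI, for illustration only.
request = build_skill_request(
    "https://example.azurewebsites.net/api/adi_2_ai_search",
    "<functionAppKey>",
    ["https://example.blob.core.windows.net/docs/report.pdf"],
)
```

Passing the prepared request to `urllib.request.urlopen` would send it; any HTTP client works equally well.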

### Output Format

The ADI Skill conforms to the [Azure AI Search Custom Skill Output Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-output-json-structure).

If `chunk_by_page` header is `True` (recommended):

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": [
                    {
                        "page_number": 1,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
                    },
                    {
                        "page_number": 2,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
                    }
                ]
            }
        },
        {
            "recordId": "1",
            "data": {
                "extracted_content": [
                    {
                        "page_number": 1,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
                    },
                    {
                        "page_number": 2,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
                    }
                ]
            }
        }
    ]
}
```

If `chunk_by_page` header is `False`:

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": {
                    "sections": [
                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
                    ],
                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
                }
            }
        },
        {
            "recordId": "1",
            "data": {
                "extracted_content": {
                    "sections": [
                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
                    ],
                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
                }
            }
        }
    ]
}
```
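
Because `extracted_content` is a list when chunking by page and a single object otherwise, a consumer may want to normalise the two shapes. The following is a small sketch of one way to do that; `iter_chunks` is a hypothetical helper and not part of the function app.

```python
def iter_chunks(skill_response: dict):
    """Yield (record_id, page_number, sections, content) for every chunk.

    page_number is None when the document was not chunked by page.
    """
    for record in skill_response.get("values", []):
        extracted = record["data"]["extracted_content"]
        # Page-wise output is a list of pages; otherwise it is one object.
        pages = extracted if isinstance(extracted, list) else [extracted]
        for page in pages:
            yield (
                record["recordId"],
                page.get("page_number"),
                page["sections"],
                page["content"],
            )


# Illustrative responses in the two shapes shown above.
paged = {"values": [{"recordId": "0", "data": {"extracted_content": [
    {"page_number": 1, "sections": ["Intro"], "content": "# Intro"},
    {"page_number": 2, "sections": ["Results"], "content": "# Results"},
]}}]}
whole = {"values": [{"recordId": "0", "data": {"extracted_content": {
    "sections": ["Intro", "Results"], "content": "# Intro\n# Results",
}}}]}

paged_chunks = list(iter_chunks(paged))
whole_chunks = list(iter_chunks(whole))
```

Downstream code can then treat both modes uniformly and only check whether `page_number` is `None`.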

**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**

Steps for deployment of the function app can be found in `./function_app/README.md`.

## Production Considerations

20 changes: 20 additions & 0 deletions ai_search_with_adi/ai_search/.env
@@ -0,0 +1,20 @@
FunctionApp__Endpoint=<functionAppEndpoint>
FunctionApp__Key=<functionAppKey>
FunctionApp__PreEmbeddingCleaner__FunctionName=pre_embedding_cleaner
FunctionApp__ADI__FunctionName=adi_2_ai_search
FunctionApp__KeyPhraseExtractor__FunctionName=key_phrase_extractor
FunctionApp__AppRegistrationResourceId=<App registration in form api://appRegistrationclientId if using identity based connections>
IdentityType=<identityType> # system_assigned or user_assigned or key
AIService__AzureSearchOptions__Endpoint=<searchServiceEndpoint>
AIService__AzureSearchOptions__Identity__ClientId=<clientId if using user assigned identity>
AIService__AzureSearchOptions__Key=<searchServiceKey if not using identity>
AIService__AzureSearchOptions__UsePrivateEndpoint=<true/false>
AIService__AzureSearchOptions__Identity__FQName=<fully qualified name of the identity if using user assigned identity>
StorageAccount__FQEndpoint=<Fully qualified endpoint in form ResourceId=resourceId if using identity based connections>
StorageAccount__ConnectionString=<connectionString if using non managed identity>
StorageAccount__RagDocuments__Container=<containerName>
OpenAI__ApiKey=<openAIKey if using non managed identity>
OpenAI__Endpoint=<openAIEndpoint>
OpenAI__EmbeddingModel=<openAIEmbeddingModelName>
OpenAI__EmbeddingDeployment=<openAIEmbeddingDeploymentId>
OpenAI__EmbeddingDimensions=1536
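
As the placeholders above suggest, which values are needed depends on `IdentityType`. The sketch below shows one way such settings might be read and validated; it is not the repository's `environment.py`. The function name `load_search_settings` and the returned dictionary keys are assumptions, while the variable names come from the `.env` file above.

```python
import os

VALID_IDENTITY_TYPES = {"system_assigned", "user_assigned", "key"}


def load_search_settings(env=None) -> dict:
    """Read the AI Search settings required for the chosen identity type."""
    env = os.environ if env is None else env
    identity_type = env.get("IdentityType", "").strip().lower()
    if identity_type not in VALID_IDENTITY_TYPES:
        raise ValueError(f"Unsupported IdentityType: {identity_type!r}")

    settings = {
        "identity_type": identity_type,
        "endpoint": env["AIService__AzureSearchOptions__Endpoint"],
    }
    if identity_type == "key":
        # Key based authentication needs the search service key.
        settings["key"] = env["AIService__AzureSearchOptions__Key"]
    elif identity_type == "user_assigned":
        # A user assigned identity is selected by its client ID.
        settings["client_id"] = env["AIService__AzureSearchOptions__Identity__ClientId"]
    return settings


# Example with an explicit mapping instead of the process environment.
example = load_search_settings({
    "IdentityType": "key",
    "AIService__AzureSearchOptions__Endpoint": "https://example.search.windows.net",
    "AIService__AzureSearchOptions__Key": "<searchServiceKey>",
})
```
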
18 changes: 18 additions & 0 deletions ai_search_with_adi/ai_search/README.md
@@ -0,0 +1,18 @@
# AI Search Indexing with Azure Document Intelligence - Pre-built Index Setup

This portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.

## Steps

1. Update the `.env` file with the associated values. Not all values are required; which ones you need depends on whether you are using System / User Assigned Identities or key based authentication.
2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Adjust the skills here as needed to enrich the data source.
3. Run `deploy.py` with the following args:

- `indexer_type rag`. This selects the `rag_documents` sub class.
- `enable_page_chunking True`. This determines whether page wise chunking is applied in ADI, or whether the inbuilt skill is used for TextSplit. **Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
- `rebuild`. Whether to delete and rebuild the index.
- `suffix`. Optional parameter that applies a suffix to the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.
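
To make the arguments above concrete, here is a hypothetical sketch of how a script might parse them with `argparse`. It is not the repository's `deploy.py`: the `--` flag spellings, the boolean parsing and the default values are all assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse values such as 'True' or 'false' into a bool."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Deploy an AI Search index and indexer.")
    # Selects which indexer sub class to deploy; only 'rag' is described here.
    parser.add_argument("--indexer_type", required=True, choices=["rag"])
    # Page wise chunking in ADI versus the inbuilt TextSplit skill.
    parser.add_argument("--enable_page_chunking", type=str_to_bool, default=True)
    # Whether to delete and rebuild the index (default is an assumption).
    parser.add_argument("--rebuild", type=str_to_bool, default=False)
    # Optional suffix for deploying a test version alongside the main one.
    parser.add_argument("--suffix", default=None)
    return parser


args = build_parser().parse_args(
    ["--indexer_type", "rag", "--enable_page_chunking", "True", "--suffix", "test"]
)
```

Check the actual `deploy.py` for the exact flag names before relying on this shape.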

## ai_search.py & environment.py

These files provide helper classes and scripts for deploying the index setup. They are useful for CI/CD, as they avoid having to write JSON files manually or use the UI to deploy the pipeline.