microsoft · BenConstable9 · Sep 11, 2024 · Sep 5, 2024
@@ -0,0 +1,6 @@
+{
+  "recommendations": [
+    "ms-azuretools.vscode-azurefunctions",
+    "ms-python.python"
+  ]
+}
@@ -0,0 +1,15 @@
+{
+  "configurations": [
+    {
+      "connect": {
+        "host": "localhost",
+        "port": 9091
+      },
+      "name": "Attach to Python Functions",
+      "preLaunchTask": "func: host start",
+      "request": "attach",
+      "type": "debugpy"
+    }
+  ],
+  "version": "0.2.0"
+}
@@ -0,0 +1,7 @@
+{
+  "azureFunctions.projectLanguage": "Python",
+  "azureFunctions.projectLanguageModel": 2,
+  "azureFunctions.projectRuntime": "~4",
+  "azureFunctions.scmDoBuildDuringDeployment": true,
+  "debug.internalConsoleOptions": "neverOpen"
+}
@@ -0,0 +1,15 @@
+{
+  "tasks": [
+    {
+      "command": "host start",
+      "isBackground": true,
+      "label": "func: host start",
+      "options": {
+        "cwd": "${workspaceFolder}/ai_search_with_adi/function_app"
+      },
+      "problemMatcher": "$func-python-watch",
+      "type": "func"
+    }
+  ],
+  "version": "2.0.0"
+}
@@ -38,149 +38,21 @@ The properties returned from the ADI Custom Skill are then used to perform the f
 
 ## Provided Notebooks \& Utilities
 
-- `./ai_search.py`, `./deployment.py` provide an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search.
+- `./ai_search.py`, `./deploy.py` provide an easy Python-based utility for deploying an index, indexer and corresponding skillset for AI Search.
 - `./function_apps/indexer` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI, etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
 - `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.
 
+## Deploying AI Search Setup
+
+To deploy the pre-built index and the associated indexer / skillset setup, see the instructions in `./ai_search/README.md`.
+
 ## ADI Custom Skill
 
 Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
 
 To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
 
-### function_app.py
-
-`./function_apps/indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
-
-### adi_2_aisearch
-
-`./function_apps/indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
-
-#### analyse_document
-
-This method takes the passed file, uploads it to ADI and retrieves the Markdown format.
-
-#### process_figures_from_extracted_content
-
-This method takes the detected figures, and crops them out of the page to save them as images. It uses the `understand_image_with_vlm` to communicate with Azure OpenAI to understand the meaning of the extracted figure.
-
-`update_figure_description` is used to update the original Markdown content with the description and meaning of the figure.
-
-#### clean_adi_markdown
-
-This method performs the final cleaning of the Markdown contents. In this method, the section headings and page numbers are extracted for the content to be returned to the indexer.
-
-### Input Format
-
-The ADI Skill conforms to the [Azure AI Search Custom Skill Input Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-input-json-structure). AI Search will automatically build this format if you use the utility file provided in this repo to build your indexer and skillset.
-
-```json
-{
-    "values": [
-        {
-            "recordId": "0",
-            "data": {
-                "source": "<FULL URI TO BLOB>"
-            }
-        },
-        {
-            "recordId": "1",
-            "data": {
-                "source": "<FULL URI TO BLOB>"
-            }
-        }
-    ]
-}
-```
-
-### Output Format
-
-The ADI Skill conforms to the [Azure AI Search Custom Skill Output Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-output-json-structure).
-
-If `chunk_by_page` header is `True` (recommended):
-
-```json
-{
-    "values": [
-        {
-            "recordId": "0",
-            "data": {
-                "extracted_content": [
-                    {
-                        "page_number": 1,
-                        "sections": [
-                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
-                        ],
-                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
-                    },
-                    {
-                        "page_number": 2,
-                        "sections": [
-                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
-                        ],
-                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
-                    }
-                ]
-            }
-        },
-        {
-            "recordId": "1",
-            "data": {
-                "extracted_content": [
-                    {
-                        "page_number": 1,
-                        "sections": [
-                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
-                        ],
-                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
-                    },
-                    {
-                        "page_number": 2,
-                        "sections": [
-                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
-                        ],
-                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
-                    }
-                ]
-            }
-        }
-    ]
-}
-```
-
-If `chunk_by_page` header is `False`:
-
-```json
-{
-    "values": [
-        {
-            "recordId": "0",
-            "data": {
-                "extracted_content": {
-                    "sections": [
-                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
-                    ],
-                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
-                }
-            }
-        },
-        {
-            "recordId": "1",
-            "data": {
-                "extracted_content": {
-                    "sections": [
-                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
-                    ],
-                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
-                }
-            }
-        }
-    ]
-}
-```
-
-**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
-
+Steps for deployment of the function app can be found in `./function_app/README.md`.
 
 ## Production Considerations
 

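The request to the `/adi_2_ai_search` endpoint described above can be sketched as follows. This is a minimal illustration rather than code from the repository: the helper names `build_skill_request` and `make_http_request` are hypothetical, the `/api` route prefix assumes the Azure Functions default, and key-based access via the `x-functions-key` header is assumed. The payload shape and the `chunk_by_page` header follow the input format documented for the skill.

```python
import json
import urllib.request
from typing import Sequence


def build_skill_request(blob_uris: Sequence[str]) -> dict:
    """Build a body in the AI Search custom skill input format.

    Each record gets a string recordId and the blob URI under data.source.
    """
    return {
        "values": [
            {"recordId": str(index), "data": {"source": uri}}
            for index, uri in enumerate(blob_uris)
        ]
    }


def make_http_request(
    endpoint: str,
    function_key: str,
    blob_uris: Sequence[str],
    chunk_by_page: bool = True,
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the ADI skill.

    Assumptions: default '/api' route prefix and function-key authentication.
    """
    body = json.dumps(build_skill_request(blob_uris)).encode("utf-8")
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/api/adi_2_ai_search",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "chunk_by_page": str(chunk_by_page),
            "x-functions-key": function_key,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; the blob URIs must be reachable by the deployed function app.
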
diff --git a/ai_search_with_adi/ai_search/.env b/ai_search_with_adi/ai_search/.env
@@ -0,0 +1,20 @@
+FunctionApp__Endpoint=<functionAppEndpoint>
+FunctionApp__Key=<functionAppKey>
+FunctionApp__PreEmbeddingCleaner__FunctionName=pre_embedding_cleaner
+FunctionApp__ADI__FunctionName=adi_2_ai_search
+FunctionApp__KeyPhraseExtractor__FunctionName=key_phrase_extractor
+FunctionApp__AppRegistrationResourceId=<App registration in form api://appRegistrationclientId if using identity based connections>
+IdentityType=<identityType> # system_assigned or user_assigned or key
+AIService__AzureSearchOptions__Endpoint=<searchServiceEndpoint>
+AIService__AzureSearchOptions__Identity__ClientId=<clientId if using user assigned identity>
+AIService__AzureSearchOptions__Key=<searchServiceKey if not using identity>
+AIService__AzureSearchOptions__UsePrivateEndpoint=<true/false>
+AIService__AzureSearchOptions__Identity__FQName=<fully qualified name of the identity if using user assigned identity>
+StorageAccount__FQEndpoint=<Fully qualified endpoint in form ResourceId=resourceId if using identity based connections>
+StorageAccount__ConnectionString=<connectionString if using non managed identity>
+StorageAccount__RagDocuments__Container=<containerName>
+OpenAI__ApiKey=<openAIKey if using non managed identity>
+OpenAI__Endpoint=<openAIEndpoint>
+OpenAI__EmbeddingModel=<openAIEmbeddingModelName>
+OpenAI__EmbeddingDeployment=<openAIEmbeddingDeploymentId>
+OpenAI__EmbeddingDimensions=1536
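Because only some of these settings apply to a given `IdentityType`, a small pre-flight check can catch missing values before deployment. The sketch below is hypothetical and not part of the repository: the function `missing_settings` and the grouping of settings per identity type are assumptions inferred from the placeholder text in the `.env` file above.

```python
from typing import Dict, List, Mapping

# Assumed to be needed regardless of the authentication mode.
ALWAYS_REQUIRED: List[str] = [
    "AIService__AzureSearchOptions__Endpoint",
    "FunctionApp__Endpoint",
    "OpenAI__Endpoint",
]

# Assumed grouping, inferred from the "if using ..." hints in the .env placeholders.
REQUIRED_BY_IDENTITY: Dict[str, List[str]] = {
    "key": [
        "AIService__AzureSearchOptions__Key",
        "StorageAccount__ConnectionString",
        "OpenAI__ApiKey",
    ],
    "system_assigned": [
        "FunctionApp__AppRegistrationResourceId",
        "StorageAccount__FQEndpoint",
    ],
    "user_assigned": [
        "FunctionApp__AppRegistrationResourceId",
        "StorageAccount__FQEndpoint",
        "AIService__AzureSearchOptions__Identity__ClientId",
        "AIService__AzureSearchOptions__Identity__FQName",
    ],
}


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of settings that are absent or empty for the chosen IdentityType."""
    identity_type = env.get("IdentityType", "").strip().lower()
    if identity_type not in REQUIRED_BY_IDENTITY:
        raise ValueError(
            "IdentityType must be one of: " + ", ".join(sorted(REQUIRED_BY_IDENTITY))
        )
    required = ALWAYS_REQUIRED + REQUIRED_BY_IDENTITY[identity_type]
    return [name for name in required if not env.get(name)]
```

Calling `missing_settings(os.environ)` after loading the `.env` file would list whatever still needs to be filled in.
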
diff --git a/ai_search_with_adi/ai_search/README.md b/ai_search_with_adi/ai_search/README.md
@@ -0,0 +1,18 @@
+# AI Search Indexing with Azure Document Intelligence - Pre-built Index Setup
+
+This portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.
+
+## Steps
+
+1. Update the `.env` file with the associated values. Not all values are required; which ones you need depends on whether you are using System / User Assigned Identities or key-based authentication.
+2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline, so adjust the skills used to enrich the data source there.
+3. Run `deploy.py` with the following args:
+
+    - `indexer_type rag`. This selects the `rag_documents` sub class.
+    - `enable_page_chunking True`. This determines whether page-wise chunking is applied in ADI, or whether the inbuilt TextSplit skill is used. **Page-wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks when the chunking is performed.**
+    - `rebuild`. Whether to delete and rebuild the index.
+    - `suffix`. Optional parameter that applies a suffix to the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.
+
+## ai_search.py & environment.py
+
+These files contain a variety of helpers and scripts used to deploy the index setup. They are useful for CI/CD, as they avoid having to write JSON files manually or use the UI to deploy the pipeline.
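The arguments listed in step 3 could be parsed along the following lines. This is a sketch under stated assumptions, not the actual contents of `deploy.py`: the `--` flag syntax, the defaults, the helper names `str_to_bool`, `build_parser` and `apply_suffix`, and the `-` separator used for the suffix are all hypothetical. Only the argument names and their meaning come from the README.

```python
import argparse
from typing import Optional


def str_to_bool(value: str) -> bool:
    """Parse 'True' / 'False' style values such as `enable_page_chunking True`."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: argument names follow the README, defaults are assumed.
    parser = argparse.ArgumentParser(
        description="Deploy an AI Search index, indexer and skillset."
    )
    parser.add_argument("--indexer_type", required=True, choices=["rag"])
    parser.add_argument("--enable_page_chunking", type=str_to_bool, default=False)
    parser.add_argument("--rebuild", type=str_to_bool, default=False)
    parser.add_argument("--suffix", default=None)
    return parser


def apply_suffix(name: str, suffix: Optional[str]) -> str:
    """Append the optional suffix to a resource name (separator is an assumption)."""
    return f"{name}-{suffix}" if suffix else name
```

With such a parser, a test deployment might be invoked as `python deploy.py --indexer_type rag --enable_page_chunking True --suffix test`, leaving the main index untouched.
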