Extract data with Azure Document Intelligence
Azure Document Intelligence is an Azure AI service that enables you to build automated data processing software. This software can extract text, key/value pairs, and tables from form documents using optical character recognition (OCR). Azure Document Intelligence has pre-built models for recognizing invoices, receipts, business cards, and other common document types. The service also provides the capability to train custom models that can extract specific data fields from your own forms.
In this exercise you’ll use both prebuilt and custom Document Intelligence models to extract information from documents.
This exercise takes approximately 45 minutes.
Create a Document Intelligence resource
Azure Document Intelligence is included in Azure AI Services. You’ll create a Document Intelligence resource directly from the Document Intelligence Studio.
- In a web browser, navigate to the Document Intelligence Studio at
https://contentunderstanding.ai.azure.com/documentintelligence/studioand sign in with your Azure credentials. - In the Studio, select the Settings icon (⚙) in the upper-right corner, and then select the Resource tab.
- Select Create a new resource and configure it with the following settings:
- Subscription: Your Azure subscription
- Resource group: Create or select a resource group
- Name: A valid name for your Document Intelligence resource
- Region: Any available region
- Pricing tier: Free F0 (if you don’t have a Free tier available, select Standard S0)
- Select Create and wait for the resource to be deployed. The Studio automatically connects to the new resource.
Use the Read model in the portal
Now let’s use the Read model in the Studio to analyze a multilingual document:
- On the Document Intelligence Studio home page, under Document analysis, select the Read tile.
- In the list of documents on the left, select read-german.pdf.
- At the top toolbar, select Analyze options, then enable the Language check-box (under Optional detection) in the Analyze options pane and select Save.
- At the top-left, select Run Analysis.
- When the analysis is complete, the text extracted from the image is shown on the right in the Content tab. Review this text and compare it to the text in the original image for accuracy.
- Select the Result tab. This tab displays the extracted JSON code.
- Scroll to the bottom of the JSON code in the Result tab. Notice that the read model has detected the language of each span indicated by
locale. Most spans are in German (language codede) but you can find other language codes in the spans (e.g., English — language codeen— in one of the first spans).
Analyze an invoice with a prebuilt model using the Python SDK
Now let’s use the Document Intelligence Python SDK to analyze an invoice programmatically.
Prepare the development environment
- In the Azure portal, find the Document Intelligence resource you created earlier. Under Resource Management, select Keys and Endpoint, and note the Endpoint and one of the Keys. You’ll need these values shortly.
- Start Visual Studio Code.
- Open the Command Palette (press Ctrl+Shift+P), type Git: Clone, and select it.
-
In the URL bar, paste the following repository URL and press Enter:
https://github.com/microsoftlearning/mslearn-ai-information-extraction - Choose a local folder to clone into, and then when prompted, select Open to open the cloned repository in VS Code.
-
Open a new terminal and navigate to the prebuilt Document Intelligence folder:
cd Labfiles/03-document-intelligence/prebuilt/Python -
Install the required libraries:
python -m venv labenv labenv\Scripts\activate pip install -r requirements.txtNote: The requirements.txt installs the azure-ai-documentintelligence Python SDK package and its dependencies.
- In the VS Code Explorer pane, open the .env file in Labfiles/03-document-intelligence/prebuilt/Python.
- In the file, replace the YOUR_ENDPOINT and YOUR_KEY placeholders with your Document Intelligence resource endpoint and API key.
- Save the file (CTRL+S).
Add code to analyze an invoice
This is the sample invoice that your code will analyze:

-
In VS Code, open the document-analysis.py file.
-
In the code file, find the comment Import the required libraries and add the following code:
# Add references from azure.identity import DefaultAzureCredential from azure.ai.documentintelligence import DocumentIntelligenceClient from azure.ai.documentintelligence.models import AnalyzeDocumentRequest -
Find the comment Create the client and add the following code (being careful to maintain the correct indentation):
# Create the client document_analysis_client = DocumentIntelligenceClient( endpoint=endpoint, credential=DefaultAzureCredential() ) -
Find the comment Analyze the invoice and add the following code:
# Analyze the invoice poller = document_analysis_client.begin_analyze_document( "prebuilt-invoice", AnalyzeDocumentRequest(url_source=fileUri), locale=fileLocale ) -
Find the comment Display invoice information to the user and add the following code:
# Display invoice information to the user result = poller.result() for document in result.documents: vendor_name = document.fields.get("VendorName") if vendor_name: print(f"\nVendor Name: {vendor_name.get('valueString')}, with confidence {vendor_name.get('confidence')}.") customer_name = document.fields.get("CustomerName") if customer_name: print(f"Customer Name: {customer_name.get('valueString')}, with confidence {customer_name.get('confidence')}.") invoice_total = document.fields.get("InvoiceTotal") if invoice_total: amount = invoice_total.get("valueCurrency", {}) print(f"Invoice Total: {amount.get('currencySymbol', '$')}{amount.get('amount')}, with confidence {invoice_total.get('confidence')}.") - Review the code you added, which:
- Creates a
DocumentIntelligenceClientwith your endpoint and credentials. - Uses the
prebuilt-invoicemodel to analyze the document from a URL. - Iterates through the results and prints the vendor name, customer name, and invoice total.
- Creates a
- Save the file (CTRL+S).
-
In the VS Code terminal, run the application:
python document-analysis.py - Review the output. The program should display the vendor name, customer name, and invoice total with confidence levels. Compare the values with the sample invoice shown above.
Train and test a custom model
The prebuilt models are useful for common document types, but often you need to extract specific data from your own forms. You can train a custom Document Intelligence model to extract the specific fields you need.
Prepare training data
A setup script has been provided to create a storage account and upload sample forms for training.
-
In the VS Code terminal, navigate to the custom model folder:
cd ../../customTip: If you’re unsure of your current directory, run
cdto check. -
In VS Code, open the setup.sh file in Labfiles/03-document-intelligence/custom.
- Review the commands in the script. It will:
- Create a storage account in your Azure resource group
- Upload files from the sample-forms folder to a container
- Print a Shared Access Signature (SAS) URI
-
Modify the subscription_id, resource_group, and location variable declarations with the appropriate values for the subscription, resource group, and location name where you deployed the Document Intelligence resource.
Important: For your location string, use the code format (e.g.,
eastusfor “East US”). You can find this in the JSON View of your resource group in the Azure portal.If the expiry_date variable is in the past, update it to a future date, for example
2026-12-31. - Save the file (CTRL+S).
- To run the setup script, you need a Bash shell. You can use one of the following options, but be sure you’re logged in to your Azure account:
- Azure Cloud Shell: In the Azure portal, open a Cloud Shell (Bash), navigate to the folder, and run
./setup.sh. - VS Code terminal (with WSL or Git Bash on Windows): Run
bash setup.sh.
- Azure Cloud Shell: In the Azure portal, open a Cloud Shell (Bash), navigate to the folder, and run
- When the script completes, review the displayed output.
- In the Azure portal, refresh your resource group and verify that the storage account was created. Open the storage account and in Storage browser, expand Blob containers and select the sampleforms container to confirm the files were uploaded.
Train the model in Document Intelligence Studio
Now you’ll use the training forms to build a custom extraction model.
- Open a new browser tab and navigate to the Document Intelligence Studio at
https://documentintelligence.ai.azure.com/studio. - Scroll down to the Custom models section and select the Custom extraction model tile.
- If prompted, sign in with your Azure credentials.
- If asked which Azure Document Intelligence resource to use, select the subscription and resource name you used when you created the resource.
-
Under My Projects, create a new project with the following configuration:
- Enter project details:
- Project name: A valid name for your project
- Configure service resource:
- Subscription: Your Azure subscription
- Resource group: The resource group of your Document Intelligence resource
- Document Intelligence resource: Your Document Intelligence resource (select the Set as default option and use the default API version)
- Connect training data source:
- Subscription: Your Azure subscription
- Resource group: Your resource group
- Storage account: The storage account created by the setup script (select the Set as default option, select the
sampleformsblob container, and leave the folder path blank)
- Enter project details:
- When your project is created, on the top right of the page, select Train to train your model. Use the following configuration:
- Model ID: A valid name for your model — note it down for later
- Build Mode: Template
- Select Go to Models.
- Training may take some time. Wait until the model status shows succeeded.
Test the custom model with the Python SDK
-
In the VS Code terminal, navigate to the custom model Python folder:
cd Python -
Install the required packages (create a new virtual environment or reuse the existing one):
python -m venv labenv labenv\Scripts\activate pip install -r requirements.txt -
In VS Code, open the .env file in Labfiles/03-document-intelligence/custom/Python.
- Update the file with the following values:
- Your Document Intelligence endpoint
- Your Document Intelligence key
- The Model ID you specified when training your model
- Save the file (CTRL+S).
-
In VS Code, open the test-model.py file.
- Review the code, which uses the azure-ai-documentintelligence SDK. Notice that it references a test image hosted in the GitHub repo. The code creates a
DocumentIntelligenceClient, submits the image for analysis using your custom model, and prints the extracted fields. -
In the VS Code terminal, run the program:
python test-model.py -
Review the output. The program should display the field names and values extracted from the test form, such as
Merchant,CompanyPhoneNumber, and other fields you defined during training.
Clean up
If you’ve finished working with the Document Intelligence service, you should delete the resources you created in this exercise to avoid incurring unnecessary Azure costs.
- In the Azure portal, delete the resource group you created for this exercise.