How do I extract valuable information from the JSON output of Document AI custom extractors? - Цифровое Кемерово


Posted by Anonymous » 26 Nov 2024, 19:37

I am working with a simple custom extractor in Document AI that tries to find the following fields in any uploaded PDF file:
Country
Number
Address
Country
Mail
Address
City
I use the following code to extract the information and print the output JSON.

Code:

from google.cloud import documentai_v1 as documentai
import json
import os
from google.colab import files

# Credentials setup (assuming you've uploaded the service account key)
uploaded = files.upload()
key_file = list(uploaded.keys())[0]
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_file

# Configuration
PROJECT_ID = "682656916911"  # Replace with your project ID
LOCATION = "eu"  # Use the correct region
PROCESSOR_ID = "da26d6ce1aa73a53"  # Replace with your processor ID
DOCUMENT_PATH = "/content/W-8BEN.pdf"  # Path to your document

# Client setup
client_options = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=client_options)

# Request preparation
name = f"projects/{PROJECT_ID}/locations/{LOCATION}/processors/{PROCESSOR_ID}"
with open(DOCUMENT_PATH, "rb") as document_file:
    document_content = document_file.read()

request = {
    "name": name,
    "raw_document": {
        "content": document_content,
        "mime_type": "application/pdf",
    },
}

# Process document
response = client.process_document(request=request)

# Print response and extracted text
print(f"Response type: {type(response)}")
print(f"Document type: {type(response.document)}")

if document := response.document:  # Use walrus operator for cleaner assignment
    print("\nExtracted text:")
    print(document.text)

    # Convert Document object to dictionary (avoiding DESCRIPTOR field)
    document_dict = documentai.Document.to_dict(document)
    print("\nJSON representation of extracted data (excluding DESCRIPTOR):")
    print(json.dumps(document_dict, indent=4))

However, the JSON output is useless to me. It contains only coordinates plus a lot of noise and unnecessary information. Here is just a fragment:

Code:

{
"layout": {
"text_anchor": {
"text_segments": [
{
"start_index": "4392",
"end_index": "4396"
}
],
"content": ""
},
"confidence": 0.98868006,
"bounding_poly": {
"vertices": [
{
"x": 522,
"y": 1972
},
{
"x": 549,
"y": 1972
},
{
"x": 549,
"y": 1994
},
{
"x": 522,
"y": 1994
}
],
"normalized_vertices": [
{
"x": 0.29692832,
"y": 0.8668132
},
{
"x": 0.31228667,
"y": 0.8668132
},
{
"x": 0.31228667,
"y": 0.8764835
},
{
"x": 0.29692832,
"y": 0.8764835
}
]
},
"orientation": 1
},
"detected_break": {
"type_": 1
},
"detected_languages": [
{
"language_code": "en",
"confidence": 1.0
}
]
},
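
A fragment like this looks like a layout token: its text is not stored inline but referenced through `text_anchor` offsets into the document's top-level `text` string. Below is a minimal sketch of resolving such an anchor, assuming the dict layout shown in the fragment (indices serialized as strings). The helper name `anchor_text` and the sample text are hypothetical.

```python
def anchor_text(full_text, text_anchor):
    """Return the text that a text_anchor dict points to.

    Assumes the dict layout produced by Document.to_dict(): a list of
    text_segments whose start_index / end_index are offsets into the
    document's full text, serialized as strings.
    """
    pieces = []
    for segment in text_anchor.get("text_segments", []):
        # A missing or empty index is treated as 0.
        start = int(segment.get("start_index") or 0)
        end = int(segment.get("end_index") or 0)
        pieces.append(full_text[start:end])
    return "".join(pieces)


# Hypothetical sample data, not taken from the real document.
sample_text = "Name: Anna"
sample_anchor = {"text_segments": [{"start_index": "6", "end_index": "10"}]}
print(anchor_text(sample_text, sample_anchor))
```

With the real output, the call would be `anchor_text(document_dict["text"], token["layout"]["text_anchor"])`, where `token` stands for one element like the fragment above.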

I tried to filter out the fields I need but could not get it to work. I would like to see the key-value pairs so that I can validate the uploaded PDF, and to cut out all this noise.
Thanks in advance.

More details here: https://stackoverflow.com/questions/79227578/how-to-extract-valuable-information-from-the-json-output-of-document-ai-custom-e
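
One possible way to get the key-value view asked for above: in the Document schema, the fields found by an extractor are returned under `entities`, separate from the page layout data that makes up most of the dump. Below is a minimal sketch, assuming the dict comes from `documentai.Document.to_dict(document)` and that each entity carries `type_`, `mention_text`, `confidence` and optional nested `properties`. The helper name `extract_fields` and the sample values are hypothetical.

```python
def extract_fields(document_dict):
    """Flatten the entities of a Document dict into simple rows.

    Assumes the dict was produced by Document.to_dict(), where each entity
    has 'type_' (field name), 'mention_text' (value), 'confidence', and an
    optional list of nested 'properties' with the same shape.
    """
    rows = []
    for entity in document_dict.get("entities", []):
        rows.append({
            "field": entity.get("type_", ""),
            "value": entity.get("mention_text", ""),
            "confidence": entity.get("confidence", 0.0),
        })
        # Nested fields (one level deep) are prefixed with the parent name.
        for prop in entity.get("properties", []):
            rows.append({
                "field": f'{entity.get("type_", "")}.{prop.get("type_", "")}',
                "value": prop.get("mention_text", ""),
                "confidence": prop.get("confidence", 0.0),
            })
    return rows


# Hypothetical sample shaped like the assumed to_dict() output.
sample = {
    "text": "...",
    "pages": [],
    "entities": [
        {"type_": "Country", "mention_text": "Germany", "confidence": 0.97},
        {"type_": "City", "mention_text": "Berlin", "confidence": 0.91},
    ],
}
for row in extract_fields(sample):
    print(f'{row["field"]}: {row["value"]} ({row["confidence"]:.2f})')
```

Against the real response this would be called as `extract_fields(document_dict)`. An empty result would mean the processor returned no fields for that file, which is worth checking before filtering the layout data by hand.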

