Python: filter vectors from a Pinecone vector store based on a field saved in the metadata of these vectors

Posted by Anonymous » 25 Jan 2025, 22:08

I have vectors stored in a Pinecone vector store; each vector represents the content of a PDF file:

Metadata:
    hash_code: "d53d7ec8b0e66e9a83a97acda09edd3fe9867cadb42833f9bf5525cc3b89fe2d"
    id: "cc54ffbe-9cba-4de9-9f30-a114e4c3c3fb"

I added a new field to the metadata, a hash code of the PDF file's content, to avoid adding the same file to the vector store more than once.
To do that, I compute the hash codes of the new documents I need to add. I then want to scan the existing vectors to find out whether any of those hashes are already present, and filter out the matching documents.
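For context, a content hash like the one stored in `hash_code` could be computed along these lines. This is a minimal sketch that assumes SHA-256 over the raw file bytes: the 64-hex-digit value above is consistent with that, but the post does not say which algorithm was used. `content_hash` and `file_hash` are hypothetical helper names.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw bytes (ASSUMED algorithm)."""
    return hashlib.sha256(data).hexdigest()


def file_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the file bytes means two byte-identical PDFs get the same `hash_code`, while a re-exported PDF with the same visible text would not.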
I am using Python and tried the code below, but so far I have not been able to achieve this.
First approach:
def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)

    # Extract hash_codes from the docs list using the appropriate method for your Document objects
    hash_codes = [doc.metadata['hash_code'] for doc in docs]  # Accessing 'metadata' if it's an attribute
    print("Hash Codes:", hash_codes)

    # Fetch by list of hash_codes (ensure hash_codes are valid ids)
    fetch_response = index.fetch(ids=hash_codes)
    print("Fetch Response:", fetch_response)

    # Get the existing hash_codes that are already in the Pinecone index
    existing_hash_codes = set(fetch_response.get('vectors', {}).keys())  # Extract existing IDs from the response
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

    # Filter out the docs that have already been added to Pinecone
    filtered_docs = [doc for doc in docs if doc.metadata['hash_code'] not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))

    return filtered_docs
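
A likely reason the first attempt finds nothing: `index.fetch` looks records up by vector ID, and the ID shown in the metadata above is a UUID, not the hash. If the hash were used as the vector ID at upsert time, the same `fetch` call could serve as the existence check. The sketch below assumes exactly that (it is not how the index in the question is set up), that `index` is an already-created Pinecone index object, and that the fetch response exposes a `vectors` mapping keyed by ID, as in recent Pinecone Python clients. `filter_new_docs_by_id` and `batch_size` are hypothetical names, and the default of 100 is an arbitrary choice to keep each request small.

```python
def filter_new_docs_by_id(index, docs, batch_size=100):
    """Return the docs whose hash_code is not already a vector ID in the index.

    ASSUMPTION: vectors were upserted with id == metadata['hash_code'].
    """
    # De-duplicate hashes while preserving their order.
    hashes = list(dict.fromkeys(doc.metadata["hash_code"] for doc in docs))

    existing = set()
    for start in range(0, len(hashes), batch_size):
        batch = hashes[start:start + batch_size]
        response = index.fetch(ids=batch)
        # Only IDs that exist in the index come back in the mapping.
        existing.update(response.vectors.keys())

    return [doc for doc in docs if doc.metadata["hash_code"] not in existing]
```

With this layout, re-upserting an unchanged file would also overwrite the existing record rather than create a duplicate, since Pinecone upserts replace records that share an ID.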

Then I tried another approach:
def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)

    # Extract hash_codes from the docs list using the appropriate method for your Document objects
    hash_codes = [doc.metadata['hash_code'] for doc in docs]  # Accessing 'metadata' if it's an attribute
    print("Hash Codes:", hash_codes)

    # We need to query Pinecone using `top_k` and search through the index
    query_response = index.query(
        top_k=100,  # Set a suitable `top_k` to return a reasonable number of documents
        include_metadata=True,
        # namespace=namespace
    )

    # Debug: Print the query response to see its structure
    print("Query Response:", query_response)

    # Extract the hash_codes of the existing documents in Pinecone
    existing_hash_codes = {item['metadata']['hash_code'] for item in query_response['matches']}
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

    # Filter out the docs that have already been added to Pinecone based on hash_code
    filtered_docs = [doc for doc in docs if str(doc.metadata['hash_code']) not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))

    return filtered_docs
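
The second attempt has two likely problems: `index.query` is a similarity search, so it needs a query vector (or an ID), and it returns at most `top_k` nearest records rather than scanning the whole index. Pinecone queries accept a metadata `filter`, so one possible approach is to ask, per hash, whether any record carries that `hash_code`. A minimal sketch, assuming `index` is a Pinecone index object, the default namespace is used, and `probe_vector` is any vector of the index's dimension (for example the embedding of one of the new documents); `filter_new_docs_by_metadata` and `probe_vector` are hypothetical names:

```python
def filter_new_docs_by_metadata(index, docs, probe_vector):
    """Return the docs whose hash_code is not found in any record's metadata.

    ASSUMPTION: probe_vector only satisfies the query API; the metadata
    filter decides which records are allowed to match.
    """
    existing = set()
    # One filtered query per distinct hash; top_k=1 is enough to test existence.
    for hash_code in dict.fromkeys(doc.metadata["hash_code"] for doc in docs):
        response = index.query(
            vector=probe_vector,
            top_k=1,
            filter={"hash_code": {"$eq": hash_code}},
            include_metadata=False,
        )
        if response.matches:
            existing.add(hash_code)

    return [doc for doc in docs if doc.metadata["hash_code"] not in existing]
```

This costs one round trip per distinct hash, which is simple but may be slow for large batches; it works without changing how the existing vectors were stored.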

More details here: https://stackoverflow.com/questions/79387289/python-filter-vectors-from-pinecone-vector-store-based-on-a-field-saved-in-the
