Problem setting up a FAISS vector memory in Python with embeddings

Post by Anonymous » 29 Dec 2024, 19:56

I am trying to run an LLM locally and feed it the contents of a very large PDF file. I decided to try this with RAG. To do that, I wanted to create a vector store holding the contents of the PDF. However, I ran into a problem while creating it that I cannot solve, as I am still quite new to this area.
The problem is that I am using FAISS and do not know how to pass my values to .from_embeddings.
As a result, I have already received several errors.
My code looks like this:

Code:

import re
import PyPDF2
from nltk.tokenize import sent_tokenize  # After downloading resources
from sentence_transformers import SentenceTransformer
from langchain_community.vectorstores import FAISS  # Updated import

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """

    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
    return text

if __name__ == "__main__":
    pdf_path = ""  # Replace with your actual path

    text = extract_text_from_pdf(pdf_path)
    print("Text extracted from PDF file successfully.")

    # Preprocess text to remove special characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII characters

    sentences = sent_tokenize(text)
    print(sentences)  # Print the extracted sentences

    # Filter out empty sentences (optional)
    sentences = [sentence for sentence in sentences if sentence.strip()]

    model_name = 'all-MiniLM-L6-v2'
    model = SentenceTransformer(model_name)

    # Ensure model.encode(sentences) returns a list of NumPy arrays
    embeddings = model.encode(sentences)

    vectorstore = FAISS.from_embeddings(embeddings, sentences_list=sentences)  # problem here
    print("Vector store created successfully.")

    # Example search query (replace with your actual question)
    query = "Was sind die wichtigsten Worte?"
    search_results = vectorstore.search(query)
    print("Search results:")
    for result in search_results:
        print(result)

If I run the code as it is, the following error occurs:

Code:

Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PythonProject/extract_pdf_text.py", line 53, in <module>
    vectorstore = FAISS.from_embeddings(embeddings, sentences_list=sentences)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: FAISS.from_embeddings() missing 1 required positional argument: 'embedding'

However, if I now write vectorstore = FAISS.from_embeddings(embedding=embeddings, sentences_list=sentences), then the text_embeddings parameter is reported as missing.
How do I need to fill in the parameters so that I can use this, or is there a better way to implement it?
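For reference, the two error messages above name the two required arguments of FAISS.from_embeddings in langchain_community: text_embeddings, an iterable of (text, vector) pairs, and embedding, an embeddings object that the store uses to embed queries at search time. The sketch below shows one way the call might be filled in. The helper names pair_texts_with_vectors and build_store are hypothetical, and using HuggingFaceEmbeddings in place of a bare SentenceTransformer is an assumption, not something taken from the original post.

```python
from typing import Iterable, List, Sequence, Tuple


def pair_texts_with_vectors(
    texts: Sequence[str],
    vectors: Iterable[Sequence[float]],
) -> List[Tuple[str, List[float]]]:
    """Zip each text with its vector as a (text, list_of_floats) pair."""
    # Convert every row to plain floats; this also works for NumPy rows.
    rows = [[float(x) for x in vec] for vec in vectors]
    if len(rows) != len(texts):
        raise ValueError("texts and vectors must have the same length")
    return list(zip(texts, rows))


def build_store(sentences, model_name="all-MiniLM-L6-v2"):
    """Hypothetical helper: build a FAISS store from precomputed vectors."""
    # Imported lazily so the pairing helper above runs without these packages.
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    # Assumption: an Embeddings wrapper around the same sentence-transformers model.
    embedder = HuggingFaceEmbeddings(model_name=model_name)
    vectors = embedder.embed_documents(list(sentences))
    return FAISS.from_embeddings(
        text_embeddings=pair_texts_with_vectors(sentences, vectors),
        embedding=embedder,
    )


# Small demonstration of the expected shape of text_embeddings.
pairs = pair_texts_with_vectors(["first", "second"], [[0.1, 0.2], [0.3, 0.4]])
print(pairs)
```

With a store built this way, a query would typically go through vectorstore.similarity_search(query, k=3), since vectorstore.search also expects a search_type argument. If the vectors do not need to be precomputed, FAISS.from_texts(sentences, embedder) is a shorter route that computes them internally.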

More details here: https://stackoverflow.com/questions/79315980/problem-setting-up-a-faiss-vector-memory-in-python-with-embeddings
