I have this code for setting up multiprocessing:
Code:
# Imports required by this function
import concurrent.futures
import multiprocessing
from pathlib import Path
from typing import List, Optional, Set, Tuple

from tqdm import tqdm


def process_theme_files(theme_dir: Path, processed_files: Optional[Set[str]] = None, theme_processed_files: Optional[Set[str]] = None) -> Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]:
    """
    Processes the files of a specific theme in a multiprocessed manner.

    Parameters:
    - theme_dir (Path): Path object pointing to the theme directory.
    - processed_files (set, optional): Set of globally processed files. Defaults to None.
    - theme_processed_files (set, optional): Set of files processed in the current theme. Defaults to None.

    Returns:
    - Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]: A tuple containing a list of tuples
      with the information extracted from the theme files and a list of exceptions encountered during processing.
    """
    results = []
    exceptions = []
    # Number of processes to be used (can be adjusted as needed)
    num_processes = multiprocessing.cpu_count()
    # Initialize processed_files and theme_processed_files as empty sets if not provided
    if processed_files is None:
        processed_files = set()
    if theme_processed_files is None:
        theme_processed_files = set()
    # Get PDF files in the theme directory
    pdf_files = list(theme_dir.glob('**/*.pdf'))
    # Create progress bar
    with tqdm(total=len(pdf_files), desc='Processing PDFs', unit='file') as pbar:
        # Process PDF files in parallel using ProcessPoolExecutor
        with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
            # Submit the process_file function for each PDF file in the list
            future_to_file = {executor.submit(process_file, pdf_file): pdf_file for pdf_file in pdf_files}
            # Iterate over results as they become available
            for future in concurrent.futures.as_completed(future_to_file):
                pdf_file = future_to_file[future]
                try:
                    # Get the result of the task
                    file_results, file_exceptions = future.result()
                    # Extend the results list
                    results.extend(file_results)
                    # Append specific exceptions to the exceptions list
                    exceptions.extend(file_exceptions)
                except FileNotFoundError as fnfe:
                    exceptions.append(f"File not found: {fnfe.filename}")
                except Exception as e:
                    # Capture and log the generic exception
                    exceptions.append(f"Error processing file '{pdf_file}': {e}")
                # Update the progress bar
                pbar.update(1)
    return results, exceptions
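For reference, process_file (shown in the next block) is the worker each pool process runs. A minimal call of the driver might look like this (just a sketch, where "themes/algebra" is a hypothetical theme directory):

Code:
from pathlib import Path

rows, errors = process_theme_files(Path("themes/algebra"))
print(f"Extracted {len(rows)} rows with {len(errors)} problems")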
Code:
# Imports required by this function
import os
from io import BytesIO
from pathlib import Path

from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage


def process_file(file_path: Path):
    """
    Process a PDF file to extract text and information.

    Args:
    - file_path (Path): Path object representing the location of the PDF file.

    Returns:
    - Tuple[List, List]: A tuple containing two lists:
        1. List of extracted results.
        2. List of exceptions encountered during processing.

    Raises:
    - FileNotFoundError: If the specified file_path does not exist.
    - Exception: For any other unexpected errors during processing.
    """
    results = []     # List to store information extracted from each page
    exceptions = []  # List to store exceptions encountered during processing
    try:
        # Check the size of the PDF file
        pdf_size_bytes = os.path.getsize(file_path)
        pdf_size_mb = pdf_size_bytes / (1024 * 1024)
        # Skip the file if it exceeds the maximum allowed size
        if pdf_size_mb > MAX_PDF_SIZE_MB:
            exceptions.append((file_path, f"The file exceeds the maximum allowed size of {MAX_PDF_SIZE_MB} MB."))
            print(f'{file_path} - Exceeds maximum allowed size of {MAX_PDF_SIZE_MB} MB.')
            return results, exceptions
        # Open the PDF file and read its content into a BytesIO buffer
        with file_path.open('rb') as file:
            pdf_data_buffer = BytesIO(file.read())
            # Iterate through each page of the PDF
            for page_number, page in enumerate(PDFPage.get_pages(pdf_data_buffer, check_extractable=True)):
                # Extract text from the current page
                page_text = extract_text(pdf_data_buffer, page_numbers=[page_number])
                # Process the extracted text to extract information
                page_results, page_exceptions = extract_information(page_text, file_path.name, page_number + 1)  # Page numbers are 1-based
                # Extend results and exceptions lists with page-specific results and exceptions
                results.extend(page_results)
                exceptions.extend(page_exceptions)
    except FileNotFoundError as e:
        # Handle the case where the file does not exist
        exceptions.append(e)
        print(f"FileNotFoundError: {e}")
        raise
    except Exception as e:
        # Handle any other unexpected exceptions
        exceptions.append(e)
        print(f"Exception: {e}")
        raise
    return results, exceptions
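Called outside the pool the worker behaves the same way, which makes it easy to test on a single document; a quick sanity-check sketch, assuming a hypothetical sample.pdf in the working directory:

Code:
from pathlib import Path

rows, errors = process_file(Path("sample.pdf"))
print(f"{len(rows)} rows extracted, {len(errors)} exceptions recorded")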

From my research I learned that PDF files cannot be read randomly; they have to be read sequentially from the beginning of the file to the end, which is how I have implemented it.
Some of my PDFs are around 100 MB, though never more than 200 MB, and some of them are quite long (1000 pages) and contain a lot of images. Since I have to read every page when processing a PDF, the only workaround I could find was to cap the size of the PDFs I read at under 100 MB. I also can't see a way to limit the number of pages, because to determine the page count I would have to open and read the file.
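To make that workaround concrete: the cap is just a check on the file's on-disk size, like the guard already inside process_file, applied before anything is read into memory. A minimal sketch under that assumption ("themes/algebra" and the 100 MB limit are placeholders):

Code:
from pathlib import Path

MAX_PDF_SIZE_MB = 100  # placeholder cap from the workaround described above

def small_enough(pdf_path: Path) -> bool:
    # Uses filesystem metadata only, so nothing from the PDF is read into RAM
    return pdf_path.stat().st_size / (1024 * 1024) <= MAX_PDF_SIZE_MB

pdf_files = [p for p in Path("themes/algebra").glob("**/*.pdf") if small_enough(p)]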
How can I limit the RAM usage of this program?
More details here: https://stackoverflow.com/questions/788 ... ws-the-ram