Detecting and removing references from a PDF when the "References" or "Bibliography" section heading is missing

Posted by Anonymous » 25 Nov 2024, 16:59

I put together the following code to extract the manuscript text from an online PDF file (two-column layout). Because there is no clear "References" or "Bibliography" section heading, I have tried several times to detect the references and remove them from the extracted body text, without success. Do you have any suggestions for removing the references when the "References" or "Bibliography" section heading is missing?
This is the code that extracts the manuscript text from https://hal.science/hal-04206682/document:

Code:

import io

import fitz  # PyMuPDF
import requests


def extract_text_from_pdf(pdf_file):
    # Open the PDF file from the in-memory stream
    doc = fitz.open(stream=pdf_file, filetype="pdf")
    full_text = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict")["blocks"]

        # Use the page width to split the page into two columns
        page_width = page.rect.width
        mid_x = page_width / 2  # Middle of the page for splitting columns

        left_column = []
        right_column = []

        for block in blocks:
            if "bbox" in block:
                x0, y0, x1, y1 = block["bbox"]  # Block bounding box

                # Classify blocks into left or right columns.
                # Blocks that straddle the midline match neither test and are skipped.
                if x1 < mid_x:
                    left_column.append(block)
                elif x0 >= mid_x:
                    right_column.append(block)

        # Sort blocks by their vertical position (top) within each column
        left_column.sort(key=lambda b: b["bbox"][1])
        right_column.sort(key=lambda b: b["bbox"][1])

        # Extract text from each column, left column first
        page_text = []
        for column in [left_column, right_column]:
            for block in column:
                if "lines" in block:  # Image blocks have no "lines" key
                    block_text = ""
                    for line in block["lines"]:
                        for span in line["spans"]:
                            block_text += span["text"] + " "
                    page_text.append(block_text.strip())

        # Combine text from both columns into the page text
        full_text.append("\n".join(page_text))

    return "\n\n".join(full_text)


# Fetch the PDF from the URL
url = 'https://hal.science/hal-04206682/document'

try:
    response = requests.get(url)
    response.raise_for_status()  # Raise an error if the request failed
    pdf_file = io.BytesIO(response.content)  # Load the PDF content into memory

    # Extract text from the PDF
    text = extract_text_from_pdf(pdf_file)
    print(text)
except requests.exceptions.RequestException as e:
    print(f"Error downloading the PDF: {e}")
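
One possible starting point for the reference question, offered as a rough sketch rather than a tested solution for this particular PDF: when there is no heading to search for, the reference list can sometimes be recognised by the shape of its entries. The sketch below assumes the extractor is changed to keep a list of block strings instead of one joined string, and it assumes entries carry a four-digit year plus either a numeric marker such as "[12]" or "12." or an author name followed by initials. The patterns and the helper names `looks_like_reference` and `strip_trailing_references` are hypothetical and would need tuning to the actual citation style.

```python
import re

# Hypothetical patterns; tune them to the citation style of the target PDF.
MARKER_RE = re.compile(r"^\s*(\[\d{1,3}\]|\d{1,3}\.)\s+\S")      # "[12] ..." or "12. ..."
YEAR_RE = re.compile(r"\b(19|20)\d{2}[a-z]?\b")                   # 1999, 2015, 2020a
INITIALS_RE = re.compile(r"\b[A-Z][a-z]+,?\s+(?:[A-Z]\.\s?)+")    # "Smith, J. A."


def looks_like_reference(block_text):
    """Return True if a text block resembles a bibliography entry."""
    has_year = bool(YEAR_RE.search(block_text))
    has_marker = bool(MARKER_RE.match(block_text))
    has_initials = bool(INITIALS_RE.search(block_text))
    return has_year and (has_marker or has_initials)


def strip_trailing_references(blocks, max_gap=1):
    """Drop the run of reference-like blocks at the end of the document.

    Walks backward from the last block. Up to `max_gap` consecutive
    non-matching blocks are tolerated inside the run (for example a
    wrapped continuation of an entry); a longer gap ends the scan.
    """
    cut = len(blocks)
    gap = 0
    for i in range(len(blocks) - 1, -1, -1):
        if looks_like_reference(blocks[i]):
            cut = i
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                break
    return blocks[:cut]
```

Scanning from the end relies on the assumption that the references are the last thing in the document; appendices placed after the reference list would defeat it, and body paragraphs that happen to contain a year and an author with initials can be misclassified.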

A note unrelated to my question, but possibly useful to readers: the code correctly extracts the text in the order implied by the two-column PDF layout. However, it does not distinguish the "Figure" text from the body of the manuscript, so the "Figure" text ends up interleaved with the body text wherever a figure occurs. I do not know how to remove the "Figure" text.
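
For the "Figure" text, one simple thing to try is filtering out blocks whose text starts like a caption. This is only a sketch under assumptions: it expects a list of block strings (not the joined page text), it assumes captions begin with "Figure N:" or "Fig. N." at the start of a block, and `drop_figure_captions` is a hypothetical helper name. It will not catch text drawn inside the figure itself, such as axis labels.

```python
import re

# Matches a block that starts with "Figure 2:", "Figure 2.", "Fig. 3." or "Fig 3:".
# The punctuation after the number is required so that body sentences such as
# "Figure 2 shows that ..." are kept.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.?)\s*\d+\s*[.:]")


def drop_figure_captions(block_texts):
    """Return the blocks that do not start like a figure caption."""
    return [t for t in block_texts if not CAPTION_RE.match(t)]
```

Requiring the trailing "." or ":" is a deliberate trade-off: it misses captions typeset without punctuation, but it avoids deleting ordinary paragraphs that open with a figure mention.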

Read more here: https://stackoverflow.com/questions/79222751/detect-and-remove-references-from-a-pdf-when-the-section-title-references-or
