How do I extract text from tables in HTML while still preserving its form? BS4/Python


Post by Guest » 07 Mar 2024, 13:45

I am dealing with very complex table elements in HTML. There are multiple colspan cells that act as titles/subtitles, and the 'td' cells (columns) and the rows sometimes do not line up. How do I decompose the table into plain text while still keeping it readable? Is there a module/library that can do this for me?

Here's an example of the HTML table, but it could get more complex:

Here's the code I tried on the table. It is very ugly, yes, but it sometimes works.

    # Assumes `soup` (a BeautifulSoup object) and `output_html_content` (a str)
    # are defined earlier.
    for table in soup.find_all('table'):
        try:
            sections = []
            current_section_header = None
            current_section_rows = []

            # A row that contains a colspan cell starts a new section.
            for row in table.find_all('tr'):
                colspan_cell = row.find(['td', 'th'], colspan=True)
                if colspan_cell:
                    if current_section_header:
                        sections.append((current_section_header, current_section_rows))
                    current_section_header = colspan_cell
                    current_section_rows = []
                else:
                    current_section_rows.append(row)

            if current_section_header:
                sections.append((current_section_header, current_section_rows))

            for header, rows in sections:
                output_html_content += header.get_text(strip=True) + "\n"
                if len(rows) == 1:
                    # A single row: print its cells on one line.
                    for cell in rows[0].find_all(['td', 'th']):
                        output_html_content += cell.get_text(strip=True) + ' '
                elif rows:
                    # First row gives the column names, the rest are data rows.
                    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['td', 'th'])]
                    for row in rows[1:]:
                        cells = row.find_all(['td', 'th'])
                        for col_name, cell in zip(headers, cells):
                            output_html_content += f"{col_name}: {cell.get_text(strip=True)}\n"
                output_html_content += "\n"
        except Exception as exc:  # handler is not shown in the post; assumed here
            print(f"Skipping table: {exc}")

I tried to treat each colspan cell as a header/title, and then match the text against the first row after the colspan.

But when the data gets really complex, this doesn't work, because the rows and columns don't always match. The p tags inside a cell can also read from top to bottom.
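One possible way to deal with rows and columns that don't match is to first expand every colspan/rowspan into a rectangular grid, so that each row has exactly one entry per visual column, and only then turn the grid into text. The snippet below is a minimal sketch of that idea, not a definitive answer: `table_to_grid`, `grid_to_text`, `_span` and `sample_html` are hypothetical names made up for illustration, the sample table is invented, and it assumes there are no nested tables. Text from several p tags in one cell is joined with a space via `get_text(" ", strip=True)`.

```python
from bs4 import BeautifulSoup


def _span(cell, name):
    """Read a colspan/rowspan attribute; fall back to 1 if missing or invalid."""
    try:
        return max(int(cell.get(name, 1)), 1)
    except (TypeError, ValueError):
        return 1


def table_to_grid(table):
    """Expand colspan/rowspan so every row has one string per visual column."""
    grid = []
    carry = {}  # column index -> (rows still to fill below, text) for rowspans
    for tr in table.find_all("tr"):  # assumption: no nested tables
        row, col = [], 0
        cells = iter(tr.find_all(["td", "th"]))
        cell = next(cells, None)
        while cell is not None or any(c >= col for c in carry):
            if col in carry:
                # This slot is covered by a rowspan cell from an earlier row.
                remaining, text = carry.pop(col)
                row.append(text)
                if remaining > 1:
                    carry[col] = (remaining - 1, text)
                col += 1
            elif cell is None:
                # No cell left for this slot, but a rowspan sits further right.
                row.append("")
                col += 1
            else:
                text = cell.get_text(" ", strip=True)
                colspan, rowspan = _span(cell, "colspan"), _span(cell, "rowspan")
                for _ in range(colspan):
                    carry.pop(col, None)  # a new cell overrides a stale rowspan
                    row.append(text)
                    if rowspan > 1:
                        carry[col] = (rowspan - 1, text)
                    col += 1
                cell = next(cells, None)
        grid.append(row)
    # Pad short rows so the grid is rectangular.
    width = max((len(r) for r in grid), default=0)
    return [r + [""] * (width - len(r)) for r in grid]


def grid_to_text(grid):
    """Render the grid as plain text with columns padded to equal width."""
    if not grid:
        return ""
    widths = [max(len(row[i]) for row in grid) for i in range(len(grid[0]))]
    lines = [
        " | ".join(text.ljust(w) for text, w in zip(row, widths)).rstrip()
        for row in grid
    ]
    return "\n".join(lines)


# Invented sample table, for illustration only.
sample_html = """
<table>
  <tr><th colspan="3">Sales</th></tr>
  <tr><th rowspan="2">Region</th><th>Q1</th><th>Q2</th></tr>
  <tr><td>10</td><td><p>20</p><p>est.</p></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
grid = table_to_grid(soup.table)
print(grid_to_text(grid))
```

Spanned text is repeated in every slot it covers, so each column stays self-describing. If that looks too noisy, the repeats could be blanked out when rendering instead.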


Source: https://stackoverflow.com/questions/78120642/how-do-i-extract-texts-from-tables-in-html-while-still-maintaining-its-form-bs4
