How to extract text from HTML tables while preserving its form? BS4/Python
I am dealing with very complex table elements in HTML. There are multiple colspans that act as titles/sub-titles, and the 'td' cells (columns) and rows sometimes do not line up. How do I decompose the table to pure text while still keeping it readable? Is there a module/lib that can do the work for me?
Here's an example of the HTML table, but it could get more complex.
Here's the code I tried on the table. It is very ugly, yes, but it sometimes works.
    # `soup` is a BeautifulSoup object parsed earlier.
    output_html_content = ""

    for table in soup.find_all('table'):
        sections = []
        current_section_header = None
        current_section_rows = []

        # A row that contains a colspan cell starts a new section.
        for row in table.find_all('tr'):
            colspan_cell = row.find(['td', 'th'], colspan=True)
            if colspan_cell:
                if current_section_header:
                    sections.append((current_section_header, current_section_rows))
                current_section_header = colspan_cell
                current_section_rows = []
            else:
                current_section_rows.append(row)
        if current_section_header:
            sections.append((current_section_header, current_section_rows))

        for header, rows in sections:
            output_html_content += header.get_text(strip=True) + "\n"
            if len(rows) == 1:
                # A single row: print its cells on one line.
                for cell in rows[0].find_all(['td', 'th']):
                    output_html_content += cell.get_text(strip=True) + ' '
            elif len(rows) > 1:
                # First row gives the column names, the rest are data rows.
                headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['td', 'th'])]
                for row in rows[1:]:
                    cells = row.find_all(['td', 'th'])
                    for column_name, cell in zip(headers, cells):
                        output_html_content += f"{column_name}: {cell.get_text(strip=True)}\n"
            output_html_content += "\n"

I tried to treat each colspan cell as a header/title. Then I match the text of the following rows against the first row after the colspan.
But when the data gets really complex, this doesn't work, because the rows and columns don't always match. The p tags inside the cells may also need to be read from top to bottom.
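One way to make mismatched rows and columns line up is to expand every colspan/rowspan into a rectangular grid first, and only then print it. The following is a minimal sketch of that idea, not a complete solution. The names `expand_spans`, `cells_from_table` and `render` are hypothetical helpers, `cells_from_table` assumes it receives a BS4 `<table>` Tag, and nested tables and `rowspan="0"` are not handled.

```python
def expand_spans(rows):
    """Expand rows of (text, colspan, rowspan) cells into a rectangular grid."""
    grid = {}  # (row_index, col_index) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Repeat the cell text in every slot the cell spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def _span(cell, name):
    """Read a span attribute, falling back to 1 for missing or bad values."""
    try:
        return max(1, int(cell.get(name, 1)))
    except (TypeError, ValueError):
        return 1


def cells_from_table(table):
    """Collect (text, colspan, rowspan) tuples from a BeautifulSoup <table> Tag."""
    rows = []
    for tr in table.find_all("tr"):
        row = []
        for cell in tr.find_all(["td", "th"], recursive=False):
            text = cell.get_text(" ", strip=True)
            row.append((text, _span(cell, "colspan"), _span(cell, "rowspan")))
        rows.append(row)
    return rows


def render(grid):
    """Format the grid as aligned plain text, one line per table row."""
    if not grid:
        return ""
    widths = [max(len(row[c]) for row in grid) for c in range(len(grid[0]))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in grid
    )


# Hand-written sample: a colspan title row and a rowspan label cell.
sample = [
    [("Specs", 2, 1)],
    [("Weight", 1, 1), ("2 kg", 1, 1)],
    [("Size", 1, 2), ("10 cm", 1, 1)],
    [("20 cm", 1, 1)],
]
grid = expand_spans(sample)
print(render(grid))
```

With a parsed document, the same functions would be called as `render(expand_spans(cells_from_table(table)))` for each `table` in `soup.find_all('table')`. Repeating the text of a spanning cell in every slot it covers is a design choice that keeps each output line readable on its own; blanking the repeats is an equally valid alternative.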
Source: https://stackoverflow.com/questions/781 ... s-form-bs4