Inserting XML tags at a specific part of a file without disrupting the format

Posted by Guest » 09 Mar 2024, 16:50

I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:






19.
esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
et affinium generi tui responsum fratri meo coram dedisse, non
possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
vel generum mihi per literas responsurum. Frater igitur dixit quidem
mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
non potui aliter interpretari quam ali fortassis aliquid monstri,
ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
difficultates sunt ortae, iampridem domino deque commendavi, qui
per Mosen. Mea est ultro et ego retribuam eis in tempore.
De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...

The sentences I need to tag span several lines. The lines are marked with the line break tag "<lb/>". I need to somehow tag the sentences and then write them back to the file in their original format. The issue I encounter is that, while the text contains newline characters, as soon as I create a sentence element and try to attach it to the line break tag, the newline character isn't valid.
The output should look like:






19.
esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
et affinium generi tui responsum fratri meo coram dedisse, non
possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
vel generum mihi per literas responsurum. Frater igitur dixit quidem
mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
non potui aliter interpretari quam ali fortassis aliquid monstri,
ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
difficultates sunt ortae, iampridem domino deque commendavi, qui
per Mosen. Mea est ultro et ego retribuam eis in tempore.
De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
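For context on why the line handling is awkward: ElementTree keeps the text that follows an empty element such as <lb/> in that element's `tail`, newline included, rather than in the parent's `text`. A minimal sketch of that layout, assuming a hypothetical two-line fragment (the `xml:id` values and the helper name `line_texts` are made up for illustration, and the real files may carry namespaces):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real files' attributes and namespaces may differ.
SAMPLE = (
    '<p><lb xml:id="l1"/>Prima sententia incipit\n'
    '<lb xml:id="l2"/>et hic finitur. Secunda sequitur.\n</p>'
)

def line_texts(paragraph):
    """Return the text that follows each <lb/> inside the paragraph."""
    return [lb.tail or "" for lb in paragraph.iter("lb")]

p = ET.fromstring(SAMPLE)
lines = line_texts(p)   # one string per line, each still ending in "\n"
```

Each entry is one physical line of the letter, so a sentence that runs over a line break is split across two tails.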

My code looks like:


import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')

# ElementTree stores xml:id and xml:lang under the XML namespace (Clark notation)
XML_NS = 'http://www.w3.org/XML/1998/namespace'
XML_ID = '{' + XML_NS + '}id'
XML_LANG = '{' + XML_NS + '}lang'

def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree

def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())

    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get(XML_ID, '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]

            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = "  ".join(sentences).split("\n")
            print(sentences2)

            # Clear the paragraph's existing content
            # (clear() also drops the paragraph's own attributes and text)
            paragraph.clear()

            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            # (this pairs the i-th lb with the i-th sentence, not with its original line)
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {XML_ID: lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={XML_LANG: 'la'})
                    sentence_elem.text = sentence

    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')
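A note on the `xml:id` attribute in code like this: ElementTree reports attributes in the reserved XML namespace in Clark notation, so a lookup with the literal key 'xml:id' finds nothing on a parsed element. A minimal sketch, using a hypothetical one-element fragment (the constant names are only illustrative):

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = "{" + XML_NS + "}id"

# Hypothetical element; the id value is made up.
lb = ET.fromstring('<lb xml:id="l1"/>')

plain = lb.get("xml:id")     # literal prefix: not found on a parsed element
qualified = lb.get(XML_ID)   # Clark notation: returns the id

# On output, ElementTree maps the namespace back to the xml: prefix.
roundtrip = ET.tostring(lb, encoding="unicode")
```

Reading and writing the attribute through the same qualified key keeps the ids intact when the elements are rebuilt.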

I'm losing my mind. Hopefully I have an XML pro who is willing to come to my rescue.
I've also tried first tagging, and then reinserting the line break tags afterwards, but due to the nature of XML it's tough. The next thing I would maybe attempt is to create temporary .txt files and go line by line and insert the tags on the lines that don't match...
Any and all help appreciated at this point.
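One possible direction, offered only as a sketch and not as a confirmed solution: flatten the paragraph into a single string while remembering the character offset of every <lb/>, find sentence spans in that string, and then rebuild the paragraph so that each sentence is wrapped in <s> and every <lb/> is put back at its recorded offset. The assumptions are loud ones: every child of <p> is an empty <lb/>, the fragment is hypothetical, the helper names (`sentence_spans`, `flatten`, `add_text`, `tag_sentences`) are made up, and the regex splitter is a crude stand-in for a real tokenizer. NLTK's Punkt tokenizer has a `span_tokenize` method that could likely take its place.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment; real files may carry namespaces and other elements.
SAMPLE = (
    '<p><lb xml:id="l1"/>Prima sententia incipit\n'
    '<lb xml:id="l2"/>et hic finitur. Secunda sequitur.\n</p>'
)

def sentence_spans(text):
    """Stand-in splitter: (start, end) of each run that ends in . ! or ?"""
    return [m.span() for m in re.finditer(r"\S[^.!?]*[.!?]*", text)]

def flatten(p):
    """Return the paragraph text plus (offset, element) for every <lb/>."""
    text = p.text or ""
    breaks = []
    for lb in list(p):  # assumes every child of <p> is an empty <lb/>
        breaks.append((len(text), lb))
        text += lb.tail or ""
    return text, breaks

def add_text(parent, chunk):
    """Append chunk after the last child of parent, or to parent.text."""
    if not chunk:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + chunk
    else:
        parent.text = (parent.text or "") + chunk

def tag_sentences(p, lang="la"):
    """Wrap each sentence in <s>, keeping every <lb/> at its text offset."""
    text, breaks = flatten(p)
    for lb in list(p):
        p.remove(lb)
        lb.tail = None
    p.text = None

    def emit(parent, start, end):
        # Copy text[start:end] into parent, re-inserting <lb/> on the way.
        pos = start
        while breaks and breaks[0][0] < end:
            offset, lb = breaks.pop(0)
            add_text(parent, text[pos:offset])
            parent.append(lb)
            pos = offset
        add_text(parent, text[pos:end])

    pos = 0
    for start, end in sentence_spans(text):
        emit(p, pos, start)  # whitespace between sentences stays in <p>
        emit(ET.SubElement(p, "s", {XML_LANG: lang}), start, end)
        pos = end
    emit(p, pos, len(text))
    for _, lb in breaks:  # an <lb/> that sits after the last character
        p.append(lb)

p = ET.fromstring(SAMPLE)
original_text = "".join(p.itertext())
tag_sentences(p)
result = ET.tostring(p, encoding="unicode")
```

Because the original <lb/> elements are moved rather than recreated, their attributes survive, and the newline characters stay where they were in the text, so the line layout of the file is not rewritten.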

Source: https://stackoverflow.com/questions/78130662/inserting-xml-tags-at-specific-part-of-file-without-disrupting-format
