Inserting XML tags into a specific part of a file without breaking the format (Python)


Posted by Guest


I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:

Code:





19.
esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
et affinium generi tui responsum fratri meo coram dedisse, non
possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
vel generum mihi per literas responsurum. Frater igitur dixit quidem
mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
non potui aliter interpretari quam ali fortassis aliquid monstri,
ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
difficultates sunt ortae, iampridem domino deque commendavi, qui
per Mosen. Mea est ultro et ego retribuam eis in tempore.
De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...





The sentences I need to tag span several lines. The lines are marked with the line break tag "<lb>". I need to tag the sentences somehow and then write them back to the file in their original format. The issue I run into is that, although the text contains newline characters, as soon as I create a sentence element and try to append it to the line break tag, the newline character isn't valid....
The output should look like:

Code:





19.
esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
et affinium generi tui responsum fratri meo coram dedisse, non
possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
vel generum mihi per literas responsurum. Frater igitur dixit quidem
mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
non potui aliter interpretari quam ali fortassis aliquid monstri,
ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
difficultates sunt ortae, iampridem domino deque commendavi, qui
per Mosen. Mea est ultro et ego retribuam eis in tempore.
De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...




My code looks like:

Code:

import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')

def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree

def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())

    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get('xml:id', '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]

            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = "  ".join(sentences).split("\n")
            print(sentences2)

            # Clear the paragraph's existing content
            paragraph.clear()

            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {'xml:id': lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={'xml:lang': 'la'})
                    sentence_elem.text = sentence

    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')

I'm losing my mind. Hopefully I have an XML pro who is willing to come to my rescue.
I've also tried first tagging, and then reinserting the line break tags afterwards, but due to the nature of XML it's tough. The next thing I would maybe attempt is to create temporary .txt files and go line by line and insert the tags on the lines that don't match...
Any and all help appreciated at this point.
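
For illustration, here is a minimal sketch of one possible way to wrap sentences without disturbing the line breaks. It is not a tested solution for the real files. It rests on several assumptions: the line breaks are empty <lb/> elements that are direct children of <p>, namespace prefixes have already been removed from the tags, and whitespace follows the sentence-final punctuation inside the same text node. The sample markup, the helper names append_text and tag_sentences, and the naive regex splitter (used here in place of sent_tokenize so the sketch is self-contained) are all hypothetical. Note that ElementTree exposes xml:id and xml:lang under the key {http://www.w3.org/XML/1998/namespace}id (or lang) after parsing, so the sketch uses that form.

```python
import re
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
# Naive boundary: whitespace that follows '.', '!' or '?'. This stands in
# for sent_tokenize and will mis-split numbers and abbreviations like "19.".
SENT_END = re.compile(r"(?<=[.!?])(\s+)")


def append_text(parent, text):
    """Append text at the current end of parent's mixed content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def tag_sentences(p, lang="la"):
    """Wrap the sentences of paragraph p in <s>, keeping <lb/> and whitespace."""
    # Flatten the mixed content into alternating strings and child elements.
    parts = [p.text or ""]
    for child in list(p):
        parts += [child, child.tail or ""]
        child.tail = None
        p.remove(child)
    p.text = None

    current = None  # the <s> element that is currently open, if any
    for part in parts:
        if not isinstance(part, str):
            # A child element such as <lb/>: keep it inside the open sentence.
            (p if current is None else current).append(part)
            continue
        for i, piece in enumerate(SENT_END.split(part)):
            if i % 2:  # captured whitespace after sentence-final punctuation
                current = None
                append_text(p, piece)
                continue
            if current is None:
                # Leading whitespace stays outside the new <s> element.
                body = piece.lstrip()
                append_text(p, piece[: len(piece) - len(body)])
                if not body:
                    continue
                current = ET.SubElement(p, "s", {f"{{{XML_NS}}}lang": lang})
                piece = body
            append_text(current, piece)
    return p


# Hypothetical sample; the real files may be structured differently.
sample = (
    '<p>\n'
    '<lb xml:id="l1"/>possum satis mirari, qui hoc factum sit. Res enim ista ad\n'
    '<lb xml:id="l2"/>fratrem pertinebat. Nec ita fueram abs te dimissus.\n'
    '</p>'
)
p = ET.fromstring(sample)
original_text = "".join(p.itertext())
tag_sentences(p)
print(ET.tostring(p, encoding="unicode"))
```

Because the text is never joined and re-split, every newline stays where it was: a sentence that runs over a line break simply ends up with the <lb/> element nested inside its <s> element. Swapping the regex for a real tokenizer would require mapping the tokenizer's sentence boundaries back to character offsets in the individual text nodes.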


Source: https://stackoverflow.com/questions/781 ... ing-format