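I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:
19.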
esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
et affinium generi tui responsum fratri meo coram dedisse, non
possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
vel generum mihi per literas responsurum. Frater igitur dixit quidem
mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
non potui aliter interpretari quam ali fortassis aliquid monstri,
ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
difficultates sunt ortae, iampridem domino deque commendavi, qui
per Mosen. Mea est ultro et ego retribuam eis in tempore.
De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
The sentences I need to tag span several lines. The lines are marked with the line break tag "<lb/>". I need to somehow tag the sentences and then write them back to the file in their original format. The issue I encounter is that, while the text contains newline characters, as soon as I create a sentence element and try to attach it to a line break tag, the newline character isn't valid...
The output should look like:
19.
esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
et affinium generi tui responsum fratri meo coram dedisse, non
possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
vel generum mihi per literas responsurum. Frater igitur dixit quidem
mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
non potui aliter interpretari quam ali fortassis aliquid monstri,
ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
difficultates sunt ortae, iampridem domino deque commendavi, qui
per Mosen. Mea est ultro et ego retribuam eis in tempore.
De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
My code looks like:

import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')


def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree


def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())

    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get('xml:id', '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]

            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = " ".join(sentences).split("\n")
            print(sentences2)

            # Clear the paragraph's existing content
            paragraph.clear()

            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {'xml:id': lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={'xml:lang': 'la'})
                    sentence_elem.text = sentence

    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')

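
One detail in the code above may be worth checking, independent of the sentence problem: ElementTree keeps the text that follows an empty element such as the line break tag in that element's `.tail`, and it stores `xml:id` under a namespace-qualified key, so a lookup with the literal string `'xml:id'` finds nothing. A minimal sketch on a made-up two-line paragraph (the ids `l1` and `l2` are invented for illustration and are not taken from the real files):

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml" prefix always maps to.
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"

# Hypothetical paragraph; the ids are assumptions made for this example.
p = ET.fromstring(
    '<p><lb xml:id="l1"/>Res enim ista\n<lb xml:id="l2"/>ad me pertinebat.</p>'
)

lbs = p.findall("lb")
ids_literal = [lb.get("xml:id") for lb in lbs]  # literal key: not found
ids_qualified = [lb.get(XML_ID) for lb in lbs]  # namespace-qualified key: found
tails = [lb.tail for lb in lbs]                 # text that follows each <lb/>

print(ids_literal)
print(ids_qualified)
print(tails)
```

If that holds for the real files, the ids collected in `lb_ids` would all be empty strings, which would explain line break tags coming back without their ids.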
I'm losing my mind. Hopefully I have an XML pro who is willing to come to my rescue.
I've also tried first tagging, and then reinserting the line break tags afterwards, but due to the nature of XML it's tough. The next thing I would maybe attempt is to create temporary .txt files and go line by line and insert the tags on the lines that don't match...
Any and all help appreciated at this point.
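
A possible direction, as a minimal sketch rather than a definitive solution: since a sentence can span several lines, the line break tags have to end up inside the sentence elements instead of being paired with them one to one. The sketch below flattens a paragraph into one string while recording the character offset of each line break, splits that string into sentences, and rebuilds the paragraph with every line break re-inserted at its recorded offset. Assumptions: namespaces have already been stripped from tag names, each paragraph contains only text and direct line break children (any other child element would be lost), the names `naive_split`, `tag_sentences`, and `_append_text` are hypothetical, the ids in the sample are invented, and the regex splitter is only a stand-in for a real tokenizer.

```python
import re
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
XML_LANG = f"{{{XML_NS}}}lang"


def naive_split(text):
    """Stand-in splitter (assumption): break after . ! ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def _append_text(parent, chunk):
    """Append text after parent's last child if it has one, else to parent.text."""
    if not chunk:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + chunk
    else:
        parent.text = (parent.text or "") + chunk


def tag_sentences(p, split=naive_split, lang="la"):
    """Wrap each sentence of <p> in <s>, keeping every <lb/> at its position."""
    # 1. Flatten: one string plus the character offset of each <lb/>.
    text = p.text or ""
    breaks = []  # (offset into text, attributes of that <lb/>)
    for lb in p.findall("lb"):
        breaks.append((len(text), dict(lb.attrib)))
        text += lb.tail or ""

    # 2. Locate every sentence in the flattened string.
    spans, pos = [], 0
    for sent in split(text):
        start = text.index(sent, pos)  # ValueError if not a verbatim substring
        pos = start + len(sent)
        spans.append((start, pos))

    # 3. Empty the paragraph but keep its own attributes and tail.
    attrib, tail = dict(p.attrib), p.tail
    p.clear()
    p.attrib.update(attrib)
    p.tail = tail

    def fill(parent, start, end):
        """Copy text[start:end] into parent, re-inserting the <lb/>s in range."""
        cursor = start
        while breaks and breaks[0][0] < end:
            offset, lb_attrib = breaks.pop(0)
            _append_text(parent, text[cursor:offset])
            ET.SubElement(parent, "lb", lb_attrib)
            cursor = offset
        _append_text(parent, text[cursor:end])

    # 4. Rebuild: gaps stay at paragraph level, sentences go inside <s>.
    cursor = 0
    for start, end in spans:
        fill(p, cursor, start)
        s = ET.SubElement(p, "s", {XML_LANG: lang})
        fill(s, start, end)
        cursor = end
    fill(p, cursor, len(text) + 1)  # trailing text and any trailing <lb/>
    return p


# Hypothetical input built from two lines of the sample text.
p = ET.fromstring(
    '<p><lb xml:id="l1"/>qui hoc factum sit. Res enim ista ad\n'
    '<lb xml:id="l2"/>fratrem pertinebat.</p>'
)
tag_sentences(p)
print(ET.tostring(p, encoding="unicode"))
```

Passing `split=sent_tokenize` should work as a drop-in replacement as long as the tokenizer returns each sentence as a verbatim substring of its input; otherwise `text.index` raises `ValueError`. Because the newline characters stay inside the text and tail strings, the original line layout is preserved when the tree is written back out.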