Getting a "not well-formed (invalid token)" error when parsing a returned XML file using requ…


Post by Anonymous »

I am using the code below to fetch and parse data from a website. The error seems to indicate that the returned XML is malformed. For example, I ran the response through an XML validator, and it reported the following problem in the returned XML:
Error : InvalidTag
Line : 50
Message : Closing tag 'p' is expected inplace of 'P'.

I am new to XML. How can I deal with formatting errors in the document returned by the server?
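The validator message is consistent with a basic rule of XML: tag names are case-sensitive, so an opening `<p>` cannot be closed by `</P>`, even though HTML accepts that. The following is a minimal sketch that reproduces this class of failure with `xml.etree.ElementTree`. The fragment and the `try_parse` helper are hypothetical and written only for illustration; they are not taken from the actual response.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: opened with <p>, closed with </P>.
# HTML treats these as the same element; XML does not.
snippet = "<root><p>Check this box</P></root>"


def try_parse(text):
    """Return (True, root element) on success or (False, error message) on failure."""
    try:
        return True, ET.fromstring(text)
    except ET.ParseError as exc:
        return False, str(exc)


ok, detail = try_parse(snippet)
print(ok, detail)  # parsing fails; detail holds the parser's message

# The same fragment with matching case parses without complaint.
fixed_ok, fixed_root = try_parse("<root><p>Check this box</p></root>")
print(fixed_ok, fixed_root.find("p").text)
```

If many tags in a document fail this way, the document is probably HTML rather than XML, and repairing it tag by tag is unlikely to be worthwhile.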
import pandas as pd
import requests
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET

cik = "0000320193"

BASE_URL = "https://data.sec.gov"
USER_AGENT = "alias (alias199@gmail.com)"
ACC_ENCODING = "gzip, deflate"
HOST_NAME = "www.sec.gov"

headers = {
    "User-Agent": USER_AGENT,
    "Accept-Encoding": ACC_ENCODING,
    "Host": HOST_NAME
}

filing_data = []

test = pd.DataFrame(
    {'accessionNumber': ['0000320193-24-000132', '0000320193-24-000130', '0000320193-24-000129', '0000320193-24-000126', '0000320193-24-000116'],
     'filingDate': ['2024-12-18', '2024-11-19', '2024-11-19', '2024-11-07', '2024-10-17'],
     'form': ['4', '4', '4', '4', '4'],
     'primaryDocument': ['xslF345X05/wk-form4_1734564614.xml', 'xslF345X05/wk-form4_1732059096.xml', 'xslF345X05/wk-form4_1732059042.xml', 'xslF345X05/wk-form4_1731022209.xml', 'xslF345X05/wk-form4_1729204211.xml']
    }
)

for i, filing in test.iterrows():

    filing_url = f"{BASE_URL}/Archives/edgar/data/{int(cik)}/{filing['accessionNumber'].replace('-', '')}/{filing['primaryDocument']}"
    try:
        response = requests.get(filing_url, headers=headers)

        if response.status_code == 200:
            try:
                # Clean up the response content before parsing
                content = response.content.decode('utf-8', errors='ignore')

                # Fix malformed XML by identifying unclosed tags and repairing them
                try:
                    root = ET.fromstring(content)
                except ET.ParseError:
                    # Attempt to auto-correct malformed XML
                    if not content.strip().startswith(""):
                        content = f"{content}"
                    root = ET.fromstring(content)

                filing_info = {
                    "accessionNumber": filing["accessionNumber"],
                    "filingDate": filing["filingDate"],
                    "form": filing["form"],
                    "content": {},
                }

                for child in root.iter():
                    filing_info["content"][child.tag] = child.text

                filing_data.append(filing_info)
            except ET.ParseError as e:
                print(f"Error parsing XML for filing: {filing_url}. Error: {e}")
        else:
            print(f"Failed to retrieve filing document: {filing_url}. HTTP Status: {response.status_code}")
    except Exception as e:
        print(f"Error retrieving filing document: {filing_url}. Error: {e}")
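If the response is really an HTML page, which the 'p' versus 'P' validator message hints at, a strict XML parser will keep rejecting it no matter how the string is patched. One option is to read it with a tolerant HTML parser instead. The sketch below uses only the standard library's `html.parser`; the `TextCollector` class, the `html_to_text_chunks` helper, and the `sample` markup are hypothetical names and data introduced for illustration, and the approach recovers visible text only, not the structured fields of the filing.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks, skipping the bodies of <style> and <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> look the same here.
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text_chunks(markup):
    """Parse HTML leniently and return its non-empty text chunks in document order."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


# Hypothetical markup with mixed-case tags and an HTML-only entity.
sample = (
    "<html><head><style>.FormData {color: blue;}</style></head>"
    "<body><P>FORM 4</p><td>Apple Inc.&nbsp;</td></body></html>"
)
print(html_to_text_chunks(sample))
```

In the loop above, this would be applied to the decoded `content` in place of `ET.fromstring`, at the cost of losing element names as dictionary keys.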

Sample of the returned XML below:
print(response.content)
b'\n\n\n\nSEC FORM \n 4\n\n .FormData {color: blue; background-color: white; font-size: small; font-family: Times, serif;}\n .FormDataC {color: blue; background-color: white; font-size: small; font-family: Times, serif; text-align: center;}\n .FormDataR {color: blue; background-color: white; font-size: small; font-family: Times, serif; text-align: right;}\n .SmallFormData {color: blue; background-color: white; font-size: x-small; font-family: Times, serif;}\n .FootnoteData {color: green; background-color: white; font-size: x-small; font-family: Times, serif;}\n .FormNumText {font-size: small; font-weight: bold; font-family: arial, helvetica, sans-serif;}\n .FormAttention {font-size: medium; font-weight: bold; font-family: helvetica;}\n .FormText {font-size: small; font-weight: normal; font-family: arial, helvetica, sans-serif; text-align: left;}\n .FormTextR {font-size: small; font-weight: normal; font-family: arial, helvetica, sans-serif; text-align: right;}\n .FormTextC {font-size: small; font-weight: normal; font-family: arial, helvetica, sans-serif; text-align: center;}\n .FormEMText {font-size: medium; font-style: italic; font-weight: normal; font-family: arial, helvetica, sans-serif;}\n .FormULText {font-size: medium; text-decoration: underline; font-weight: normal; font-family: arial, helvetica, sans-serif;}\n .SmallFormText {font-size: xx-small; font-family: arial, helvetica, sans-serif; text-align: left;}\n .SmallFormTextR {font-size: xx-small; font-family: arial, helvetica, sans-serif; text-align: right;}\n .SmallFormTextC {font-size: xx-small; font-family: arial, helvetica, sans-serif; text-align: center;}\n .MedSmallFormText {font-size: x-small; font-family: arial, helvetica, sans-serif; text-align: left;}\n .FormTitle {font-size: medium; font-family: arial, helvetica, sans-serif; font-weight: bold;}\n .FormTitle1 {font-size: small; font-family: arial, helvetica, sans-serif; font-weight: bold; border-top: black thick solid;}\n .FormTitle2 {font-size: 
small; font-family: arial, helvetica, sans-serif; font-weight: bold;}\n .FormTitle3 {font-size: small; font-family: arial, helvetica, sans-serif; font-weight: bold; padding-top: 2em; padding-bottom: 1em;}\n .SectionTitle {font-size: small; text-align: left; font-family: arial, helvetica, sans-serif; \n \t\tfont-weight: bold; border-top: gray thin solid; border-bottom: gray thin solid;}\n .FormName {font-size: large; font-family: arial, helvetica, sans-serif; font-weight: bold;}\n .CheckBox {text-align: center; width: 5px; cell-spacing: 0; padding: 0 3 0 3; border-width: thin; border-style: solid; border-color: black:}\n body {background: white;}\n \n\nSEC Form 4 \n \n\nFORM 4\n\nUNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

STATEMENT OF CHANGES IN BENEFICIAL OWNERSHIP

Filed pursuant to Section 16(a) of the Securities Exchange Act of 1934
or Section 30(h) of the Investment Company Act of 1940\n\n\nOMB APPROVAL\n\n\nOMB Number:\n3235-0287\n\nEstimated average burden\n\nhours per response:\n0.5\n\n\n\n\n\n\xc2\xa0\xc2\xa0\nCheck this box if no longer subject to Section 16. Form 4 or Form 5 obligations may continue. \n See\n\n Instruction 1(b).\n\n\n\xc2\xa0\xc2\xa0\nCheck this box to indicate that a transaction was made pursuant to a contract, instruction or written plan for the purchase or sale of equity securities of the issuer that is intended to satisfy the affirmative defense conditions of Rule 10b5-1(c). See Instruction 10. \n\n\n\n\n\n1. Name and Address of Reporting Person*KONDO CHRIS\n\n\n(Last)\n(First)\n(Middle)\n\n\nONE APPLE PARK WAY\n\n\n\n(Street)\nCUPERTINO\nCA\n95014\n\n\n\n(City)\n(State)\n(Zip)\n\n\n\n2. Issuer Name and
Ticker or Trading Symbol\n Apple Inc.\n [ AAPL ]\n \n\n5. Relationship of Reporting Person(s) to Issuer\n
(Check all applicable)\n\n\nDirector\n\n10% Owner\n\n\nX\nOfficer (give title below)\n\nOther (specify below)\n\n\n\nPrincipal Accounting Officer\n\n\n\n\n\n\n\n3. Date of Earliest Transaction\n (Month/Day/Year)
10/15/2024\n\n\n\n4. If Amendment, Date of Original Filed\n (Month/Day/Year)
\n\n\n6. Individual or Joint/Group Filing (Check Applicable Line)\n \n\nX\nForm filed by One Reporting Person\n\n\n\nForm filed by More than One Reporting Person\n\n\n\n\n\n\n\nTable I - Non-Derivative Securities Acquired, Disposed of, or Beneficially Owned
\n\n1. Title of Security (Instr. \n 3)\n \n2. Transaction Date\n (Month/Day/Year)\n2A. Deemed Execution Date, if any\n (Month/Day/Year)\n3. Transaction Code (Instr. \n 8)\n \n4. Securities Acquired (A) or Disposed Of (D) (Instr. \n 3, 4 and 5)\n \n5. \n Amount of Securities Beneficially Owned Following Reported Transaction(s) (Instr. \n 3 and 4)\n \n6. Ownership Form: Direct (D) or Indirect (I) (Instr. \n 4)\n \n7. Nature of Indirect Beneficial Ownership (Instr. \n 4)\n \n\n\nCode\nV\nAmount\n(A) or (D)\nPrice\n\n\n\n\nCommon Stock\n10/15/2024\n\nM\n\n8,115\nA\n(1)\n23,534\nD\n\n\n\n\nCommon Stock(2)\n\n10/15/2024\n\nF\n\n3,985\nD\n\n$233.85\n\n19,549\nD\n\n\n\n\n\n\n\nTable II - Derivative Securities Acquired, Disposed of, or Beneficially Owned(e.g., puts, calls, warrants, options, convertible securities)\n\n\n1. Title of Derivative Security (Instr. \n 3)\n \n2. Conversion or Exercise Price of Derivative Security\n \n3. Transaction Date\n (Month/Day/Year)\n3A. Deemed Execution Date, if any\n (Month/Day/Year)\n4. Transaction Code (Instr. \n 8)\n \n5. \n Number of Derivative Securities Acquired (A) or Disposed of (D) (Instr. \n 3, 4 and 5)\n \n6. Date Exercisable and Expiration Date \n (Month/Day/Year)\n7. Title and Amount of Securities Underlying Derivative Security (Instr. \n 3 and 4)\n \n8. Price of Derivative Security (Instr. \n 5)\n \n9. \n Number of derivative Securities Beneficially Owned Following Reported Transaction(s) (Instr. \n 4)\n \n10. Ownership Form: Direct (D) or Indirect (I) (Instr. \n 4)\n \n11. Nature of Indirect Beneficial Ownership (Instr. \n 4)\n \n\n\nCode\nV\n(A)\n(D)\nDate Exercisable\nExpiration Date\nTitle\n
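The stylesheet rules and form labels in this dump suggest that the URL returns a rendered HTML view of the form rather than the underlying XML. A possible explanation, stated here as an assumption and not confirmed by the post, is that the `xslF345X05/` directory in `primaryDocument` selects a styled rendering, and that the raw XML sits at the same path without that directory. The sketch below only builds the alternative URL and makes no network request; `strip_xsl_prefix` and `build_filing_url` are hypothetical helper names, and the `https://www.sec.gov` base is likewise an assumption.

```python
def strip_xsl_prefix(primary_document):
    """Drop a leading 'xsl.../' directory from a primaryDocument path, if present."""
    head, _, tail = primary_document.partition("/")
    if tail and head.lower().startswith("xsl"):
        return tail
    return primary_document


def build_filing_url(base_url, cik, accession_number, primary_document):
    """Build the archive URL for a filing document, without any XSL directory."""
    accession = accession_number.replace("-", "")
    document = strip_xsl_prefix(primary_document)
    return f"{base_url}/Archives/edgar/data/{int(cik)}/{accession}/{document}"


# Values copied from the last row of the question's DataFrame.
url = build_filing_url(
    "https://www.sec.gov",  # assumed host for the /Archives/ path
    "0000320193",
    "0000320193-24-000116",
    "xslF345X05/wk-form4_1729204211.xml",
)
print(url)
```

If the assumption holds, fetching this URL should yield a document that `ET.fromstring` accepts; that is worth checking on a single filing before changing the whole loop.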

More details here: https://stackoverflow.com/questions/793 ... ned-xml-fi