How to optimize web scraping an ASPX site with a dynamically generated pop-up


Anonymous


Posted by Anonymous » 30 Oct 2024, 13:10

I have an ASPX site with a form; when you fill it in, a pop-up opens with an HTML table that I want to scrape. The pop-up is generated dynamically, with a URL of the form www.xyz.com/something/something/"Temp"F ... lData.aspx, where "Temp" changes whenever you switch months in the form. (Site: https://www.osfi-bsif.gc.ca/en/data-for ... data-banks). I have code that can extract the data, but it is incredibly slow, and I have not been able to find any hidden API. I just found in the code that this is simply what happens:

//
This is what I have so far.
[code]
import time

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait


def extract_table_data(driver, date_label, bank_code):
    tables = driver.find_elements(By.CSS_SELECTOR, 'table.w100.borderspace-0.table.table-lined')
    # Check that at least two tables exist
    if len(tables) < 2:
        print("Not enough tables found on the page.")
        return pd.DataFrame()

    # The data of interest is in the second table
    target_table = tables[1]

    descriptions = ['(a) Federal and Provincial', '(b) Municipal or School Corporations',
                    '(c) Deposit-taking institutions', '(i) Tax sheltered', '(ii) Other', '(e) Other']
    data = []
    description_index = 0
    capture = False

    # Iterate over rows in the target table
    for row in target_table.find_elements(By.TAG_NAME, 'tr'):
        row_text = row.text.strip()  # Get row text once to minimize repeated calls

        # Start capturing when we reach the start of the section
        if "1. Demand and notice deposits" in row_text:
            capture = True

        if capture:
            cols = row.find_elements(By.TAG_NAME, 'td')
            if len(cols) > 1 and description_index < len(descriptions):
                # Append data for 'Total' column only (assuming it's in the second cell)
                data.append((date_label, descriptions[description_index], bank_code, cols[1].text.strip()))
                description_index += 1

            if description_index >= len(descriptions) or "2.  Fixed-term deposits" in row_text:
                break

    return pd.DataFrame(data, columns=['Date', 'Category', 'Bank Code', 'Total'])


# Main scraping function
def scrape_data(driver, bank_code, date_option):
    # Remember the main window up front so the finally block can always return to it
    main_window = driver.current_window_handle
    try:
        # Load website
        driver.get("https://ws1ext.osfi-bsif.gc.ca/WebApps/FINDAT/DTIBanks.aspx?T=0&LANG=E")
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, "DTIWebPartManager_gwpDTIBankControl1_DTIBankControl1_institutionTypeCriteria_type1RadioButton"))).click()

        # Select bank and date
        Select(driver.find_element(By.ID, "DTIWebPartManager_gwpDTIBankControl1_DTIBankControl1_institutionTypeCriteria_institutionsDropDownList")).select_by_value(bank_code)
        WebDriverWait(driver, 2).until(EC.presence_of_element_located((By.ID, "DTIWebPartManager_gwpDTIBankControl1_DTIBankControl1_dtiReportCriteria_monthlyRadioButton"))).click()
        date_dropdown = Select(WebDriverWait(driver, 2).until(EC.presence_of_element_located((By.ID, "DTIWebPartManager_gwpDTIBankControl1_DTIBankControl1_dtiReportCriteria_monthlyDatesDropDownList"))))
        date_dropdown.select_by_value(date_option)

        # Extract the formatted date for labeling
        date_label = date_dropdown.first_selected_option.text.split()
        date_label = f"{date_label[0][:3]}-{date_label[-1]}"

        # Submit form
        driver.find_element(By.ID, "DTIWebPartManager_gwpDTIBankControl1_DTIBankControl1_submitButton").click()
        time.sleep(2)

        # Switch to new window for data extraction
        for handle in driver.window_handles:
            if handle != main_window:
                driver.switch_to.window(handle)
                break

        # Wait for tables to load and capture data
        WebDriverWait(driver, 2).until(EC.presence_of_element_located((By.TAG_NAME, "table")))
        return extract_table_data(driver, date_label, bank_code)

    except Exception as e:
        print(f"Error processing bank {bank_code} and date {date_option}: {e}")
        return None

    finally:
        # Close the pop-up (never the main window) and switch back
        if driver.current_window_handle != main_window:
            driver.close()
        driver.switch_to.window(main_window)


def main():
    # generate_date_options and initialize_driver are helpers not shown in this post
    date_options = generate_date_options(2024, 8, 2024, 8)
    bank_codes = ["Z005"]

    final_data, final_summary = [], []
    driver = initialize_driver()

    for bank_code in bank_codes:
        for date_option in date_options:
            print(f"Processing bank: {bank_code}, date: {date_option}")
            df = scrape_data(driver, bank_code, date_option)
            # scrape_data returns None on error, so check for that before .empty
            if df is not None and not df.empty:
                # Accumulate data without immediately processing each DataFrame
                final_data.append(df)

    # Code for generating two tables, one detailed and one with only total
    if final_data:
        final_df = pd.concat(final_data, ignore_index=True)
        final_df['Total'] = pd.to_numeric(final_df['Total'].str.replace(',', ''), errors='coerce')

        summary_df = final_df.groupby(['Date', 'Bank Code', 'Category'])['Total'].sum().reset_index()
        summary_df.to_csv('demand_and_notice_deposits_summary.csv', index=False)
        final_df.to_csv('demand_and_notice_deposits_detailed.csv', index=False)
        print("Scraping complete!")
    else:
        print("No data was found for any of the banks or dates.")

    driver.quit()
    return final_data, final_summary


final_data, final_summary = main()
[/code]
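
One likely contributor to the slowness is that every `row.text` and `find_elements` call in `extract_table_data` is a separate round trip to the browser. A possible alternative, sketched here under assumptions, is to read the pop-up's HTML once (for example from `driver.page_source`) and parse it locally. The sketch uses only the standard library. `TableRowParser` and `extract_totals` are hypothetical names, the sample HTML is made up to mimic the layout the code above expects rather than copied from the real page, and for simplicity it scans all rows in the document instead of selecting the second table.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_totals(html, descriptions,
                   start_marker="1. Demand and notice deposits",
                   stop_marker="2. Fixed-term deposits"):
    """Return (description, second-cell text) pairs between the two markers."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()

    totals = []
    capture = False
    for cells in parser.rows:
        row_text = " ".join(cells)
        if start_marker in row_text:
            capture = True
        if not capture:
            continue
        if stop_marker in row_text:
            break
        if len(cells) > 1 and len(totals) < len(descriptions):
            totals.append((descriptions[len(totals)], cells[1]))
        if len(totals) >= len(descriptions):
            break
    return totals


# Made-up sample input, for illustration only
SAMPLE_HTML = """
<table>
  <tr><td>1. Demand and notice deposits</td></tr>
  <tr><td>(a) Federal and Provincial</td><td>1,234</td></tr>
  <tr><td>(b) Municipal or School Corporations</td><td>56</td></tr>
  <tr><td>2. Fixed-term deposits</td><td>999</td></tr>
</table>
"""
DESCRIPTIONS = ["(a) Federal and Provincial",
                "(b) Municipal or School Corporations",
                "(c) Deposit-taking institutions"]

totals = extract_totals(SAMPLE_HTML, DESCRIPTIONS)
```

In the Selenium code this would be called as `extract_totals(driver.page_source, descriptions)` after switching to the pop-up window, so the browser is queried once per page instead of once per row and cell.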

Can this be optimized in any way?
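
Another avenue that might avoid the browser entirely, assuming the form is a standard ASP.NET Web Forms postback (this has not been verified for this site): fetch the form page with a plain HTTP client, copy its hidden state fields (typically `__VIEWSTATE`, `__VIEWSTATEGENERATOR` and `__EVENTVALIDATION`) into a POST body together with the chosen bank and date, and send it back to the same URL. The sketch below covers only the offline part, collecting the hidden inputs and building the payload. `HiddenInputParser`, `build_postback_payload`, the sample HTML and the `bankDropDown` field name are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_payload(form_html, form_values):
    """Merge the page's hidden fields with the values the user would select."""
    parser = HiddenInputParser()
    parser.feed(form_html)
    parser.close()
    payload = dict(parser.fields)
    payload.update(form_values)
    return payload


# Made-up sample form, for illustration only
SAMPLE_FORM = """
<form method="post" action="./Form.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="visibleField" value="ignored" />
</form>
"""

payload = build_postback_payload(SAMPLE_FORM, {"bankDropDown": "Z005"})
body = urlencode(payload)  # form-encoded body for the POST request
```

The payload could then be sent with any HTTP client that keeps cookies between the GET and the POST. The real field names would have to be read from the `name` attributes of the form controls; ASP.NET separates name parts with `$`, unlike the `_`-separated IDs used in the Selenium code. Whether the response contains the table directly or only the address of the pop-up would also need to be checked against the actual site.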

More details here: https://stackoverflow.com/questions/791 ... ted-pop-up

