BeautifulSoup4 and Pandas return empty DataFrame columns: update: now using Selenium in Google Colab


Post by Guest »


I am looking for a public list of the world's banks.
I don't need branches and full addresses, only the name and the website. I am thinking of data in XML or CSV with the following fields:
bank name, country name or country code (two-letter ISO), website; optional: the city of the bank's headquarters. For each bank, one record per country of presence. By the way: small banks are especially interesting.
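For illustration only, roughly the kind of record layout I have in mind - the rows below are made-up placeholders, not real data:

Code: Select all

import pandas as pd

# purely illustrative placeholder rows showing the intended schema:
# one record per bank per country of presence
sample = pd.DataFrame([
    {"Bank Name": "Example Bank AG", "Country": "CH", "Website": "https://www.example-bank.example", "HQ City": "Zurich"},
    {"Bank Name": "Example Bank AG", "Country": "LI", "Website": "https://www.example-bank.example", "HQ City": "Vaduz"},
])
sample.to_csv("banks_sample.csv", index=False)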
I found a great page that is very, very informative - see - it lists 9,000 banks in Europe:
browse from A to Z:
https://thebanks.eu/search

Code: Select all

**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla

**B**
https://thebanks.eu/search?bank=&country=Belgium

**U**
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom
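These per-country search URLs do not have to be listed by hand; a small sketch that builds them, assuming only that the country parameter is URL-encoded the same way as in the links above:

Code: Select all

from urllib.parse import urlencode

countries = ["Albania", "Andorra", "Anguilla", "Belgium", "Ukraine", "United Kingdom"]

# build the same search URLs as above for each country name
search_urls = ["https://thebanks.eu/search?" + urlencode({"bank": "", "country": c}) for c in countries]
print(search_urls[:3])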
See a detailed page: https://thebanks.eu/banks/9563
I need to have this data:
Contacts

Code: Select all

Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51
071 944 27 52
https://www.bankbiz.ch/
Approach: my approach is to use bs4, requests and pandas.
Btw: perhaps we can count from zero to 100,000 in order to get all the banks that are stored in the db:
See a detailed page: https://thebanks.eu/banks/9563
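A rough sketch of that ID-counting idea - assuming the detail pages follow the /banks/<id> pattern and that missing IDs come back with a non-200 status; untested:

Code: Select all

import requests

base = "https://thebanks.eu/banks/{}"

found = []
# scan a small ID range as a test; the full range would be 0..100000
for bank_id in range(9560, 9570):
    resp = requests.get(base.format(bank_id), timeout=30)
    if resp.status_code == 200:
        found.append(bank_id)

print("existing bank pages:", found)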
I run this on Colab:

Code: Select all

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape bank data from a detail-page URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")

    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)

# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)
See what comes back:

Code: Select all

Empty DataFrame
Columns: []
Index: []
Well, it seems to me that there is an issue with the scraping process. I tried several different approaches by inspecting the elements on the webpage again and again to make sure I am extracting the correct information from the page.
We should also print out some extra debug information to help diagnose the problem.
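For a first check, a minimal debug sketch - it only assumes the same requests/bs4 setup as above - that prints the HTTP status and the start of the returned HTML; a Cloudflare challenge or an error page would be obvious there:

Code: Select all

import requests
from bs4 import BeautifulSoup

url = "https://thebanks.eu/search?bank=&country=Albania"
response = requests.get(url, timeout=30)

# show what the server actually returned
print("Status code:", response.status_code)
print("Content-Type:", response.headers.get("Content-Type"))
print(response.text[:500])

# count how many of the expected result containers are present
soup = BeautifulSoup(response.content, "html.parser")
print("search-bank divs found:", len(soup.find_all("div", class_="search-bank")))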
Update: good evening dear @Asish M. and @eternal_white,
thank you very much for your comments and for sharing your ideas. Food for thought: as for Selenium - I think that this is a good idea, and for running it (Selenium) on Google Colab I have learned from Jacob Padilla.
@Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla with Google-Colab-Selenium: https://github.com/jpjacobpadilla/Google-Colab-Selenium and the default options:

Code: Select all

The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
• --headless: Runs Chrome in headless mode (without a GUI).
• --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment.
• --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers.
• --lang=en: Sets the language to English.
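Based on the README of that repository, a minimal usage sketch with google-colab-selenium would look roughly like this (the gs.Chrome() call is taken from the project's examples, not verified here):

Code: Select all

# minimal sketch, assuming the google-colab-selenium package is installed in Colab
import google_colab_selenium as gs

driver = gs.Chrome()  # comes preconfigured with the default options listed above
driver.get("https://thebanks.eu/search")
print(driver.title)   # quick sanity check that the page actually loaded
driver.quit()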
Well, I think this approach is worth thinking about, so we can go like so:
Using Selenium in Google Colab to bypass the Cloudflare blocking (that you mentioned, eternal_white) and scrape the desired data looks like a feasible approach. Here are some thoughts on a step-by-step approach and on how to set it up with Jacob Padilla's google-colab-selenium package:

Install google-colab-selenium:
You can install the google-colab-selenium package using pip:

Code: Select all

!pip install google-colab-selenium
We also need to install Selenium itself:

Code: Select all

!pip install selenium

Import the necessary libraries:
Import the required libraries in your Colab notebook:

Code: Select all
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time
And then we need to set up the Selenium WebDriver:
Configure the Chrome WebDriver with the necessary options:

Code: Select all

# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
Here we are going to define the function for scraping:
We define a function to scrape bank data using Selenium:

Code: Select all

def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely

    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}
And then we can go and scrape the data using the defined function:

Code: Select all

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)
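Since the goal is a CSV file in the end, the DataFrame could then simply be written out - the file name below is just an example:

Code: Select all

# write the scraped records to CSV; the file name is arbitrary
df.to_csv("banks_by_country.csv", index=False)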
And, in one single shot:

Code: Select all

# first of all we need to install all the required packages - for example the packages of Jacob's Selenium approach etc.:
!pip install google-colab-selenium
!apt-get update  # update Ubuntu so apt install runs correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely

    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

# Close the WebDriver
driver.quit()
See what I got back on Google Colab:

Code: Select all

TypeError                                 Traceback (most recent call last)

 in ()
19
20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
22
23 # Define function to scrape bank data using Selenium

TypeError: WebDriver.__init__() got multiple values for argument 'options'
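If I read the traceback correctly, recent Selenium versions no longer accept the driver path as a positional argument next to options, so the constructor would probably have to be called roughly like this - a sketch assuming Selenium 4, not yet verified in this notebook:

Code: Select all

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4 style: pass the driver path via Service (or omit it to use Selenium Manager)
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)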


Source: https://stackoverflow.com/questions/781 ... -now-using