BeautifulSoup4 and Pandas return empty DataFrame columns: update: now using Selenium in Google Colab


Post by Guest »


I am looking for a public list of the world's banks.
I don't need branches and full addresses, only the name and the website. I am thinking of data in XML or CSV with the following fields:
bank name, country name or country code (two-letter ISO), website; optional: the city of the bank's headquarters. For each bank, one record per country of presence. By the way: small banks are especially interesting.
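For illustration only, roughly the kind of record layout I have in mind - the rows below are made-up placeholders, not real data:

Code: Select all

import pandas as pd

# purely illustrative placeholder rows showing the intended schema:
# one record per bank per country of presence
sample = pd.DataFrame([
    {"Bank Name": "Example Bank AG", "Country": "CH", "Website": "https://www.example-bank.example", "HQ City": "Zurich"},
    {"Bank Name": "Example Bank AG", "Country": "LI", "Website": "https://www.example-bank.example", "HQ City": "Vaduz"},
])
sample.to_csv("banks_sample.csv", index=False)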
I found a great page that is very, very informative - see - it lists 9,000 banks in Europe:
browse from A to Z:
https://thebanks.eu/search

Code: Select all

**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla

**B**
https://thebanks.eu/search?bank=&country=Belgium

**U**
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom
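These per-country search URLs do not have to be listed by hand; a small sketch that builds them, assuming only that the country parameter is URL-encoded the same way as in the links above:

Code: Select all

from urllib.parse import urlencode

countries = ["Albania", "Andorra", "Anguilla", "Belgium", "Ukraine", "United Kingdom"]

# build the same search URLs as above for each country name
search_urls = ["https://thebanks.eu/search?" + urlencode({"bank": "", "country": c}) for c in countries]
print(search_urls[:3])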
See a detailed page: https://thebanks.eu/banks/9563
I need to have this data:
Contacts

Code: Select all

Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51
071 944 27 52
https://www.bankbiz.ch/
Approach: my approach is to use bs4, requests and pandas.
Btw: perhaps we can count from zero to 100,000 in order to get all the banks that are stored in the db:
See a detailed page: https://thebanks.eu/banks/9563
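A rough sketch of that ID-counting idea - assuming the detail pages follow the /banks/<id> pattern and that missing IDs come back with a non-200 status; untested:

Code: Select all

import requests

base = "https://thebanks.eu/banks/{}"

found = []
# scan a small ID range as a test; the full range would be 0..100000
for bank_id in range(9560, 9570):
    resp = requests.get(base.format(bank_id), timeout=30)
    if resp.status_code == 200:
        found.append(bank_id)

print("existing bank pages:", found)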
I run this on Colab:

Code: Select all

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape bank data from a detail-page URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")

    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)

# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)
See what comes back:

Code: Select all

Empty DataFrame
Columns: []
Index: []
Well, it seems to me that there is an issue with the scraping process. I tried several different approaches by inspecting the elements on the webpage again and again to make sure I am extracting the correct information from the page.
We should also print out some extra debug information to help diagnose the problem.
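For a first check, a minimal debug sketch - it only assumes the same requests/bs4 setup as above - that prints the HTTP status and the start of the returned HTML; a Cloudflare challenge or an error page would be obvious there:

Code: Select all

import requests
from bs4 import BeautifulSoup

url = "https://thebanks.eu/search?bank=&country=Albania"
response = requests.get(url, timeout=30)

# show what the server actually returned
print("Status code:", response.status_code)
print("Content-Type:", response.headers.get("Content-Type"))
print(response.text[:500])

# count how many of the expected result containers are present
soup = BeautifulSoup(response.content, "html.parser")
print("search-bank divs found:", len(soup.find_all("div", class_="search-bank")))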
Update: good evening dear @Asish M. and @eternal_white,
thank you very much for your comments and for sharing your ideas. Food for thought: as for Selenium - I think that this is a good idea, and for running it (Selenium) on Google Colab I have learned from Jacob Padilla.
@Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla with Google-Colab-Selenium: https://github.com/jpjacobpadilla/Google-Colab-Selenium and the default options:

Code: Select all

The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
• --headless: Runs Chrome in headless mode (without a GUI).
• --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment.
• --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers.
• --lang=en: Sets the language to English.
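Based on the README of that repository, a minimal usage sketch with google-colab-selenium would look roughly like this (the gs.Chrome() call is taken from the project's examples, not verified here):

Code: Select all

# minimal sketch, assuming the google-colab-selenium package is installed in Colab
import google_colab_selenium as gs

driver = gs.Chrome()  # comes preconfigured with the default options listed above
driver.get("https://thebanks.eu/search")
print(driver.title)   # quick sanity check that the page actually loaded
driver.quit()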
Well, I think this approach is worth thinking about, so we can go like so:
Using Selenium in Google Colab to bypass the Cloudflare blocking (that you mentioned, eternal_white) and scrape the desired data looks like a feasible approach. Here are some thoughts on a step-by-step approach and on how to set it up with Jacob Padilla's google-colab-selenium package:

Install google-colab-selenium:
You can install the google-colab-selenium package using pip:

Code: Select all

!pip install google-colab-selenium
We also need to install Selenium itself:

Code: Select all

!pip install selenium

Import the necessary libraries:
Import the required libraries in your Colab notebook:

Code: Select all
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time
And then we need to set up the Selenium WebDriver:
Configure the Chrome WebDriver with the necessary options:

Code: Select all

# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
Here we are going to define the function for scraping:
We define a function to scrape bank data using Selenium:

Code: Select all

def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely

    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}
And then we can go and scrape the data using the defined function:

Code: Select all

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)
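Since the goal is a CSV file in the end, the DataFrame could then simply be written out - the file name below is just an example:

Code: Select all

# write the scraped records to CSV; the file name is arbitrary
df.to_csv("banks_by_country.csv", index=False)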
And, in one single shot:

Code: Select all

# first of all we need to install all the required packages - for example the packages of Jacob's Selenium approach etc.:
!pip install google-colab-selenium
!apt-get update  # update Ubuntu so apt install runs correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely

    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

# Close the WebDriver
driver.quit()
See what I got back on Google Colab:

Code: Select all

TypeError                                 Traceback (most recent call last)

 in ()
19
20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
22
23 # Define function to scrape bank data using Selenium

TypeError: WebDriver.__init__() got multiple values for argument 'options'
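If I read the traceback correctly, recent Selenium versions no longer accept the driver path as a positional argument next to options, so the constructor would probably have to be called roughly like this - a sketch assuming Selenium 4, not yet verified in this notebook:

Code: Select all

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4 style: pass the driver path via Service (or omit it to use Selenium Manager)
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)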


Source: https://stackoverflow.com/questions/781 ... -now-using