I am looking for a public list of banks worldwide.
I don't need branches and full addresses, just the name and the website. I am thinking of data... XML, CSV... with the following fields:
bank name, country name or country code (two-letter ISO), website; optional: city of the bank's headquarters. For each bank, one record per country of presence. By the way: small banks are of particular interest.
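For example, a CSV along those lines might look like this (the rows below are purely hypothetical placeholders, not real records):

Code: Select all
bank_name,country_code,website,hq_city
Example Bank One,AL,https://www.example-bank-one.example,Tirana
Example Bank Two,AD,https://www.example-bank-two.example,Andorra la Vella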
I found a great page that is very, very informative - see - it lists 9000 banks in Europe:
browse from A to Z:
https://thebanks.eu/search
Code: Select all
**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla
**B**
https://thebanks.eu/search?bank=&country=Belgium
**U**
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom
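Since these search URLs only differ in the country parameter, the whole A-to-Z list could presumably be generated instead of typed by hand; a minimal sketch (the country list here is just a sample, not the full set):

Code: Select all
from urllib.parse import quote_plus

# sample only - the real page covers many more countries from A to Z
countries = ["Albania", "Andorra", "Anguilla", "Belgium", "Ukraine", "United Kingdom"]
urls = [f"https://thebanks.eu/search?bank=&country={quote_plus(c)}" for c in countries]
for url in urls:
    print(url)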
I need to have this data:
Contacts
Code: Select all
Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51 / 071 944 27 52
https://www.bankbiz.ch/
btw: perhaps we can count from zero to 100 000 - in order to get all the banks that are stored in the db:
see a detailed page: https://thebanks.eu/banks/9563
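A minimal sketch of that ID-counting idea, assuming the detail pages really do live at dense integer IDs under /banks/ and that plain requests are not blocked (Cloudflare may well get in the way here):

Code: Select all
import time
import requests

# hypothetical upper bound, taken from the 0..100000 idea above
for bank_id in range(0, 100_001):
    url = f"https://thebanks.eu/banks/{bank_id}"
    resp = requests.get(url, timeout=10)
    if resp.status_code == 200:
        print(f"{bank_id}: page exists - parse it here")
    time.sleep(1)  # be polite: at most one request per second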
I ran this on Colab:
Code: Select all
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape bank data from a detail-page URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")
    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)

# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)
Code: Select all
Empty DataFrame
Columns: []
Index: []
Well, I should also print out some extra debug information to help diagnose the problem.
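A minimal way to add that debug output to the loop above, so we can see whether the pages come back at all and whether the "search-bank" selector matches anything:

Code: Select all
for url in urls:
    response = requests.get(url)
    print(f"URL: {url}")
    print(f"Status code: {response.status_code}")
    print(f"Response size: {len(response.content)} bytes")
    soup = BeautifulSoup(response.content, "html.parser")
    print(f"Page title: {soup.title.string if soup.title else 'no <title>'}")
    bank_links = soup.find_all("div", class_="search-bank")
    print(f"Matched div.search-bank elements: {len(bank_links)}")

If the status code is 403 or the page title mentions a challenge, the requests are being blocked before any HTML parsing even matters.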
Update: good evening, dear @Asish M. and @eternal_white.
Thank you very much for your comments and for sharing your ideas. Food for thought: as for Selenium - I think this is a good idea - and for running it (Selenium) on Google Colab I have learned from Jacob Padilla.
@Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla with Google-Colab-Selenium: https://github.com/jpjacobpadilla/Google-Colab-Selenium and the default options:
Code: Select all
The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
• --headless: Runs Chrome in headless mode (without a GUI).
• --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment.
• --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers.
• --lang=en: Sets the language to English.
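With those defaults baked in, the basic usage of the package is quite short; a sketch based on the README of Google-Colab-Selenium:

Code: Select all
!pip install google-colab-selenium

import google_colab_selenium as gs

driver = gs.Chrome()  # comes preconfigured with the default options listed above
driver.get("https://thebanks.eu/search")
print(driver.title)
driver.quit()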
Using Selenium in Google Colab to bypass the Cloudflare blocking (that you mentioned, eternal_white) and scrape the desired data could be a nice and feasible approach. Here are some thoughts on a step-by-step approach and how to set it up with Jacob Padilla's google-colab-selenium package:
Install google-colab-selenium:
You can install the google-colab-selenium package using pip:

Code: Select all
!pip install google-colab-selenium

We also need to install Selenium:

Code: Select all
!pip install selenium
Import Necessary Libraries:
Import the required libraries in your Colab notebook:

Code: Select all
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time
Configure the Chrome WebDriver with the necessary options:
Code: Select all
# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
We define a function to scrape bank data using Selenium:

Code: Select all
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all, we let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
Code: Select all
import pandas as pd  # needed below; not part of the import cell above

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)
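As a side note, the fixed time.sleep(5) could be replaced with an explicit wait, which returns as soon as the element shows up; a sketch, assuming the 'entry-title' class actually exists on the detail pages:

Code: Select all
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_bank_data_with_selenium(url):
    driver.get(url)
    # wait up to 10 seconds for the bank name element instead of sleeping blindly
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "entry-title"))
    )
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    return {"Bank Name": bank_name, "Country": country, "Website": website}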
Code: Select all
# first of all we need to install all the required packages - e.g. the packages for Jacob's Selenium approach:
!pip install google-colab-selenium
!apt-get update  # update Ubuntu so that apt install runs correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

# Close the WebDriver
driver.quit()
Code: Select all
TypeError Traceback (most recent call last)
in ()
19
20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
22
23 # Define function to scrape bank data using Selenium
TypeError: WebDriver.__init__() got multiple values for argument 'options'
Source: https://stackoverflow.com/questions/781 ... -now-using
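The TypeError comes from the Selenium 4 API change discussed there: webdriver.Chrome() no longer takes the driver path as its first positional argument, so 'chromedriver' collides with the options keyword. A sketch of the usual fix:

Code: Select all
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4: wrap the driver path in a Service object ...
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)
# ... or omit the path entirely and let Selenium locate the driver itself:
# driver = webdriver.Chrome(options=chrome_options)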