Python web scrape showing no data, only headers

Posted by Anonymous » 15 Jan 2025, 22:58

I found some Python source code at https://www.actowizsolutions.com/scrape-freedom-of-information-request-portals-data.php.
I started using it for my scraping project, but when I run the script the output contains only the two column headers, Question and Answer, with no data.

Code:

# Directory to start in
workingdirectory = '/home/webscrape'

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the page containing the list of requests
url = "https://www.whatdotheyknow.com/search/O365/all"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Find links to individual request pages
    request_links = soup.find_all("a", class_="search-result-heading")

    # Initialize lists to store data
    questions = []
    answers = []

    # Loop through each request link and extract data
    for link in request_links:
        request_url = link["href"]
        response = requests.get(request_url)

        if response.status_code == 200:
            request_soup = BeautifulSoup(response.text, "html.parser")

            # Extract the question (title) and answer
            question = request_soup.find("h2", class_="name").text.strip()
            answer = request_soup.find("div", class_="public-description").text.strip()

            questions.append(question)
            answers.append(answer)

    # Create a DataFrame to store the data
    data = {"Question": questions, "Answer": answers}
    df = pd.DataFrame(data)

    # Save the data to CSV or XLSX
    df.to_csv("whatdotheyknow_data.csv", index=False)
    # df.to_excel("whatdotheyknow_data.xlsx", index=False, engine="openpyxl")

    print("Data scraped and saved successfully.")
else:
    print("Failed to retrieve the web page.")

# Close the most recent HTTP response
response.close()
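A header-only CSV is what pandas writes when both lists are still empty at the point the DataFrame is built. One plausible reading is therefore that the loop body never ran, that is, `request_links` came back empty. The sketch below reproduces only that last step, with empty lists standing in for the script's state; it makes no network requests and writes to an in-memory buffer rather than to `whatdotheyknow_data.csv`.

```python
import io

import pandas as pd

# Empty lists stand in for what the script holds if the loop body never runs.
questions = []
answers = []

df = pd.DataFrame({"Question": questions, "Answer": answers})

# Write to an in-memory buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(repr(csv_text))  # the header row only, no data rows
print(len(df))         # number of data rows in the DataFrame
```

Printing `len(request_links)` right after the `find_all` call in the script would confirm or rule out this explanation.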

I know the data is there, because I can see the results on the web page and also in the page source (see below), using view-source on https://www.whatdotheyknow.com/search/O365/all.

Code:

FOI requests 1 to 25 of about 1000

[url=/request/malicious_email_volume_93#incoming-1926496]Malicious email volume[/url]

Response by [url=https://www.whatdotheyknow.com/body/ccrc]Criminal Cases Review Commission[/url] to [url=https://www.whatdotheyknow.com/user/rebecca_moody]Rebecca Moody[/url] on 30 November 2021.
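The request link in this excerpt is relative (`/request/...`), while `requests.get` expects an absolute URL with a scheme and host. A minimal sketch of resolving such an href against the search page URL, using only the standard library, might look like this; the two strings are copied from the script and the excerpt, and the variable names are illustrative.

```python
from urllib.parse import urldefrag, urljoin

# Search page URL from the script.
search_url = "https://www.whatdotheyknow.com/search/O365/all"

# href value copied from the page-source excerpt; it has no scheme or host.
href = "/request/malicious_email_volume_93#incoming-1926496"

# Resolve the relative href against the page it was found on.
absolute_url = urljoin(search_url, href)

# Optionally drop the "#incoming-..." fragment, which is not sent to the server.
request_url, fragment = urldefrag(absolute_url)

print(absolute_url)
print(request_url)
```

In the script, this would mean passing `urljoin(url, link["href"])` to `requests.get` instead of the raw `link["href"]`.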

Could anyone point out what I am doing wrong, or what small changes the code needs?
Thanks in advance.
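One way to narrow this down is to test the selector against a small piece of markup before running the full script. The sketch below is only an illustration: `sample_html` is hypothetical markup modeled on the excerpt above, and its tag and class names are assumptions that should be checked against the real page source. It compares the script's class-based lookup with a lookup based on the shape of the href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the excerpt; the real page's tags and
# class names may differ and should be checked in view-source.
sample_html = """
<div class="request_listing">
  <span class="head">
    <a href="/request/malicious_email_volume_93#incoming-1926496">Malicious email volume</a>
  </span>
  <span class="desc">Response by
    <a href="https://www.whatdotheyknow.com/body/ccrc">Criminal Cases Review Commission</a>
  </span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The lookup from the script: it matches only if an <a> carries this class.
by_class = soup.find_all("a", class_="search-result-heading")

# Alternative: pick anchors whose href starts with "/request/".
by_href = soup.find_all(
    "a", href=lambda h: h is not None and h.startswith("/request/")
)

print(len(by_class), len(by_href))
for a in by_href:
    print(a["href"], a.get_text(strip=True))
```

If the class-based lookup also returns an empty list on the live page, the loop in the script has nothing to iterate over, which would match the header-only output.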

More details here: https://stackoverflow.com/questions/79359515/python-web-scrape-showing-no-data-only-headers
