I started using this for my scraping project, but when I run the script the output contains only the two headers, Question and Answer, with no data.
Code:
# Directory to start in (not used anywhere below)
workingdirectory = '/home/webscrape'

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the page containing the list of requests
url = "https://www.whatdotheyknow.com/search/O365/all"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Find links to individual request pages.
    # If this list is empty, the loop below never runs and the CSV
    # ends up with only the header row.
    request_links = soup.find_all("a", class_="search-result-heading")

    # Initialize lists to store data
    questions = []
    answers = []

    # Loop through each request link and extract data
    for link in request_links:
        # The hrefs on the page are relative, so resolve them against the page URL
        request_url = urljoin(url, link["href"])
        request_response = requests.get(request_url)
        if request_response.status_code == 200:
            request_soup = BeautifulSoup(request_response.text, "html.parser")
            # Extract the question (title) and answer
            title_tag = request_soup.find("h2", class_="name")
            answer_tag = request_soup.find("div", class_="public-description")
            # find() returns None when nothing matches, so skip such pages
            if title_tag is None or answer_tag is None:
                continue
            questions.append(title_tag.text.strip())
            answers.append(answer_tag.text.strip())

    # Create a DataFrame to store the data
    data = {"Question": questions, "Answer": answers}
    df = pd.DataFrame(data)

    # Save the data to CSV or XLSX
    df.to_csv("whatdotheyknow_data.csv", index=False)
    # df.to_excel("whatdotheyknow_data.xlsx", index=False, engine="openpyxl")
    print("Data scraped and saved successfully.")
else:
    print("Failed to retrieve the web page.")

# Release the connection held by the search-page response
response.close()
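To illustrate the symptom, here is a minimal sketch that runs offline. It assumes the class name passed to find_all does not occur in the page. The markup in SAMPLE_HTML and the helper scrape_titles are made up for illustration; the real search page may be structured differently.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: the real search page may use different tags and classes.
SAMPLE_HTML = """
<div class="listing">
  <a href="/request/malicious_email_volume_93#incoming-1926496">Malicious email volume</a>
</div>
"""


def scrape_titles(html, link_class):
    """Collect the text of anchors carrying link_class and return (links, csv_text)."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_=link_class)
    # Report how many links matched before building the table.
    print(f"{len(links)} link(s) matched class {link_class!r}")
    questions = [link.get_text(strip=True) for link in links]
    answers = ["" for _ in links]  # placeholder: answers are not scraped in this sketch
    df = pd.DataFrame({"Question": questions, "Answer": answers})
    return links, df.to_csv(index=False)


links, csv_text = scrape_titles(SAMPLE_HTML, "search-result-heading")
print(csv_text)
```

When no anchor carries the requested class, find_all returns an empty list, both columns stay empty, and the CSV text is just the header line. Printing len(request_links) in the original script would show whether the same thing is happening there.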
Code:
FOI requests 1 to 25 of about 1000
[url=/request/malicious_email_volume_93#incoming-1926496]Malicious email volume[/url]
Response by [url=https://www.whatdotheyknow.com/body/ccrc]Criminal Cases Review Commission[/url] to [url=https://www.whatdotheyknow.com/user/rebecca_moody]Rebecca Moody[/url] on 30 November 2021.
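The first link in the listing above is relative (/request/...), and requests.get needs an absolute URL with a scheme. A minimal sketch of resolving it with the standard library follows; the helper name absolute_request_url is hypothetical, and dropping the #fragment is my own choice rather than anything the site requires.

```python
from urllib.parse import urldefrag, urljoin

# The search page URL used in the script above.
BASE_URL = "https://www.whatdotheyknow.com/search/O365/all"


def absolute_request_url(href, base=BASE_URL):
    """Resolve href against the search page URL and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(base, href))
    return url


print(absolute_request_url("/request/malicious_email_volume_93#incoming-1926496"))
# prints https://www.whatdotheyknow.com/request/malicious_email_volume_93
```

Links that are already absolute, such as the body and user links in the listing, pass through urljoin unchanged.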
Thanks in advance.
More details here: https://stackoverflow.com/questions/793 ... ly-headers