Я не могу найти URL -адрес запроса с информацией JSON во время соскоба в Интернете

Я не могу найти URL -адрес запроса с информацией JSON во время соскоба в Интернете ⇐ Html

1 сообщение • Страница 1 из 1

Anonymous

Я не могу найти URL -адрес запроса с информацией JSON во время соскоба в Интернете

Цитата

Сообщение Anonymous » 15 мар 2025, 20:00

Im trying to scrape table from this website using bs4 and request libraries, but I couldnt find any relevant information in XHR or JS sections of Chrome inspect and find the json file.
I was hoping to find an API gateway of it from XHR or JS of website, but I didn`t manage to find anything there, so i decided to scrape data from each page using this code (with the help of CHATGPT), который извлекает данные с каждой страницы и сохраняет их в файл .csv: < /p>
async def scrape_page(session, page):
"""Scrape data from a single page."""
url = BASE_URL + str(page)
html = await fetch_page(session, url, page)

if not html:
logging.error(f"

Failed to fetch page {page}. Skipping...")
return [], None # Skip if page fails

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": lambda x: x and x.startswith("guid-")})

if not table:
logging.warning(f"

No table found on page {page}. Check if the structure has changed!")
return [], None

titles = [th.text.strip() for th in table.find_all("th")]
rows = table.find_all("tr")[1:] # Skip first row (headers)
data = [[td.text.strip() for td in row.find_all("td")] for row in rows]

print(f"

Scraped {len(data)} records from page {page}") # DEBUG PRINT
return data, titles
< /code>
То, с чем я столкнулся, - это то, что слишком много страниц (приблизительно 20 000 страниц с 15 рядами данных на каждой странице), и мне слишком много времени требуется, чтобы соскребить все это. Есть предложения о том, как я мог бы оптимизировать этот процесс?

Подробнее здесь: https://stackoverflow.com/questions/795 ... b-scraping

1742058030

Anonymous

Im trying to scrape table from this website using bs4 and request libraries, but I couldnt find any relevant information in XHR or JS sections of Chrome inspect and find the json file.
I was hoping to find an API gateway of it from XHR or JS of website, but I didn`t manage to find anything there, so i decided to scrape data from each page using this code (with the help of CHATGPT), который извлекает данные с каждой страницы и сохраняет их в файл .csv: < /p>
async def scrape_page(session, page):
"""Scrape data from a single page."""
url = BASE_URL + str(page)
html = await fetch_page(session, url, page)

if not html:
logging.error(f"❌ Failed to fetch page {page}. Skipping...")
return [], None  # Skip if page fails

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": lambda x: x and x.startswith("guid-")})

if not table:
logging.warning(f"⚠️ No table found on page {page}. Check if the structure has changed!")
return [], None

titles = [th.text.strip() for th in table.find_all("th")]
rows = table.find_all("tr")[1:]  # Skip first row (headers)
data = [[td.text.strip() for td in row.find_all("td")] for row in rows]

print(f"✅ Scraped {len(data)} records from page {page}")  # DEBUG PRINT
return data, titles
< /code>
То, с чем я столкнулся, - это то, что слишком много страниц (приблизительно 20 000 страниц с 15 рядами данных на каждой странице), и мне слишком много времени требуется, чтобы соскребить все это. Есть предложения о том, как я мог бы оптимизировать этот процесс?  

Подробнее здесь: [url]https://stackoverflow.com/questions/79511471/i-cant-find-request-url-with-json-information-while-web-scraping[/url]