I've attached a log of what happens below:
Code: Select all
2024-10-21 17:43:47: [INFO] Spider opened
2024-10-21 17:43:47: [INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-10-21 17:43:47: [INFO] Telnet console listening on 127.0.0.1:6027
2024-10-21 17:43:50: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22177 using 0 to output -1
2024-10-21 17:43:51: [DEBUG] Crawled (200) (referer: None)
2024-10-21 17:43:54: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22180 using 0 to output -1
2024-10-21 17:43:55: [DEBUG] Crawled (200) (referer: None)
2024-10-21 17:43:55: [DEBUG] Crawled (200) (referer: chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html)
To make things more interesting, I ran the code on another computer, and there it worked fine, with the same package versions: Scrapy 2.11.2 and SeleniumBase 4.28.5.
This is the spider:
Code: Select all
from scrapy import Request, Spider
from scrapy.http.response.html import HtmlResponse


class Production(Spider):
    name = "atp_production"
    start_urls = [
        "https://www.atptour.com/en/-/tournaments/calendar/tour",
        "https://www.atptour.com/en/-/tournaments/calendar/challenger",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                callback=self._parse_calendar,
                meta=dict(dont_redirect=True),
            )

    def _parse_calendar(self, response: HtmlResponse):
        json_str = response.xpath("//body//text()").get()
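Since `_parse_calendar` pulls the body text out as a JSON string, the natural next step is decoding it. A minimal sketch of that decoding, assuming the endpoint returns plain JSON in the body (the `parse_calendar_body` helper and the sample payload are hypothetical, not from the original spider):

```python
import json


def parse_calendar_body(json_str: str) -> dict:
    """Decode the JSON payload extracted from the response body text."""
    return json.loads(json_str)


# Hypothetical sample payload for illustration only; the real endpoint's
# schema is not shown in the original post.
sample = '{"TournamentDates": []}'
data = parse_calendar_body(sample)
```

If the endpoint ever returns an HTML error page instead of JSON (e.g. a bot-detection interstitial), `json.loads` will raise `json.JSONDecodeError`, which is one way to tell the two machines' responses apart.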
Code: Select all
from typing import Any

import seleniumbase as sb  # exposes sb.Driver
from scrapy import Request, Spider, signals
from scrapy.crawler import Crawler
from scrapy.http.response.html import HtmlResponse


class SeleniumBase:
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def __init__(self, settings: dict[str, Any]) -> None:
        self.driver = sb.Driver(
            uc=settings.get("UNDETECTABLE", None),
            headless=settings.get("HEADLESS", None),
            user_data_dir=settings.get("USER_DATA_DIR", None),
        )

    def spider_closed(self, *_) -> None:
        self.driver.quit()

    def process_request(self, request: Request, spider: Spider) -> HtmlResponse:
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
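For completeness, here is a sketch of how this downloader middleware might be wired up in the project's `settings.py`. The module path `myproject.middlewares` and the priority `543` are assumptions (they are not shown in the original post); the three custom keys match what the middleware's `__init__` reads via `settings.get(...)`:

```python
# Hypothetical settings.py fragment; module path and priority are assumed.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumBase": 543,
}

# Custom keys consumed by SeleniumBase.__init__ via settings.get(...)
UNDETECTABLE = True    # forwarded to sb.Driver(uc=...)
HEADLESS = True        # forwarded to sb.Driver(headless=...)
USER_DATA_DIR = None   # forwarded to sb.Driver(user_data_dir=...)
```

A difference in these settings (or in the Chrome profile under `USER_DATA_DIR`) between the two machines would be one plausible place to look for the diverging behavior.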
More details here: https://stackoverflow.com/questions/791 ... n-urls-tha