So I'm here trying to learn new tricks in my spare time and I ended up with Python. Scraping information from websites could even be of value to my actual work (I'm mainly an ER doctor but I also do clinical audits).
The code below was produced after reading a lot of documentation, with occasional help from people who know a lot more than I do.
```python
import bs4
import requests
import pandas as pd


# revistas(): get the links to each individual journal issue
def revistas():
    f = requests.get(url)
    bs = bs4.BeautifulSoup(f.content, "lxml")
    for revista in bs.find_all('a', class_='title'):
        revistas_links.append(revista['href'])
    return revistas_links


# artigos(): from each journal issue, fetch all individual articles listed
def artigos():
    revistas_links = revistas()
    n_revistas = len(revistas_links)
    print('Número de revistas encontradas:', n_revistas)
    for index, url in enumerate(revistas_links, 1):
        print('A obter artigos da revista n:', index, 'de', n_revistas, 'em', url)
        revista = requests.get(url)
        texto = bs4.BeautifulSoup(revista.content, "lxml")
        data = texto.find_all('h3', attrs={'class': 'title'})
        for div in data:
            for a in div.find_all('a'):
                links_artigos.append(a['href'])
    return links_artigos


# dados_artigos(): the metadata extraction code
def dados_artigos():
    links_artigos = artigos()
    n_artigos = len(links_artigos)
    print('Número de artigos encontrados:', n_artigos)
    for index, url in enumerate(links_artigos, 1):
        print('A obter dados do artigo n:', index, 'de', n_artigos, 'em', url)
        artigo_web = requests.get(url)
        dom = bs4.BeautifulSoup(markup=artigo_web.text, features='lxml')
        revista = dom.select_one('nav.cmp_breadcrumbs li:nth-of-type(3) a').text.strip()
        publicado = dom.select_one('div.published span').text.strip()
        titulo = dom.select_one('h1.page_title').text.strip()
        seccao = dom.select_one('nav.cmp_breadcrumbs li:nth-of-type(4) span').text.strip()
        citacao = dom.select_one('div.csl-entry').text.strip()
        for name in dom.find_all(name='span', class_='name'):
            name_str = name.text.strip()
            # Search through the following span siblings for an affiliation tag
            for affiliation in name.find_next_siblings(name='span'):
                class_ = affiliation.attrs.get('class', [''])[0]
                if class_ == 'affiliation':
                    # The nearest span sibling is an affiliation, so use it
                    yield revista, publicado, titulo, seccao, name_str, affiliation.text.strip(), citacao
                    break
                elif class_ == 'name':
                    # We reached the next author's name, so there is no affiliation
                    yield revista, publicado, titulo, seccao, name_str, None, citacao
                    break
            else:
                # No matching span sibling was found, so there is no affiliation
                yield revista, publicado, titulo, seccao, name_str, None, citacao


revistas_links = []
links_artigos = []
url = 'https://rpmgf.pt/ojs/index.php/rpmgf/issue/archive'

df = pd.DataFrame(data=dados_artigos(),
                  columns=['Revista', 'Data de Publicação', 'Artigo', 'Secçao',
                           'Autor', 'Afiliação', 'Citação'])
df.to_csv('file_name.csv', sep='|', encoding='utf-8')
```

It does the job!
The "Problem"
I believe the code could be better (my attempt at restructuring is sketched after this list):
- Even though the DataFrame choice was an easy way of producing a CSV file, I feel it may add unnecessary steps.
- I also could not make it work without declaring the variables at module level (revistas_links = [] and links_artigos = []).
- There is no __main__ guard anywhere.
- And I know I don't really need three functions, but it was a way of learning how to pass values from one function to another.
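From what I've read so far, I think all four points could be tackled at once: each function takes its input as a parameter and returns (or yields) its output, the CSV is written with the built-in csv module instead of pandas, and the wiring lives under a __main__ guard. Here is my minimal sketch of that restructuring (not a drop-in replacement): extrair_linhas is a hypothetical placeholder for the per-article parsing body shown above, so the sketch stays short, and dados_artigos now receives the links instead of calling artigos() itself.

```python
import csv

import bs4
import requests

ARCHIVE_URL = 'https://rpmgf.pt/ojs/index.php/rpmgf/issue/archive'


def revistas(url):
    """Return the issue links instead of appending to a module-level list."""
    soup = bs4.BeautifulSoup(requests.get(url).content, 'lxml')
    return [a['href'] for a in soup.find_all('a', class_='title')]


def artigos(issue_links):
    """Return the article links from every issue, again via a plain return."""
    links = []
    for url in issue_links:
        soup = bs4.BeautifulSoup(requests.get(url).content, 'lxml')
        for h3 in soup.find_all('h3', class_='title'):
            links.extend(a['href'] for a in h3.find_all('a'))
    return links


def dados_artigos(article_links):
    """Same generator as above, reworked to take the links as a parameter."""
    for url in article_links:
        dom = bs4.BeautifulSoup(requests.get(url).text, 'lxml')
        # extrair_linhas is a hypothetical helper holding the parsing body above
        yield from extrair_linhas(dom)


def main():
    links = artigos(revistas(ARCHIVE_URL))
    # csv.writer produces the same pipe-separated file without pandas
    with open('file_name.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(['Revista', 'Data de Publicação', 'Artigo', 'Secçao',
                         'Autor', 'Afiliação', 'Citação'])
        writer.writerows(dados_artigos(links))


if __name__ == '__main__':
    main()
```

At least on paper this removes both module-level lists, since each function now receives what it needs as an argument.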
To add complexity, I'm still figuring out how I can use len(links_artigos) in a new function to print a progress bar or something like that. I've seen code that does this, but I'm nowhere near knowing how to implement it (my current best guess is sketched below).
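From the tqdm documentation it seems a progress bar could be as simple as wrapping the list, so the length never needs to be passed around explicitly. A tiny sketch, assuming links_artigos is the list returned by artigos() and parse_article is a hypothetical stand-in for the per-article work:

```python
from tqdm import tqdm  # third-party: pip install tqdm

# tqdm wraps any iterable and prints a bar with counts, rate, and ETA
for url in tqdm(links_artigos, desc='A obter dados dos artigos'):
    parse_article(url)  # hypothetical stand-in for the per-article work
```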
So... any comments on my code? What can be improved (or what should I read next)?