So I'm here trying to learn new tricks in my spare time and I ended up with Python. Scraping information from websites could even be of value to my actual work (I'm mainly an ER doctor but I also do clinical audits).
The code below was produced after reading a lot of documentation, with occasional help from people who know a lot more than I do.
```python
import bs4
import requests
import pandas as pd


# revistas(): get the links to each individual journal issue
def revistas():
    f = requests.get(url)
    bs = bs4.BeautifulSoup(f.content, "lxml")
    for revista in bs.find_all('a', class_='title'):
        revistas_links.append(revista['href'])
    return revistas_links


# artigos(): from each journal issue, fetch all individual articles listed
def artigos():
    revistas_links = revistas()
    n_revistas = len(revistas_links)
    print('Número de revistas encontradas:', n_revistas)
    for index, url in enumerate(revistas_links, 1):
        print('A obter artigos da revista n:', index, 'de', n_revistas, 'em', url)
        revista = requests.get(url)
        texto = bs4.BeautifulSoup(revista.content, "lxml")
        data = texto.find_all('h3', attrs={'class': 'title'})
        for div in data:
            for a in div.find_all('a'):
                links_artigos.append(a['href'])
    return links_artigos


# dados_artigos(): the metadata extraction code
def dados_artigos():
    links_artigos = artigos()
    n_artigos = len(links_artigos)
    print('Número de artigos encontrados:', n_artigos)
    for index, url in enumerate(links_artigos, 1):
        print('A obter dados do artigo n:', index, 'de', n_artigos, 'em', url)
        artigo_web = requests.get(url)
        dom = bs4.BeautifulSoup(markup=artigo_web.text, features='lxml')
        revista = dom.select_one('nav.cmp_breadcrumbs li:nth-of-type(3) a').text.strip()
        publicado = dom.select_one('div.published span').text.strip()
        titulo = dom.select_one('h1.page_title').text.strip()
        seccao = dom.select_one('nav.cmp_breadcrumbs li:nth-of-type(4) span').text.strip()
        citacao = dom.select_one('div.csl-entry').text.strip()
        for name in dom.find_all(name='span', class_='name'):
            name_str = name.text.strip()
            # Search through the following span siblings for an affiliation tag
            for affiliation in name.find_next_siblings(name='span'):
                class_ = affiliation.attrs.get('class', [''])[0]
                if class_ == 'affiliation':
                    # The nearest span sibling is an affiliation, so use it
                    yield revista, publicado, titulo, seccao, name_str, affiliation.text.strip(), citacao
                    break
                elif class_ == 'name':
                    # We reached the next author's name, so there is no affiliation
                    yield revista, publicado, titulo, seccao, name_str, None, citacao
                    break
            else:
                # No matching span sibling was found, so there is no affiliation
                yield revista, publicado, titulo, seccao, name_str, None, citacao


revistas_links = []
links_artigos = []
url = 'https://rpmgf.pt/ojs/index.php/rpmgf/issue/archive'

df = pd.DataFrame(data=dados_artigos(),
                  columns=['Revista', 'Data de Publicação', 'Artigo', 'Secçao',
                           'Autor', 'Afiliação', 'Citação'])
df.to_csv('file_name.csv', sep='|', encoding='utf-8')
```

It does the job!
The "Problem"
I believe the code could be better (my attempt at restructuring is sketched after this list):
- Even though the DataFrame choice was an easy way of producing a CSV file, I feel it may add unnecessary steps.
- I also could not make it work without declaring the variables at module level (revistas_links = [] and links_artigos = []).
- There is no __main__ guard anywhere.
- And I know I don't really need three functions, but it was a way of learning how to pass values from one function to another.
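From what I've read so far, I think all four points could be tackled at once: each function takes its input as a parameter and returns (or yields) its output, the CSV is written with the built-in csv module instead of pandas, and the wiring lives under a __main__ guard. Here is my minimal sketch of that restructuring (not a drop-in replacement): extrair_linhas is a hypothetical placeholder for the per-article parsing body shown above, so the sketch stays short, and dados_artigos now receives the links instead of calling artigos() itself.

```python
import csv

import bs4
import requests

ARCHIVE_URL = 'https://rpmgf.pt/ojs/index.php/rpmgf/issue/archive'


def revistas(url):
    """Return the issue links instead of appending to a module-level list."""
    soup = bs4.BeautifulSoup(requests.get(url).content, 'lxml')
    return [a['href'] for a in soup.find_all('a', class_='title')]


def artigos(issue_links):
    """Return the article links from every issue, again via a plain return."""
    links = []
    for url in issue_links:
        soup = bs4.BeautifulSoup(requests.get(url).content, 'lxml')
        for h3 in soup.find_all('h3', class_='title'):
            links.extend(a['href'] for a in h3.find_all('a'))
    return links


def dados_artigos(article_links):
    """Same generator as above, reworked to take the links as a parameter."""
    for url in article_links:
        dom = bs4.BeautifulSoup(requests.get(url).text, 'lxml')
        # extrair_linhas is a hypothetical helper holding the parsing body above
        yield from extrair_linhas(dom)


def main():
    links = artigos(revistas(ARCHIVE_URL))
    # csv.writer produces the same pipe-separated file without pandas
    with open('file_name.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(['Revista', 'Data de Publicação', 'Artigo', 'Secçao',
                         'Autor', 'Afiliação', 'Citação'])
        writer.writerows(dados_artigos(links))


if __name__ == '__main__':
    main()
```

At least on paper this removes both module-level lists, since each function now receives what it needs as an argument.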
To add complexity, I'm still figuring out how I can use len(links_artigos) in a new function to print a progress bar or something like that. I've seen code that does this, but I'm nowhere near knowing how to implement it (my current best guess is sketched below).
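From the tqdm documentation it seems a progress bar could be as simple as wrapping the list, so the length never needs to be passed around explicitly. A tiny sketch, assuming links_artigos is the list returned by artigos() and parse_article is a hypothetical stand-in for the per-article work:

```python
from tqdm import tqdm  # third-party: pip install tqdm

# tqdm wraps any iterable and prints a bar with counts, rate, and ETA
for url in tqdm(links_artigos, desc='A obter dados dos artigos'):
    parse_article(url)  # hypothetical stand-in for the per-article work
```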
So... any comments on my code? What can be improved (or what should I read next)?