The pymupdf library let me convert the PDF to text. But I can't write a function that automatically takes the given years, creates a data frame, and fills it.
My output from pymupdf is:
Code:
'49\xa0\n\xa0\nTable.1.5: Projected Population by Age and Sex of Punch district of Jammu & Kashmir: 2012-2031. \n \nState: Jammu & Kashmir (01) \nDistrict: Punch (05) \nSingle \nages \n2012 \n2013 \n2014 \n2015 \n2016 \nMales \nFemales \nMales \nFemales \nMales \nFemales \nMales \nFemales \nMales \nFemales \nAll ages \n255403 \n227892 \n258924 \n230862 \n262467 \n233842 \n266030 \n236831 \n269615 \n239830 \n0 \n6810 \n5869 \n7038 \n6027 \n7270 \n6187 \n7507 \n6349 \n7748 \n6514 \n1 \n6504 \n5681 \n6665 \n5782 \n6827 \n5884 \n6993 \n5987 \n7160 \n6090 \n2 \n6295 \n5566 \n6402 \n5624 \n6509 \n5681 \n6618 \n5739 \n6727 \n5796 \n3 \n6169 \n5512 \n6234 \n5539 \n6299 \n5564 \n6364 \n5589 \n6429 \n5613 \n4 \n6109 \n5507 \n6145 \n5513 \n6180 \n5518 \n6213 \n5521 \n6246 \n5523 \n5 \n6104 \n5539 \n6120 \n5534 \n6134 \n5528 \n6147 \n5519 \n6159 \n5509 \n6 \n6139 \n5597 \n6143 \n5589 \n6146 \n5579 \n6147 \n5567 \n6146 \n5553 \n7 \n6199 \n5669 \n6199 \n5664 \n6197 \n5657 \n6194 \n5648 \n6189 \n5637 \n8 \n6270 \n5743 \n6272 \n5746 \n6272 \n5747 \n6270 \n5746 \n6267 \n5744 \n9 \n6339 \n5808 \n6347 \n5821 \n6353 \n5834 \n6357 \n5845 \n6360 \n5855 \n10 \n6422 \n5880 \n6441 \n5909 \n6458 \n5937 \n6474 \n5965 \n6488 \n5991 \n11 \n6537 \n5977 \n6572 \n6027 \n6606 \n6076 \n6639 \n6124 \n6671 \n6172 \n12 \n6513 \n5944 \n6559 \n6006 \n6603 \n6067 \n6647 \n6129 \n6691 \n6190 \n13 \n6275 \n5711 \n6320 \n5771 \n6364 \n5830 \n6407 \n5890 \n6450 \n5949 \n14 \n5902 \n5353 \n5939 \n5402 \n5975 \n5451 \n6011 \n5499 \n6046 \n5547 \n15-19 \n24869 \n22364 \n25038 \n22542 \n25205 \n22718 \n25369 \n22891 \n25531 \n23063 \n20-24 \n21309 \n19495 \n21673 \n19816 \n22040 \n20140 \n22411 \n20466 \n22786 \n20794 \n25-29 \n22624 \n18342 \n23386 \n18663 \n24162 \n18988 \n24953 \n19316 \n25758 \n19646 \n30-34 \n18856 \n16118 \n19264 \n16405 \n19677 \n16695 \n20095 \n16988 \n20520 \n17283 \n35-39 \n15284 \n14341 \n15449 \n14496 \n15615 \n14651 \n15781 \n14806 \n15947 \n14961 \n40-44 \n13835 \n12488 \n14108 
\n12835 \n14384 \n13188 \n14663 \n13545 \n14946 \n13908 \n45-49 \n10667 \n10434 \n10856 \n10676 \n11047 \n10920 \n11241 \n11168 \n11436 \n11418 \n50-54 \n9121 \n8133 \n9255 \n8418 \n9389 \n8708 \n9523 \n9003 \n9659 \n9303 \n55-59 \n5658 \n5621 \n5777 \n5679 \n5897 \n5736 \n6019 \n5794 \n6143 \n5851 \n60-64 \n6342 \n5434 \n6360 \n5545 \n6377 \n5656 \n6392 \n5769 \n6406 \n5883 \n65-69 \n3539 \n2998 \n3575 \n2964 \n3610 \n2929 \n3646 \n2892 \n3681 \n2853 \n70-74 \n3888 \n2975 \n3892 \n3027 \n3895 \n3080 \n3897 \n3133 \n3897 \n3186 \n75-79 \n1511 \n1309 \n1533 \n1311 \n1555 \n1313 \n1577 \n1315 \n1599 \n1316 \n80+ \n3313 \n2483 \n3366 \n2531 \n3420 \n2580 \n3475 \n2630 \n3530 \n2680 \n \n(Contd…) \nState: Jammu & Kashmir (01) \nDistrict: Punch (05) \nSingle \nages \n2017 \n2018 \n2019 \n2020 \n2021 \nMales \nFemales \nMales \nFemales \nMales \nFemales \nMales \nFemales \nMales \nFemales \nAll ages \n272702 \n242405 \n275806 \n244989 \n278926 \n247581 \n282104 \n250180 \n285258 \n252788 \n0 \n7979 \n6669 \n8213 \n6827 \n8450 \n6986 \n8693 \n7148 \n8938 \n7311 \n1 \n7317 \n6184 \n7475 \n6278 \n7636 \n6374 \n7800 \n6470 \n7965 \n6567 \n2 \n6825 \n5843 \n6924 \n5889 \n7023 \n5935 \n7124 \n5982 \n7226 \n6028 \n3 \n6482 \n5626 \n6535 \n5638 \n6587 \n5649 \n6641 \n5659 \n6693 \n5669 \n4 \n6267 \n5514 \n6286 \n5503 \n6305 \n5491 \n6323 \n5478 \n6340 \n5463 \n5 \n6157 \n5488 \n6154 \n5465 \n6150 \n5440 \n6145 \n5414 \n6139 \n5386 \n6 \n6132 \n5528 \n6116 \n5501 \n6099 \n5472 \n6081 \n5442 \n6061 \n5410 \n7 \n6170 \n5615 \n6149 \n5591 \n6127 \n5565 \n6104 \n5538 \n6078 \n5510 \n8 \n6249 \n5730 \n6230 \n5714 \n6210 \n5697 \n6188 \n5679 \n6164 \n5659 \n9 \n6349 \n5853 \n6336 \n5849 \n6322 \n5845 \n6308 \n5839 \n6290 \n5832 \n10 \n6489 \n6005 \n6488 \n6019 \n6486 \n6031 \n6484 \n6043 \n6479 \n6054 \n11 \n6689 \n6209 \n6706 \n6245 \n6723 \n6280 \n6739 \n6315 \n6754 \n6350 \n12 \n6721 \n6240 \n6750 \n6290 \n6778 \n6339 \n6807 \n6389 \n6834 \n6438 \n13 \n6480 \n5997 \n6509 \n6046 
\n6537 \n6094 \n6566 \n6142 \n6593 \n6190 \n14 \n6069 \n5585 \n6091 \n5622 \n6112 \n5659 \n6133 \n5696 \n6153 \n5733 \n15-19 \n25640 \n23191 \n25748 \n23317 \n25852 \n23442 \n25958 \n23565 \n26057 \n23686 \n20-24 \n23120 \n21088 \n23458 \n21383 \n23798 \n21681 \n24145 \n21981 \n24492 \n22283 \n25-29 \n26527 \n19944 \n27309 \n20245 \n28103 \n20547 \n28913 \n20853 \n29733 \n21161 \n30-34 \n20910 \n17550 \n21305 \n17819 \n21704 \n18091 \n22112 \n18364 \n22522 \n18641 \n35-39 \n16082 \n15088 \n16218 \n15216 \n16353 \n15343 \n16490 \n15470 \n16625 \n15596 \n40-44 \n15203 \n14251 \n15463 \n14598 \n15726 \n14950 \n15994 \n15307 \n16263 \n15668 \n45-49 \n11611 \n11651 \n11788 \n11887 \n11967 \n12126 \n12149 \n12367 \n12331 \n12611 \n50-54 \n9778 \n9591 \n9897 \n9883 \n10016 \n10180 \n10138 \n10481 \n10260 \n10786 \n55-59 \n6256 \n5897 \n6371 \n5943 \n6487 \n5989 \n6606 \n6035 \n6724 \n6081 \n60-64 \n6406 \n5987 \n6405 \n6093 \n6403 \n6199 \n6400 \n6307 \n6395 \n6416 \n65-69 \n3710 \n2808 \n3738 \n2761 \n3766 \n2712 \n3795 \n2662 \n3823 \n2610 \n70-74 \n3890 \n3235 \n3881 \n3283 \n3871 \n3332 \n3861 \n3382 \n3849 \n3432 \n75-79 \n1619 \n1315 \n1638 \n1313 \n1657 \n1311 \n1677 \n1309 \n1697 \n1306 \n80+ \n3578 \n2725 \n3627 \n2771 \n3677 \n2818 \n3727 \n2865 \n3778 \n2912 \n \n'
The first part just reads the PDF file and, for some reason, turns it into a string:
Code:
# part 1
import pymupdf
doc = pymupdf.open(r"pop_projection.pdf")
page = doc[0] # page '0' only because I want to make sure it works on a single page first then the rest of the pdf.
text = page.get_text()
print(text)
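For what it's worth, getting a `str` back from `page.get_text()` is expected: by default the call returns the page's plain text. The quoted output above is the `repr` of that string, so each `\n` there stands for a real newline character and `\xa0` for a non-breaking space. Below is a minimal sketch of turning such a string into clean tokens. It uses a shortened sample in place of the real page, and `tokenize` and `sample` are illustrative names, not part of pymupdf.

```python
def tokenize(text):
    """Split extracted page text into clean, non-empty tokens."""
    tokens = []
    for line in text.splitlines():
        # str.strip() removes ordinary spaces and the non-breaking space (\xa0)
        token = line.strip()
        if token:
            tokens.append(token)
    return tokens


# Shortened sample modelled on the start of the page text shown above
sample = (
    "49\xa0\n\xa0\nState: Jammu & Kashmir (01) \nDistrict: Punch (05) \n"
    "Single \nages \n2012 \n"
)
tokens = tokenize(sample)
print(tokens)
```

Working on a token list like this avoids having to reason about trailing spaces and blank lines in every later step.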
Code:
# part 2
# function that turns the previous output into a list:
def extract_states(text):
    # Split the text into lines
    lines = text.split('\\n')
    result = []
    for line in lines:
        # Check if "State:" is present in the line
        if "State:" in line:
            # Find the starting point after "State:"
            start_idx = line.find("State:") + len("State:")
            # Extract the content following "State:" until "(Contd…)" or end of line
            state_data = line[start_idx:].split("(Contd…)")[0].strip()
            # If there is valid data, split it into separate lines and extend the result list
            if state_data:
                result.extend(state_data.split("\n"))
    # Remove any empty strings resulting from splitting
    result = [item.strip() for item in result if item.strip()]
    return result
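A likely reason part 2 misbehaves, assuming the string holds real newlines as the `repr` suggests: `'\\n'` is a two-character string (a backslash followed by `n`), which never occurs in the text, so `text.split('\\n')` returns the whole page as a single element. The loop then finds only the first "State:" and cuts everything off at the first "(Contd…)", so the 2017-2021 section is lost. Here is a rough sketch of one alternative that starts a new block at every "State:" line. `split_blocks` is a hypothetical helper, and the sample is shortened to two years per section.

```python
def split_blocks(text, marker="State:"):
    """Group the page's tokens into blocks, one per table section."""
    # Split on real newlines; strip() also drops the \xa0 padding.
    tokens = [t.strip() for t in text.splitlines() if t.strip()]
    blocks = []
    for token in tokens:
        if token.startswith(marker):
            blocks.append([])          # a new section starts at each "State:" line
        if blocks and not token.startswith("(Contd"):
            blocks[-1].append(token)   # tokens before the first marker are ignored
    return blocks


# Shortened sample: two years per section instead of five
sample = (
    "49\xa0\n\xa0\nTable.1.5: Projected Population \n \n"
    "State: Jammu & Kashmir (01) \nDistrict: Punch (05) \nSingle \nages \n"
    "2012 \n2013 \nMales \nFemales \nMales \nFemales \n"
    "All ages \n255403 \n227892 \n258924 \n230862 \n \n(Contd…) \n"
    "State: Jammu & Kashmir (01) \nDistrict: Punch (05) \nSingle \nages \n"
    "2017 \n2018 \nMales \nFemales \nMales \nFemales \n"
    "All ages \n272702 \n242405 \n275806 \n244989 \n \n"
)

pieces = sample.split('\\n')   # literal backslash + n: nothing to split on
blocks = split_blocks(sample)
print(len(pieces), len(blocks))
```

Each block then begins with the state and district lines, followed by the header and the numbers for its own range of years.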
Code:
# part 3
import pandas as pd

def fill_dataframe(data_list):
    # Initialize column names and DataFrame from the existing create_dataframe function
    columns = [
        'State',
        'District',
        'Single Ages',  # This will be populated from row_data[0]
        f"{data_list[4]}_males", f"{data_list[4]}_females",
        f"{data_list[5]}_males", f"{data_list[5]}_females",
        f"{data_list[6]}_males", f"{data_list[6]}_females",
        f"{data_list[7]}_males", f"{data_list[7]}_females",
        f"{data_list[8]}_males", f"{data_list[8]}_females"
    ]
    # Define the fixed values for 'State' and 'District'
    state = data_list[0]
    district = data_list[1]
    # Initialize an empty list to hold each row as a dictionary
    rows = []
    # Set the start of the data extraction based on the structure you've described
    start_index = 19
    row_size = 11  # Number of elements per row: age label plus 5 male/female pairs
    # Populate rows based on the sequence in data_list
    while start_index < len(data_list):
        # Extract one row of male-female data (11 elements at a time)
        row_data = data_list[start_index:start_index + row_size]
        # Prepare the row dictionary
        row_dict = {
            'State': state,
            'District': district,
            'Single Ages': row_data[0],  # Populate 'Single Ages' with the first element of row_data
            # Populate columns dynamically from row_data
            f"{data_list[4]}_males": row_data[1],
            f"{data_list[4]}_females": row_data[2],
            f"{data_list[5]}_males": row_data[3],
            f"{data_list[5]}_females": row_data[4],
            f"{data_list[6]}_males": row_data[5],
            f"{data_list[6]}_females": row_data[6],
            f"{data_list[7]}_males": row_data[7],
            f"{data_list[7]}_females": row_data[8],
            f"{data_list[8]}_males": row_data[9],
            f"{data_list[8]}_females": row_data[10]
        }
        # Add the row dictionary to the list of rows
        rows.append(row_dict)
        # Move to the next set of data for the following row
        start_index += row_size
    # Create the DataFrame from the list of row dictionaries
    df = pd.DataFrame(rows, columns=columns)
    return df
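Part 3 hard-codes the positions of the years (`data_list[4]` to `data_list[8]`) and the first data row (`start_index = 19`). As a rough sketch of a less brittle variant, the years can be read from the header and the row width derived from them. This assumes every section follows the layout of the sample page: "State:" and "District:" lines first, then "ages", the years, the "Males"/"Females" labels, and data starting at "All ages". `parse_block` is a hypothetical name, and the sample block is shortened to two years and three rows.

```python
def parse_block(block):
    """Turn one token block (a list of strings) into a list of row dicts."""
    state = block[0].split(":", 1)[1].strip()
    district = block[1].split(":", 1)[1].strip()
    # Assumption: the years sit between the "ages" token and the first "Males"
    years = block[block.index("ages") + 1:block.index("Males")]
    width = 1 + 2 * len(years)               # age label + one male/female pair per year
    data = block[block.index("All ages"):]   # assumption: data starts at "All ages"
    rows = []
    # Step through complete rows only, so a truncated tail cannot raise IndexError
    for i in range(0, len(data) - width + 1, width):
        chunk = data[i:i + width]
        row = {"State": state, "District": district, "Single Ages": chunk[0]}
        for j, year in enumerate(years):
            row[f"{year}_males"] = int(chunk[1 + 2 * j])
            row[f"{year}_females"] = int(chunk[2 + 2 * j])
        rows.append(row)
    return rows


# Shortened sample block built from values in the page text above
block = [
    "State: Jammu & Kashmir (01)", "District: Punch (05)", "Single", "ages",
    "2012", "2013", "Males", "Females", "Males", "Females",
    "All ages", "255403", "227892", "258924", "230862",
    "0", "6810", "5869", "7038", "6027",
    "80+", "3313", "2483", "3366", "2531",
]
rows = parse_block(block)
print(rows[0])
```

If pandas is installed, passing the result to `pd.DataFrame(rows)` should give one column per dictionary key.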
Code:
list_1 = extract_states(text)
fill_dataframe(list_1)
[img]https://i.sstatic.net/MBHZrfDp.png[/img]
Where do I need help now? I don't understand why part 2 doesn't work properly.
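Since the page carries two sections (2012-2016 and 2017-2021, separated by "(Contd…)"), one way to end up with a single wide table is to merge the rows of each section by age label. The sketch below assumes each section has already been parsed into a list of dicts keyed like the columns in part 3. `merge_rows` is a hypothetical helper, and the sample rows are cut down to one year each.

```python
def merge_rows(*row_lists):
    """Merge row dicts from several sections into one dict per age group."""
    merged = {}
    for rows in row_lists:
        for row in rows:
            # Key on state, district and age so different districts stay apart
            key = (row.get("State"), row.get("District"), row["Single Ages"])
            merged.setdefault(key, {}).update(row)
    # dicts keep insertion order, so rows come out in their original order
    return list(merged.values())


# Sample rows using values from the page text above (one year per section)
rows_2012 = [
    {"Single Ages": "All ages", "2012_males": 255403, "2012_females": 227892},
    {"Single Ages": "0", "2012_males": 6810, "2012_females": 5869},
]
rows_2017 = [
    {"Single Ages": "All ages", "2017_males": 272702, "2017_females": 242405},
    {"Single Ages": "0", "2017_males": 7979, "2017_females": 6669},
]
merged = merge_rows(rows_2012, rows_2017)
print(merged[0])
```

The merged list can then be turned into a data frame in one step, for example with `pd.DataFrame(merged)`.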
More details here: https://stackoverflow.com/questions/791 ... aphic-data