Python pd: DataFrame processing is very slow

Posted by Anonymous » 17 Jun 2025, 13:27

Hi, I'm looking for suggestions to make my Python script more efficient. By the end of the year I expect the input Excel file to double in size, so I need to find a way to speed up the processing and bring the run time down to, at the very least, 5 minutes.

This function takes a template (plantilla_df) with all the columns and the format I want for the output Excel. If a column exists in both the input and the template, it simply copies the data from the input file into the template. The remaining columns are filled in by helper functions such as these:

```python
def get_bos_values(Bos):
    """Look up ACCOUNT and ACCOUNT NAME for a BOS code in bos_mapping_df."""
    Bos = str(Bos).strip()
    mapping_row = bos_mapping_df.loc[bos_mapping_df['BOS'] == Bos]
    if not mapping_row.empty:
        account_value = str(mapping_row.iloc[0]['ACCOUNT']).strip()
        account_name_value = str(mapping_row.iloc[0]['ACCOUNT NAME']).strip()
        return account_value, account_name_value
    return None, None

def get_type(CdG):
    """Return the Target_value of the first Source_value contained in CdG."""
    CdG = str(CdG).strip()
    for i, source_value in enumerate(type_mapping_df['Source_value']):
        source_value = str(source_value).strip()
        if source_value in CdG:
            return type_mapping_df.iloc[i]['Target_value']
    return None

def determine_inter_branch_flag(Type, Producto, group_code, interbusiness_mapping_df):
    if not Type:
        return None
    elif Type not in ['Assets', 'Liabilities']:
        return 'No - non Assets & Liability'
    elif Type in ['Assets', 'Liabilities']:
        if str(Producto).startswith('LIF'):
            return 'No - LIF'
        else:
            if group_code in interbusiness_mapping_df['Source_value'].values:
                return 'Yes'
            else:
                return 'No'
    return None
```
```python
import os

import pandas as pd
from tkinter import messagebox  # assumed source of messagebox

def create_excel():
    try:
        output_df = plantilla_df.copy()
        output_rows = []

        # Build the output one row at a time, one template column at a time
        for index, row in combined_m1.iterrows():
            temp_row = {}
            for col_name in plantilla_df.columns:
                if col_name in combined_m1.columns:
                    temp_row[col_name] = row[col_name]

                elif col_name == 'CdG':
                    temp_row['CdG'], temp_row['CdG desc'] = get_bos_values(row['Cuenta'])

                elif col_name == 'Type':
                    temp_row['Type'] = get_type(temp_row['CdG'])

                elif col_name == 'inter-branch flag':
                    temp_row['inter-branch flag'] = determine_inter_branch_flag(temp_row['Type'], row['Product Reference'], row['Group Code'], interbusiness_mapping_df)

            output_rows.append(temp_row)

        output_df = pd.DataFrame(output_rows, columns=plantilla_df.columns)

        template_directory = os.path.dirname(template_file)

        output_path_temp = os.path.join(template_directory, f'M1_{date_final}.xlsx')
        output_df.to_excel(output_path_temp, index=False)

    except Exception as e:
        messagebox.showerror("Error", f"Failed to create excel:\n{str(e)}")
```
Would it help to have all the mappings in the same DataFrame, or to combine all the mapping lookups into one function? Do you have any ideas on how to improve the processing time?
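One direction that might be worth trying, sketched here with plain dictionaries and made-up sample values: build each lookup table once, outside the row loop, so that every row costs a dictionary or set lookup rather than a scan of a mapping DataFrame. The names BOS_LOOKUP, TYPE_RULES, INTERBUSINESS_CODES, lookup_type, inter_branch_flag and enrich_row are hypothetical, and the "Liabilities" value is an assumption chosen to match the check in determine_inter_branch_flag.

```python
# Hypothetical sample mappings; in the real script these would be built once
# from bos_mapping_df, type_mapping_df and interbusiness_mapping_df.
BOS_LOOKUP = {"01": ("PG1", "Debts"), "02": ("PG2", "Taxes")}
TYPE_RULES = [("PG1", "Liabilities"), ("PG2", "Loss")]  # (source, target)
INTERBUSINESS_CODES = {"B1"}  # a set gives constant-time membership tests


def lookup_type(cdg):
    """Return the target of the first rule whose source is a substring of cdg."""
    if cdg is None:
        return None
    for source, target in TYPE_RULES:
        if source in cdg:
            return target
    return None


def inter_branch_flag(type_, product, group_code):
    """Same decision order as determine_inter_branch_flag, using the set."""
    if not type_:
        return None
    if type_ not in ("Assets", "Liabilities"):
        return "No - non Assets & Liability"
    if str(product).startswith("LIF"):
        return "No - LIF"
    return "Yes" if group_code in INTERBUSINESS_CODES else "No"


def enrich_row(row):
    """Add the derived columns to one input row (a plain dict)."""
    cdg, cdg_desc = BOS_LOOKUP.get(str(row["Cuenta"]).strip(), (None, None))
    type_ = lookup_type(cdg)
    flag = inter_branch_flag(type_, row["Product Reference"], row["Group Code"])
    return {**row, "CdG": cdg, "CdG desc": cdg_desc, "Type": type_,
            "inter-branch flag": flag}


rows = [
    {"ID": "AA1", "Cuenta": "01", "Group Code": "B1", "Product Reference": "101"},
    {"ID": "AA2", "Cuenta": "02", "Group Code": "B2", "Product Reference": "102"},
]
enriched = [enrich_row(r) for r in rows]
```

This keeps the row-by-row structure of the original loop, so it is a smaller change than a full rewrite, but it still runs Python code for every row.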
The data looks like this:

```
template
ID ; Cuenta ; CdG ; CdG desc ; Type ; inter-branch flag ; Group Code ; Product Reference ...
(Everything is empty; NaN)

combined_m1
ID  ; Cuenta ; StartDate ; EndDate ; Group Code ; Product Reference
AA1 ; 01     ; 250612    ; 250920  ; B1         ; 101
AA2 ; 02     ; 250101    ; 250910  ; B2         ; 102
...

bos_mapping_df
BOS ; ACCOUNT ; ACCOUNT NAME
01  ; PG1     ; Debts
02  ; PG2     ; Taxes

type_mapping_df
SOURCE_VALUE ; TARGET_VALUE
PG1          ; Liability
PG2          ; Loss

Expected outcome
ID  ; Cuenta ; CdG ; CdG desc ; Type      ; inter-branch flag           ; Group Code ; Product Reference ...
AA1 ; 01     ; PG1 ; Debts    ; Liability ; Yes                         ; B1         ; 101 ...
AA2 ; 02     ; PG2 ; Taxes    ; Loss      ; No - non Assets & Liability ; B2         ; 102 ...
```
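As a rough sketch of how the same columns could be produced without a Python loop over rows, the lookups can be expressed as one left merge, one map, and a few boolean masks. The frames below are modelled on the sample data; the contents of interbusiness_mapping_df are not shown in the post, so they are an assumption here, as is the "Liabilities" spelling (chosen so that it matches the check in determine_inter_branch_flag). The map step does an exact match, which is a simplification of the substring test in get_type.

```python
import pandas as pd

# Sample frames modelled on the data above (partly assumed, see lead-in).
combined_m1 = pd.DataFrame({
    "ID": ["AA1", "AA2"],
    "Cuenta": ["01", "02"],
    "Group Code": ["B1", "B2"],
    "Product Reference": ["101", "102"],
})
bos_mapping_df = pd.DataFrame({
    "BOS": ["01", "02"],
    "ACCOUNT": ["PG1", "PG2"],
    "ACCOUNT NAME": ["Debts", "Taxes"],
})
type_mapping_df = pd.DataFrame({
    "Source_value": ["PG1", "PG2"],
    "Target_value": ["Liabilities", "Loss"],
})
interbusiness_mapping_df = pd.DataFrame({"Source_value": ["B1"]})

# One left join replaces the per-row get_bos_values() lookup.
# Real data may need the key columns cast to str and stripped first.
bos = bos_mapping_df.rename(
    columns={"BOS": "Cuenta", "ACCOUNT": "CdG", "ACCOUNT NAME": "CdG desc"}
)
out = combined_m1.merge(bos, on="Cuenta", how="left")

# Exact-match lookup of Type from CdG.
type_lookup = dict(zip(type_mapping_df["Source_value"],
                       type_mapping_df["Target_value"]))
out["Type"] = out["CdG"].map(type_lookup)

# Boolean masks for each branch of the if/elif chain.
is_al = out["Type"].isin(["Assets", "Liabilities"])
is_lif = out["Product Reference"].astype(str).str.startswith("LIF")
in_group = out["Group Code"].isin(interbusiness_mapping_df["Source_value"])

# Assign from lowest to highest priority so later rules override earlier ones.
flag = pd.Series("No", index=out.index, dtype=object)
flag.loc[in_group] = "Yes"
flag.loc[is_lif] = "No - LIF"
flag.loc[~is_al] = "No - non Assets & Liability"
flag.loc[out["Type"].isna()] = None
out["inter-branch flag"] = flag

print(out)
```

To match the template's column order, the result could then be reindexed with out.reindex(columns=plantilla_df.columns).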
Any suggestions are welcome!

Profiling does not work with the 150k-line Excel file; the only log I could get was from the sample data (21 lines):
```
=============================================================
228    155.8 MiB    155.8 MiB           1   @profile(stream=log_file)
229                                         def create_excel():
230    155.8 MiB      0.0 MiB           1       try:
231    155.8 MiB      0.0 MiB           1           output_df = plantilla_df.copy()
232    155.8 MiB      0.0 MiB           1           output_rows = []
233    155.9 MiB      0.0 MiB          21           for (index, row) in combined_m1fmis.iterrows():
234    155.8 MiB      0.0 MiB          20               temp_row = {}
235    155.8 MiB      0.0 MiB        1160               for col_name in plantilla_df.columns:
236    155.8 MiB      0.0 MiB        1140                   if (col_name in combined_m1.columns):
237    155.8 MiB      0.0 MiB         880                       temp_row[col_name] = row[col_name]
238    155.8 MiB      0.0 MiB         260                   elif (col_name == 'Branch'):
239    155.8 MiB      0.0 MiB          20                       temp_row['Branch'] = get_branch_value(row['Codigo'])
240    155.8 MiB      0.0 MiB         240                   elif (col_name == 'CdG'):
241    155.8 MiB      0.0 MiB          20                       (temp_row['CdG'], temp_row['CdG desc']) = get_bos_values(row['Cuenta'])
242    155.8 MiB      0.0 MiB         220                   elif (col_name == 'Type'):
243    155.8 MiB      0.0 MiB          20                       temp_row['Type'] = get_type(temp_row['CdG'])
244    155.8 MiB      0.0 MiB         200                   elif (col_name == 'BB code'):
245    155.8 MiB      0.0 MiB          20                       (temp_row['BB code'], temp_row['BB Desc']) = get_bb(temp_row['CdG'])
246    155.8 MiB      0.0 MiB         180                   elif (col_name == 'inter-branch flag'):
247    155.8 MiB      0.0 MiB          20                       temp_row['inter-branch flag'] = determine_inter_branch_flag(temp_row['Type'], row['Product Reference'], row['Group Code'], interbusiness_mapping_df)
248    155.8 MiB      0.0 MiB         160                   elif (col_name == 'local account'):
249    155.8 MiB      0.0 MiB          20                       account_number = str(row['Numero Cuenta'])
250    155.8 MiB      0.0 MiB          20                       temp_row['local account'] = account_number[4:(- 3)]
251    155.8 MiB      0.0 MiB         140                   elif (col_name == 'Customer Name'):
252    155.8 MiB      0.0 MiB          20                       temp_row['Customer Name'] = get_customer_Name(row['Cliente'])
253    155.8 MiB      0.0 MiB         120                   elif (col_name == 'Murex check'):
254    155.8 MiB      0.0 MiB          20                       temp_row['check'] = get_check(row['Folder'])
255    155.8 MiB      0.0 MiB         100                   elif (col_name == 'FX'):
256    155.8 MiB      0.0 MiB          20                       temp_row['FX'] = get_fx(row['Amount1'], row['Amount2'])
257    155.8 MiB      0.0 MiB          80                   elif (col_name == 'KEY 1'):
258    155.8 MiB      0.0 MiB          20                       temp_row['KEY 1'] = f"{temp_row['Branch']} - {row['Contrato']}"
259    155.8 MiB      0.0 MiB          60                   elif (col_name == 'KEY 2'):
260    155.8 MiB      0.0 MiB          20                       cuenta_value = (row['Ref'] if row['Ref'] else row['Product Reference'])
261    155.8 MiB      0.0 MiB          20                       temp_row['KEY 2'] = f"{temp_row['Branch']} - {cuenta_value}"
262    155.8 MiB      0.0 MiB          20               output_rows.append(temp_row)
263    155.9 MiB      0.1 MiB           1           output_df = pd.DataFrame(output_rows, columns=plantilla_df.columns)
264    155.9 MiB      0.0 MiB           1           template_directory = os.path.dirname(template_file)
265    155.9 MiB      0.0 MiB           1           output_path_temp = os.path.join(template_directory, f'M1_{date_final}.xlsx')
266    156.0 MiB      0.0 MiB           1           output_df.to_excel(output_path_temp, index=False)
267    156.0 MiB      0.0 MiB           1           wb = load_workbook(output_path_temp)
268    156.0 MiB      0.0 MiB           1           ws = wb.active
269    156.0 MiB      0.0 MiB           1           plantilla_wb = load_workbook(template_file)
270    156.0 MiB      0.0 MiB           1           plantilla_ws = plantilla_wb['Template']
271    156.0 MiB      0.0 MiB          58           for col in range(1, (len(plantilla_df.columns) + 1)):
272    156.0 MiB      0.0 MiB          57               header_fill = plantilla_ws.cell(row=1, column=col).fill
273    156.0 MiB      0.0 MiB          57               new_fill = PatternFill(start_color=header_fill.start_color, end_color=header_fill.end_color, fill_type=header_fill.fill_type)
274    156.0 MiB      0.0 MiB          57               ws.cell(row=1, column=col).fill = new_fill
275    156.0 MiB      0.0 MiB           1           output_path = os.path.join(template_directory, f'M1_{date_final}.xlsx')
276    156.0 MiB      0.0 MiB           1           wb.save(output_path)
277    156.0 MiB      0.0 MiB           1           wb.close()
278    156.0 MiB      0.0 MiB           2           messagebox.showinfo('Success', f'''M1 Created:
279    156.0 MiB      0.0 MiB           1   {output_path}''')
280                                             except Exception as e:
281                                                 messagebox.showerror('Error', f'''Failed to create excel:
282                                         {str(e)}''')
```
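The log above reports memory per line, not time, so it cannot show which step is slow. A minimal sketch of timing each stage with the standard library follows; the helper timed, the timings dictionary and the two stand-in stage functions are hypothetical and only illustrate where the real row-building and to_excel calls would go.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Stand-ins for the real stages (building the rows, writing the workbook).
def build_rows(n):
    return [{"ID": i, "Cuenta": f"{i % 2 + 1:02d}"} for i in range(n)]


def save_rows(rows):
    return sum(len(r) for r in rows)  # placeholder for output_df.to_excel(...)


with timed("build"):
    rows = build_rows(1_000)
with timed("save"):
    save_rows(rows)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f} s")
```

Because this only wraps whole stages, it adds almost no overhead and should still work on the full 150k-line input.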
I put together a minimal reproducible example, and it looks like the problem is in saving the Excel file. I changed everything to merge and SQL-style syntax and it looks faster, but it breaks when saving: the processing takes 8 seconds, but saving the Excel file is breaking the machine lol
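If saving is the bottleneck, one thing that might help is avoiding the second pass in which the freshly written file is reopened with load_workbook only to colour the header row. The sketch below assumes openpyxl is installed and uses its write-only mode to stream rows and style the header in a single pass. The header names, fill colour, sheet name and sample rows are hypothetical stand-ins, and whether this is fast enough for 150k rows would need to be measured.

```python
import os
import tempfile

from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
from openpyxl.styles import PatternFill

# Hypothetical stand-ins for the real header, fill colour and rows.
header = ["ID", "Cuenta", "CdG"]
header_fill = PatternFill(start_color="FFD9E1F2", end_color="FFD9E1F2",
                          fill_type="solid")
data_rows = [["AA1", "01", "PG1"], ["AA2", "02", "PG2"]]

# Write-only mode streams rows to disk instead of keeping every cell in memory.
wb = Workbook(write_only=True)
ws = wb.create_sheet("M1")

# In write-only mode, styled cells are created as WriteOnlyCell objects.
styled_header = []
for name in header:
    cell = WriteOnlyCell(ws, value=name)
    cell.fill = header_fill
    styled_header.append(cell)
ws.append(styled_header)

# With a DataFrame, the rows could come from
# output_df.itertuples(index=False, name=None).
for row in data_rows:
    ws.append(list(row))

output_path = os.path.join(tempfile.mkdtemp(), "M1_sample.xlsx")
wb.save(output_path)
```

A write-only workbook can only be saved once and cells cannot be revisited after they are appended, so any formatting has to be applied as the rows are written.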

More details here: https://stackoverflow.com/questions/79666627/python-pd-data-frame-processing-very-slow


