Pandas slows way down after processing 10,000 rows

Posted by Anonymous » 08 Nov 2024, 04:13

I'm working on a small function that does a simple cleanup of a CSV file using pandas. Here is the code:


import numpy as np
import pandas as pd
from datetime import datetime


def clean_charges(conn, cur):
    charges = pd.read_csv('csv/all_charges.csv', parse_dates=['CreatedDate', 'PostingDate',
                                                              'PrimaryInsurancePaymentPostingDate',
                                                              'SecondaryInsurancePaymentPostingDate',
                                                              'TertiaryInsurancePaymentPostingDate'])

    # Split charges into 10 equal sized dataframes
    num_splits = 10
    charges_split = np.array_split(charges, num_splits)

    # Midnight on the first day of the current month
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())

    count = 0
    total = 0
    for cur_charge in charges_split:
        for index, charge in cur_charge.iterrows():
            # Progress output every 1,000 rows
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= cur_month:
                count += 1
                charges.drop(index, inplace=True)
                continue
            # Delete the payments if they were applied in the current month
            if charge['PrimaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['PrimaryInsuranceInsurancePayment']
                charge['PrimaryInsurancePayment'] = 0
            if charge['SecondaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['SecondaryInsuranceInsurancePayment']
                charge['SecondaryInsurancePayment'] = 0
            if charge['TertiaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['TertiaryInsuranceInsurancePayment']
                charge['TertiaryInsurancePayment'] = 0
            # Delete duplicate payments
            if charge['AdjustedCharges'] - (charge['PrimaryInsuranceInsurancePayment'] + charge['SecondaryInsuranceInsurancePayment'] +
                                            charge['TertiaryInsuranceInsurancePayment'] + charge['PatientPaymentAmount']) != charge['TotalBalance']:
                charge['SecondaryInsurancePayment'] = 0

    charges = pd.concat(charges_split)

    charges.to_csv('csv/updated_charges.csv', index=False)
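A side observation on the loop above, offered as a minimal sketch and not as a diagnosis of the slowdown: with columns of mixed types, `iterrows()` hands back each row as a separate Series, so assignments such as `charge['TotalBalance'] = ...` change that temporary object and not the dataframe. The two-row frame and its values below are made up for illustration; only the column names come from the post.

```python
import pandas as pd

# Hypothetical two-row frame; only the column names are taken from the post.
df = pd.DataFrame({
    "PostingDate": pd.to_datetime(["2024-10-15", "2024-11-03"]),
    "TotalBalance": [100.0, 50.0],
})

for index, row in df.iterrows():
    # `row` is a temporary Series built from mixed-dtype columns,
    # so this assignment never reaches `df`.
    row["TotalBalance"] = 0.0

unchanged = df["TotalBalance"].tolist()

# Writing through the frame itself (here with a boolean mask) does persist.
df.loc[df["PostingDate"] >= pd.Timestamp("2024-11-01"), "TotalBalance"] = 0.0
updated = df["TotalBalance"].tolist()

print(unchanged, updated)
```

If this applies to the real data, the payment adjustments in the loop would not show up in `updated_charges.csv` regardless of how fast the loop runs.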

The full all_charges.csv file has about 270,000 rows, but I've run into a problem: it processes the first 10,000 rows very quickly and then slows down dramatically. It takes roughly 5 seconds for the first 10,000 rows, then about 2 minutes for every 1,000 rows after that. This happened both when I processed the full set as a single dataframe and after I split it into 10, as you can see in my code now. I don't see anything that would cause this. My code may not be 100% optimized, but I don't feel like I'm doing anything incredibly stupid. My computer is also only at 15% CPU usage and 40% memory usage, so I don't think it's a hardware problem.
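One plausible contributor, which the post itself does not confirm: every `charges.drop(index, inplace=True)` call operates on the full 270,000-row frame, not on the current chunk, so each call does work proportional to the size of the whole frame. That would also be consistent with the split into 10 making no difference. The sketch below uses a small made-up frame and two hypothetical helper names to contrast repeated single-row drops with one boolean-mask filter; both produce the same rows.

```python
import numpy as np
import pandas as pd


def drop_one_by_one(frame, labels):
    """Remove rows with repeated in-place drops, as the loop in the post does."""
    frame = frame.copy()
    for label in labels:
        # Each call processes the whole frame to remove a single row.
        frame.drop(label, inplace=True)
    return frame


def drop_with_mask(frame, mask):
    """Remove every flagged row in one filtering step."""
    return frame[~mask]


# Hypothetical data: 1,000 rows, every tenth one flagged for removal.
frame = pd.DataFrame({"value": np.arange(1_000)})
mask = frame["value"] % 10 == 0

slow = drop_one_by_one(frame, frame.index[mask])
fast = drop_with_mask(frame, mask)

print(len(slow), len(fast))
```

Timing both helpers on a frame closer to the real size (for example with `time.perf_counter`) would show whether the per-row drops account for the 2 minutes per 1,000 rows.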
I would appreciate any help figuring out why this is running so slowly!
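For comparison, here is a minimal sketch of how the first two cleanup rules could be written column-wise, without `iterrows()` or per-row `drop`. It rests on assumptions not settled by the post: the function name `clean_charges_vectorized` and its `payers` parameter are hypothetical, each payer is assumed to have a single payment column named `<Payer>InsurancePayment` (the loop reads `...InsuranceInsurancePayment` but writes `...InsurancePayment`, so the real name may differ), and the duplicate-payment rule is left out for brevity.

```python
import pandas as pd


def clean_charges_vectorized(charges, cur_month,
                             payers=("Primary", "Secondary", "Tertiary")):
    """Apply the current-month rules to whole columns (sketch, names assumed)."""
    # Rule 1: drop charges posted in the current month. Written as ~(>=) so
    # rows with a missing posting date are kept, as they are in the loop.
    out = charges[~(charges["PostingDate"] >= cur_month)].copy()

    for payer in payers:
        date_col = f"{payer}InsurancePaymentPostingDate"
        pay_col = f"{payer}InsurancePayment"  # assumed column name
        recent = out[date_col] >= cur_month
        # Rule 2: add current-month payments back to the balance, then zero them.
        out["TotalBalance"] = out["TotalBalance"] + out[pay_col].where(recent, 0)
        out[pay_col] = out[pay_col].mask(recent, 0)
    return out


# Made-up sample with only the primary payer, to keep the example small.
cur_month = pd.Timestamp("2024-11-01")
sample = pd.DataFrame({
    "PostingDate": pd.to_datetime(["2024-10-05", "2024-11-02", "2024-09-20"]),
    "PrimaryInsurancePaymentPostingDate": pd.to_datetime(
        ["2024-11-03", "2024-11-04", "2024-10-01"]),
    "PrimaryInsurancePayment": [40.0, 10.0, 25.0],
    "TotalBalance": [60.0, 5.0, 0.0],
})

cleaned = clean_charges_vectorized(sample, cur_month, payers=("Primary",))
print(cleaned)
```

In the sample, the second row is removed because it was posted in the current month, and the first row gets its current-month payment added back to `TotalBalance`. The input frame is left unmodified because the function works on a copy.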


More details here: https://stackoverflow.com/questions/79168379/pandas-slowing-way-down-after-processing-10-000-rows