Built-in pandas function for cell-level apply with index/column awareness? - Цифровое Кемерово

Posted by Anonymous » 25 Nov 2024, 23:03

I am cleaning historical data for exponential smoothing forecasting. I have data at the US county level (i.e., the second-level administrative division), but it contains many zero values (due to low volume), which causes problems for the forecasting model.
Because the data is highly seasonal, I check each county's data for each year. If the data for a given year contains zeroes, I replace that county's data for that year with an adjusted dataset that applies state-level seasonality to the county-level annual volume.
After a lot of trial and error trying to avoid iteration, which led me to nested apply functions with reset indexes (e.g. df.apply(lambda x: x.reset_index().apply(lambda y: [calculation]))), I ended up writing the data scrub with iteration that calculates the seasonality and then multiplies it by a dataframe that stores the annual volume in monthly columns:
# Initialize empty seasonality df with the index and column values from the source data
cty_season = pd.DataFrame(index=cty_data.index, columns=cty_data.columns)

# Iterate through the index and columns to populate each value
for idx in cty_season.index:
    for col in cty_season.columns:
        cty_season.loc[idx,col] = [calculation referring to helper dfs with identical indices and columns]

# Combine seasonality data with sales totals to get revised dataset
cty_adj = cty_season * cty_annual

Is there a more efficient or more "pandas-ic" way (or whatever the pandas equivalent of Pythonic is) to do this? The only thing that comes to mind is to break out the columns so that each year is its own row, which could allow a simpler apply statement, since the replacement is done by year.
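The reshaping idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the frame `wide`, its values, and the names `by_year`, `valid`, `annual` and `share` are made up for the example, and the column labels are assumed to follow a `Y<n>Q<n>` pattern. Once each year is its own row, the "does this year contain a zero" check and the share-of-year calculation become plain row-wise operations with no inner apply.

```python
import pandas as pd

# Hypothetical toy data: two counties, quarterly columns for two years
wide = pd.DataFrame(
    [[10, 20, 30, 40, 0, 5, 0, 15],
     [4, 4, 6, 6, 1, 2, 3, 4]],
    index=pd.Index(['County 1', 'County 2'], name='County'),
    columns=['Y1Q1', 'Y1Q2', 'Y1Q3', 'Y1Q4', 'Y2Q1', 'Y2Q2', 'Y2Q3', 'Y2Q4'],
)

# Split labels such as 'Y1Q3' into a (Year, Period) column MultiIndex
wide.columns = pd.MultiIndex.from_arrays(
    [wide.columns.str[:2], wide.columns.str[2:]], names=['Year', 'Period'])

# Move the Year level into the rows: one row per (County, Year),
# one column per Period (same layout that wide.stack('Year') produces here)
by_year = wide.T.unstack('Year').T

# A year is usable only if none of its periods is zero
valid = by_year.ne(0).all(axis=1)

# Annual total and each period's share of that total, per (County, Year) row
annual = by_year.sum(axis=1)
share = by_year.div(annual, axis=0)
```

With this layout, the replacement step reduces to overwriting the rows where `valid` is False with a fallback seasonality that has the same row and column labels.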

Edit: here is a sample of the data scrub routine. As I suggested above, for this particular use case the answer is probably to break each year out into its own row. However, I have run into this scenario in other use cases that may not have the same solution. The only difference in this code is that I usually pivot sales records to create the dataframe, so the NaN values are already in the df, whereas in this example I have to replace 0 with NaN.
import pandas as pd
import numpy as np

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(data,
                  index=pd.MultiIndex.from_product([['County 1','County 2','County 3'],['A','B','C']],names=['County','Product']),
                  columns=pd.Series(['Y1Q1','Y1Q2','Y1Q3','Y1Q4','Y2Q1','Y2Q2','Y2Q3','Y2Q4'],name='Quarter'))

# Roll up totals by product
tot_df = df.groupby('Product').sum()

# Find out how many non-zero data points there should be per year
# (this is done to allow for YTD analysis instead of assuming each year should have 4 quarterly points or 12 monthly points)
# There is an assumption that the data doesn't have any zeroes at the total level
tot_values = tot_df.apply(lambda x: x.groupby(x.index.str[:2]).count(),axis=1)

# Calculate seasonality/share of year for each product, each year
tot_season = tot_df.apply(lambda x: x.reset_index().apply(lambda y: y[x.name]/x[x.index.str[:2]==y.Quarter[:2]].sum(),axis=1),axis=1)
tot_season.columns = tot_df.columns

# Look for zeroes to determine if the data for a particular product and year can be used
cty_valid = df.replace({0:np.nan}).apply(lambda x: x.groupby(x.index.str[:2]).count().eq(tot_values.loc[x.name[-1]]),axis=1)

# Total up annual numbers by county/product.
# These numbers are repeated at the quarterly level so that the annual data
# can be directly multiplied with the county seasonality to be generated
cty_annual = df.apply(lambda x: x.groupby(x.index.str[:2]).sum(),axis=1)
cty_annual.columns = [x + 'Q1' for x in cty_annual.columns]
cty_annual = cty_annual.reindex(columns=df.columns).ffill(axis=1)

# Create a dataframe with the needed index and columns
cty_season = pd.DataFrame(index=df.index,columns=df.columns)

# Iterate through each county/product and period combination to populate the dataframe
for idx in cty_season.index:
    for col in cty_season.columns:
        # Use the actual seasonality (period sales / annual sales) if the year has non-zero values for that product/county.
        # If not, use the seasonality calculated at the total level for that product
        cty_season.loc[idx,col] = df.loc[idx,col]/cty_annual.loc[idx,col] if cty_valid.loc[idx,col[:2]] else tot_season.loc[idx[-1],col]

# Multiply the seasonality df with the annual sales df to get an adjusted sales history.
cty_adj = cty_season * cty_annual
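One possible loop-free version of the routine above is sketched below. It is not a built-in cell-level apply. It relies on the observation that every helper frame can be given the same index and columns as `df`, after which `DataFrame.where` picks between two aligned frames cell by cell. The helper names `by_year` and `per_county` are hypothetical, and the sketch assumes, as the sample does, that the first two characters of each column label identify the year.

```python
import numpy as np
import pandas as pd

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(
    data,
    index=pd.MultiIndex.from_product(
        [['County 1', 'County 2', 'County 3'], ['A', 'B', 'C']],
        names=['County', 'Product']),
    columns=pd.Index(['Y1Q1', 'Y1Q2', 'Y1Q3', 'Y1Q4',
                      'Y2Q1', 'Y2Q2', 'Y2Q3', 'Y2Q4'], name='Quarter'))

# Year label of every quarter column, e.g. 'Y1Q3' -> 'Y1'
year = df.columns.str[:2].to_numpy()

def by_year(frame, how):
    """Aggregate each row within a year and repeat the result on every quarter column."""
    return frame.T.groupby(year).transform(how).T

def per_county(product_frame):
    """Repeat product-level rows so they line up with df's (County, Product) index."""
    products = df.index.get_level_values('Product')
    return product_frame.loc[products].set_axis(df.index)

# Product-level totals, their share of year, and the expected data points per year
tot_df = df.groupby('Product').sum()
tot_season = tot_df / by_year(tot_df, 'sum')
tot_values = by_year(tot_df, 'count')

# County-level annual totals and the validity flag, both shaped like df
cty_annual = by_year(df, 'sum')
cty_valid = by_year(df.replace(0, np.nan), 'count').eq(per_county(tot_values))

# Keep the county's own seasonality where the year is valid, else the product-level one
cty_season = (df / cty_annual).where(cty_valid, per_county(tot_season))
cty_adj = cty_season * cty_annual
```

Because the condition and both alternatives share `df`'s labels, the same `where` pattern applies to other cell-level rules as long as each helper can first be broadcast to the target shape.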

More details here: https://stackoverflow.com/questions/79215537/built-in-pandas-function-for-cell-level-apply-with-index-column-awareness
