gene1 < /th>
gene2 < /th>
оценка < /th>
< /tr>
< /thead>
< /thead>
/>Gene9123
0.999706
Gene5219
Gene9161
0.999691
< /tr>
gene27 < /td>
gene6467 < /td>
td> >0.9964646
< /tr>
gene3255 < /td>
gene3255 < /tr>
gene3255 < /tr /> Gene4865 < /td>
0.999636
< /tr>
gene2512 < /td>
gene5730 < /td>
> 0,996605605605605605605605660560560560560560560560560. /> < /tr>
...> used_genes
id
name
used_genes
1 < /td>
комплекс 1 < /td>
[gene3629, gene8048, gene9660, gene4180, gene1 ...] < /td>
< /tr>
< /td> < /td> < /td> < /td> /> комплекс 2 < /td>
[Gene3944, Gene931, Gene3769, Gene7523, Gene61 ...] < /td>
< /tr>
3 < /td>
. /> [Gene8236, Gene934, Gene5902, Gene165, Gene664 ...] < /td>
< /tr>
4 < /td>
комплекс 4 < /td>
[gene299, gene299, gene299, gene299, gene299, gene299, gene299, < /td>
[gene299, . Gene8932, Gene6670, Gene2 ...] < /td>
< /tr>
5 < /td>
комплекс 5 < /td>
[gene3860, gene5792, gene9214, gene717, gene3860, gene5792, gene9214, gene717. /> < /tr>
< /tbody>
< /table> < /div>
Что я делаю: < /p>
Итерирую по каждой строке золотого стандарта (по каждому комплексу).
< /li>
Проверяю, какие пары генов комплекса встречаются в отсортированных по корреляции парах, и вычисляю точность, полноту и площадь под PR-кривой.
Это и есть оценка. < /p>
< /li>
< /ul>
< /th>
/> auc_score < /th>
< /tr>
< /thead>
MultiSubunit Actr Coactivator Complex < /td>
[CREBBP, KAT2B, NCOA3, ep300] < /> [BREBBP, KAT2B, NCOA3, EP300] />0.001695
< /tr>
Condensin I Complex < /td>
[Smc4, NCAPH, SMC2, NCAPG, NCAPD2] < /td>
функциональный />
BLOC-2 (biogenesis of lysosome-related organel...)
[HPS3, HPS5, HPS6]
0.000529
NCOR complex
[TBL1XR1, NCOR1, TBL1X, GPS2, HDAC3, CORO2A] < /td>
td> >0.000839
< /tr>
bloc-1 (биогенез лизосома /> [dtnbp1, snapin, bloc1s6, bloc1s1, bloc1s5, bl ...] < /td>
td> >0.002227
< /tr>
< /tbody>
< /table> < /div>
Так что в конце для каждого комплекса я получаю оценку PR-AUC. < /P>
Я поделюсь своей функцией ниже; со стековым DataFrame примерно на 100 млн строк и 5 тысячами терминов она работает около 25 минут, и я пытаюсь найти способ уменьшить это время. Думаю, проблема в итерации. < /P>
Код: Выделить всё
from sklearn import metrics
def compute_per_complex_pr(corr_df, terms_df):
    """Attach a per-complex precision-recall AUC column to *terms_df*.

    Parameters
    ----------
    corr_df : pd.DataFrame
        Square gene-by-gene correlation matrix (index/columns are gene names).
    terms_df : pd.DataFrame
        One row per complex; its ``used_genes`` column holds the member genes.

    Returns
    -------
    pd.DataFrame
        *terms_df* (mutated in place) with a new ``auc_score`` column. The
        score is NaN when no member gene appears in the pairwise table, no
        pair has both genes in the complex, or fewer than two candidate
        pairs exist.
    """
    pairwise_df = binary(corr_df)
    pairwise_df = quick_sort(pairwise_df).reset_index(drop=True)

    # Precompute a mapping from each gene to the positional row indices in
    # the pairwise DataFrame where it appears.
    gene_to_pair_indices = {}
    for i, (gene_a, gene_b) in enumerate(zip(pairwise_df["gene1"], pairwise_df["gene2"])):
        gene_to_pair_indices.setdefault(gene_a, []).append(i)
        gene_to_pair_indices.setdefault(gene_b, []).append(i)

    # Pull the gene columns out as plain numpy arrays once: indexing these
    # inside the loop is far cheaper than materializing a sub-DataFrame per
    # complex with .loc and doing pandas Series arithmetic on it.
    gene1_arr = pairwise_df["gene1"].to_numpy()
    gene2_arr = pairwise_df["gene2"].to_numpy()

    # One AUC slot per complex; stays NaN unless a curve can be computed.
    auc_scores = np.full(len(terms_df), np.nan)

    # Iterate positionally (enumerate) rather than with iterrows():
    # iterrows() yields index *labels*, and auc_scores[idx] then misaligns
    # (or raises) whenever terms_df does not carry a default 0..n-1 index.
    for pos, used_genes in enumerate(terms_df["used_genes"]):
        gene_set = set(used_genes)

        # Union of all pairwise-row indices touching any member gene.
        candidate_indices = set()
        for gene in gene_set:
            candidate_indices.update(gene_to_pair_indices.get(gene, []))
        if not candidate_indices:
            continue
        # Sort so the candidates keep the global score-descending order.
        candidates = np.fromiter(candidate_indices, dtype=np.intp,
                                 count=len(candidate_indices))
        candidates.sort()

        # A pair counts as a positive prediction iff BOTH genes belong to
        # the complex; otherwise 0.
        members = list(gene_set)
        predictions = (np.isin(gene1_arr[candidates], members)
                       & np.isin(gene2_arr[candidates], members)).astype(np.int64)
        total_positives = predictions.sum()
        if total_positives == 0:
            continue

        # Cumulative true positives down the score-ranked list give the
        # precision/recall curve directly.
        tp_cumsum = predictions.cumsum()
        precision = tp_cumsum / np.arange(1, len(predictions) + 1)
        recall = tp_cumsum / total_positives
        if len(recall) < 2:
            # metrics.auc needs at least two points to integrate.
            continue
        auc_scores[pos] = metrics.auc(recall, precision)

    # Add the computed AUC scores to the terms DataFrame.
    terms_df["auc_score"] = auc_scores
    return terms_df
def binary(corr):
    """Flatten a square correlation matrix into a long pair table.

    Produces one row per (gene1, gene2) pair with its correlation under a
    ``score`` column; mirror duplicates ((a, b) vs (b, a)) are collapsed to
    a single row via drop_mirror_pairs.
    """
    long_form = (
        corr.stack()
        .rename_axis(index=["gene1", "gene2"])
        .reset_index(name="score")
    )
    return drop_mirror_pairs(long_form)
def quick_sort(df, ascending=False):
    """Return *df* reordered by its "score" column (descending by default).

    Despite the name this is just an argsort over the score values; the
    result carries a fresh 0..n-1 RangeIndex.
    """
    sign = 1 if ascending else -1
    ranking = np.argsort(sign * df["score"].values)
    return df.iloc[ranking].reset_index(drop=True)
def drop_mirror_pairs(df):
    """Collapse mirrored gene pairs so (a, b) and (b, a) become one row.

    Each pair is rewritten in lexicographic order (mutating *df*'s gene
    columns in place), then later occurrences of the same ordered pair are
    discarded — the earliest row wins.
    """
    ordered = np.sort(df[["gene1", "gene2"]].to_numpy(), axis=1)
    df.loc[:, "gene1"] = ordered[:, 0]
    df.loc[:, "gene2"] = ordered[:, 1]
    mirrors = df.duplicated(subset=["gene1", "gene2"], keep="first")
    return df.loc[~mirrors]
< /code>
для фиктивных данных (Corr Matrix, terms_df) < /p>
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(0)

# -------------------------------
# Create the 10,000 x 10,000 correlation matrix
# -------------------------------
num_genes = 10000
genes = [f"Gene{i}" for i in range(num_genes)]
rand_matrix = np.random.uniform(-1, 1, (num_genes, num_genes))
# Averaging with the transpose makes the matrix symmetric, like a real
# correlation matrix; the diagonal is then forced to exactly 1.0.
corr_matrix = (rand_matrix + rand_matrix.T) / 2
np.fill_diagonal(corr_matrix, 1.0)
corr_df = pd.DataFrame(corr_matrix, index=genes, columns=genes)

# Build 5,000 dummy "complex" terms, each with 10-40 randomly drawn
# member genes (no repeats within a term).
num_terms = 5000
terms_list = []
for i in range(1, num_terms + 1):
    # Randomly choose a number of genes between 10 and 40 for this term
    n_genes = np.random.randint(10, 41)
    used_genes = np.random.choice(genes, size=n_genes, replace=False).tolist()
    term = {
        "id": i,
        "name": f"Complex {i}",
        "used_genes": used_genes
    }
    terms_list.append(term)
terms_df = pd.DataFrame(terms_list)

# Display sample outputs (for verification, you might want to show the first few rows)
print("Correlation Matrix Sample:")
print(corr_df.iloc[:5, :5])  # print a 5x5 sample
print("\nTerms DataFrame Sample:")
print(terms_df.head())
Подробнее здесь: https://stackoverflow.com/questions/795 ... -dataframe