Problems loading large data from an MSSQL database: Kubeflow/Python + Pandas + Dask



Post by Anonymous » 11 Jan 2025, 21:48

I would like to draw on the community's experience to solve my current problem with loading large data/tables (about 5,000,000 rows) from an MSSQL database.
Setup (which I cannot change):

0 GPU
4000 CPU
15.0 Gi memory

My SQL code is stored as a .sql file in the project folder.
I started with a chunk size of 500,000 rows, but that crashed the kernel. I then tried 250,000 with the same result. I am now at 100,000, and it still crashes.
Due to company policy, I have to establish the initial connection to the database as shown below, which works:

Code:

# Connection to MSSQL with Kerberos + pyodbc
import os

import pyodbc


def mssql_conn_kerberos(server, driver, trusted_connection, trust_server_certificate, kerberos_cmd):
    # Build the ODBC connection string once so both attempts use the same settings
    conn_str = (
        f"DRIVER={driver};"
        f"SERVER={server};"
        f"Trusted_Connection={trusted_connection};"
        f"TrustServerCertificate={trust_server_certificate}"
    )

    # Run Kerberos for authentication
    os.system(kerberos_cmd)

    try:
        # First connection attempt
        c_conn = pyodbc.connect(conn_str)
    except pyodbc.Error:
        # Re-run Kerberos and retry the connection once
        os.system(kerberos_cmd)
        c_conn = pyodbc.connect(conn_str)

    print("Pyodbc connection ready.")

    return c_conn  # Connection to the database
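Dask's `read_sql_query` takes a connection URL string rather than an open pyodbc connection, so it may help to express the same ODBC settings as a SQLAlchemy URL. The following is a minimal sketch, assuming the `mssql+pyodbc:///?odbc_connect=` pass-through form described in the SQLAlchemy documentation. `build_mssql_url` is a hypothetical helper, and the server and driver values are placeholders that do not come from the post.

```python
from urllib.parse import quote_plus


def build_mssql_url(server, driver, trusted_connection="yes", trust_server_certificate="yes"):
    """Return a SQLAlchemy URL string wrapping a raw ODBC connection string."""
    odbc_str = (
        f"DRIVER={driver};"
        f"SERVER={server};"
        f"Trusted_Connection={trusted_connection};"
        f"TrustServerCertificate={trust_server_certificate}"
    )
    # The ODBC string must be URL-encoded before it is embedded in the URL
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)


# Placeholder values, not taken from the original post
connection_url = build_mssql_url("my-server.example.com", "{ODBC Driver 18 for SQL Server}")
print(connection_url)
```

The Kerberos command would presumably still have to run before any process opens a connection through this URL, since the URL itself carries no credentials.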

Next, I have a function that reads and runs my SQL query (which is in a .sql file saved in the project folder):

Code:

import os
import time

import pandas as pd


def call_my_query(path_to_query, query_name, chunk, connection):
    # Read the SQL text from the .sql file in the project folder
    file_path = os.path.join(path_to_query, query_name)
    with open(file_path, "r") as file:
        query = file.read()

    # SQL processing in chunks + timing
    chunks = []
    start_time = time.time()

    # Every chunk is kept in the list, so all rows end up in memory at once
    for x in pd.read_sql_query(query, connection, chunksize=chunk):
        chunks.append(x)

    # Concatenate the chunks into a single DataFrame
    df = pd.concat(chunks, ignore_index=True)

    # Process end time
    end_time = time.time()

    print("Data loaded successfully!")
    print(f"Processed {len(df)} rows in {end_time - start_time:.2f} seconds")

    return df
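Because the loop appends every chunk to a list and then concatenates them, the full result set still has to fit in memory, and `pd.concat` briefly needs room for both the chunks and the combined frame. That may be why smaller chunk sizes do not help. One possible alternative is to stream batches straight to disk. The following is a minimal sketch under stated assumptions: it relies on the DB-API `fetchmany` call, which pyodbc cursors also provide; `stream_query_to_csv` is a hypothetical helper; and the in-memory SQLite table in the usage lines is only a stand-in for the MSSQL connection.

```python
import csv
import os
import sqlite3
import tempfile


def stream_query_to_csv(connection, query, out_path, batch_size=50_000):
    """Write query results to a CSV file batch by batch; return the row count."""
    cursor = connection.cursor()
    cursor.execute(query)
    total_rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # cursor.description holds one tuple per column; item 0 is the column name
        writer.writerow([col[0] for col in cursor.description])
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            writer.writerows(rows)
            total_rows += len(rows)
    cursor.close()
    return total_rows


# Stand-in database so the sketch runs without MSSQL
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER, value TEXT)")
demo_conn.executemany("INSERT INTO demo VALUES (?, ?)", [(i, f"row{i}") for i in range(10)])

out_file = os.path.join(tempfile.mkdtemp(), "demo.csv")
written = stream_query_to_csv(demo_conn, "SELECT id, value FROM demo", out_file, batch_size=4)
print(written)
```

With this approach only one batch is held in memory at a time. The resulting file could then be processed lazily, for example with `dd.read_csv`.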

I also tried running this task through Dask by modifying the "call_my_query" function, but for some reason Dask has problems with pyodbc.
Alternative version of "call_my_query" for Dask:

Code:

import os
import time

import dask.dataframe as dd
import sqlalchemy


def call_my_query_dask(query_name, chunk, connection, index_col):
    # Note: path_to_query and connection_url are not parameters of this function;
    # they are read from the enclosing scope

    # Load query from file
    file_path = os.path.join(path_to_query, query_name)
    with open(file_path, "r") as file:
        query_original = file.read()

    # Convert the SQL string/text (this is the call that raises the error)
    query = sqlalchemy.select(query_original)

    # Start timing the process
    start_time = time.time()

    # Use Dask to read the SQL query in partitions
    print("Executing query and loading data with Dask...")
    df_dask = dd.read_sql_query(
        sql=query,
        con=connection_url,
        npartitions=10,
        index_col=index_col,
    )

    # Process end time
    end_time = time.time()
    print("Data loaded successfully!")
    print(f"Processed approximately {df_dask.shape[0].compute()} rows in {end_time - start_time:.2f} seconds")

    return df_dask

This raises the following error:

Code:

Textual column expression 'SELECT\n\t[COL1]\n\t, [COL...' should be explicitly declared with text('SELECT\n\t[COL1]\n\t, [COL...'), or use literal_column('SELECT\n\t[COL1]\n\t, [COL...') for more specificity
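The message suggests that `sqlalchemy.select()` was handed a raw string where it expects column expressions. One way to turn raw SQL into a selectable that can still be filtered and limited is to wrap it as a named subquery and select from that. The following is a minimal sketch, assuming SQLAlchemy 1.4 or newer. The query text and the column names `id` and `value` are placeholders, and this has not been run against Dask or MSSQL.

```python
import sqlalchemy as sa

# Placeholder for the text read from the .sql file
raw_sql = "SELECT id, value FROM my_table"

# Declare the raw SQL as text, name its result columns, and alias it as a subquery
src = sa.text(raw_sql).columns(sa.column("id"), sa.column("value")).subquery("src")

# A regular SELECT over the subquery; unlike plain text it supports .where() and .limit()
query = sa.select(src.c.id, src.c.value)

print(query)
```

An object like `query` could then be passed as `sql=` to `dd.read_sql_query`, together with a URL string for `con` and an `index_col` that names a column of the result, such as `id` here. If the .sql file ends with an `ORDER BY` clause, SQL Server generally rejects it inside a subquery, so that clause would likely have to be dropped first.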

Maybe the solution is simple and I am just looking in the wrong direction.
Thanks to everyone for any advice, hints, or potential solutions!

More details here: https://stackoverflow.com/questions/79348687/problems-to-load-large-data-from-mssql-database-kubeflow-python-pandas-das
