Spark Databricks: Stream-Stream LeftOuter join returning an empty result

Guest

Posted by Guest » 04 Mar 2024, 18:46

Databricks, with Delta Live Tables, Spark 3.4

I have a streaming dataframe (let's call it "original") containing some records. I then filter this table based on some conditions, modify some column values and get a new "modified" dataframe.

I want to merge these two dataframes so that the records in "modified" replace the corresponding records in "original". My current approach is to "subtract" the modified dataframe from the original one, and then union the result with the "modified" dataframe.

I have an ID field for each of the records.
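To make the intended result concrete, here is a minimal sketch of the "subtract, then union" idea in plain Python on in-memory rows. It is not Spark code and says nothing about streaming behaviour; it only illustrates the batch semantics being aimed for. The ID and ID_mod column names follow the post, while the helper names and sample rows are hypothetical.

```python
def left_anti(original, modified, key="ID", mod_key="ID_mod"):
    """Rows of `original` whose key has no match in `modified`."""
    modified_ids = {row[mod_key] for row in modified}
    return [row for row in original if row[key] not in modified_ids]


def left_outer_then_filter_null(original, modified, key="ID", mod_key="ID_mod"):
    """Same rows, obtained via a left outer join followed by an IS NULL filter."""
    joined = []
    for row in original:
        matches = [m for m in modified if m[mod_key] == row[key]]
        if matches:
            joined.extend((row, m) for m in matches)
        else:
            joined.append((row, None))  # unmatched row, right side padded with NULL
    return [row for row, m in joined if m is None]


def replace_modified(original, modified, key="ID", mod_key="ID_mod"):
    """Subtract the modified IDs from `original`, then union with `modified`."""
    kept = left_anti(original, modified, key, mod_key)
    # Rename ID_mod back to ID so both sides share a schema before the union.
    renamed = [{(key if k == mod_key else k): v for k, v in m.items()}
               for m in modified]
    return kept + renamed


# Hypothetical sample rows
original = [{"ID": 1, "value": "a"}, {"ID": 2, "value": "b"}, {"ID": 3, "value": "c"}]
modified = [{"ID_mod": 2, "value": "B"}]

result = replace_modified(original, modified)
```

On bounded data the two "subtract" helpers return the same rows, which is why a left outer join plus a null filter looks like a natural stand-in for a left anti join.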

I soon realized that what I want can be done with either PySpark's subtract() function or a left anti join. However, neither is supported when the right-side dataframe is a streaming one. So, I tried to replicate a left anti join with a left outer join:

    subtracted = original.join(modified, original['ID'] == modified['ID_mod'], 'leftOuter') \
        .where(modified['ID_mod'].isNull()).select(original['*'])

However, I then got an error saying that stream-stream left outer joins are only supported with watermarks and a time range. So, following Spark's documentation, I did the following:

    @dlt.table
    def final_records():
        # origTime and modTime are two timestamp columns
        original = dlt.readStream("original_table").withWatermark('origTime', '2 hours')
        modified = dlt.readStream("modified_table").withWatermark('modTime', '3 hours')

        # Should give me original without modified records
        subtracted = original.join(modified, expr("""
            ID = ID_mod AND
            modTime >= origTime AND
            modTime

Source: https://stackoverflow.com/questions/77996877/spark-databricks-stream-stream-leftouter-join-returning-an-empty-result
