I'm building an ETL pipeline that fetches data from an API, transforms it, and loads it into MongoDB. I'm using PySpark to distribute the work, and I'm trying to use Airflow to automate it.

However, I've hit a roadblock. After setting up Spark, Airflow, and everything else, my Spark job finally runs, but it fails with this error:
[code]
[2024-09-22, 13:13:46 UTC] {spark_submit.py:488} INFO - Driver stacktrace:
[2024-09-22, 13:13:46 UTC] {spark_submit.py:488} INFO - 24/09/22 13:13:46 INFO TaskSetManager: Lost task 6.2 in stage 0.0 (TID 19) on 172.21.0.6, executor 0: java.io.InvalidClassException (org.apache.spark.scheduler.Task; local class incompatible: stream classdesc serialVersionUID = -6188971942555435033, local class serialVersionUID = 553100815431272095) [duplicate 19]
[2024-09-22, 13:13:46 UTC] {spark_submit.py:488} INFO - 24/09/22 13:13:46 INFO TaskSetManager: Lost task 5.3 in stage 0.0 (TID 20) on 172.21.0.9, executor 1: java.io.InvalidClassException (org.apache.spark.scheduler.Task; local class incompatible: stream classdesc serialVersionUID = -6188971942555435033, local class serialVersionUID = 553100815431272095) [duplicate 20]
[2024-09-22, 13:13:46 UTC] {spark_submit.py:488} INFO - 24/09/22 13:13:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - 24/09/22 13:13:46 INFO DAGScheduler: Job 0 failed: foreach at /opt/***/AstroMRS/movies.py:25, took 3.357094 s
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - Traceback (most recent call last):
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/opt/***/AstroMRS/movies.py", line 36, in
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - etl('/movie/popular', 500, 'movies_collection')
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/opt/***/AstroMRS/movies.py", line 25, in etl
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - rdd.foreach(lambda page:
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - ^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 1766, in foreach
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2316, in count
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2291, in sum
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2044, in fold
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 1833, in collect
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - File "/home/***/.local/lib/python3.12/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
[2024-09-22, 13:13:47 UTC] {spark_submit.py:488} INFO - : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 16) (172.21.0.9 executor 1): java.io.InvalidClassException: org.apache.spark.scheduler.Task; local class incompatible: stream classdesc serialVersionUID = -6188971942555435033, local class serialVersionUID = 553100815431272095
[/code]
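For context, here is a simplified sketch of movies.py, reconstructed from the traceback above (only the `etl` skeleton and the `rdd.foreach` call at line 25 are confirmed by the traceback; the body of `fetch_transform_load` is a placeholder for the real API/Mongo logic, not my exact code):

[code]
# movies.py -- simplified; skeleton matches the traceback, the body of
# fetch_transform_load stands in for the real API/MongoDB logic.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AstroMRS").getOrCreate()
sc = spark.sparkContext

def fetch_transform_load(endpoint, page, collection):
    # placeholder: the real job requests one page from the API,
    # reshapes the JSON, and inserts it into MongoDB via pymongo
    pass

def etl(endpoint, pages, collection):
    # one RDD element per API page, so executors fetch pages in parallel
    rdd = sc.parallelize(range(1, pages + 1))
    # movies.py:25 -- the action that triggers the failing stage
    rdd.foreach(lambda page: fetch_transform_load(endpoint, page, collection))

etl('/movie/popular', 500, 'movies_collection')
[/code]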
Here is how I built my Docker setup:

[code]
version: '3'

x-airflow-common: &airflow-common
  build:
    context: .
    dockerfile: Dockerfile
  env_file:
    - airflow.env
  volumes:
    - .:/opt/airflow/AstroMRS
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    - postgres
  networks:
    - movie-net
[/code]
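Given that the `serialVersionUID` mismatch is on `org.apache.spark.scheduler.Task`, my suspicion is that the PySpark version installed in the Airflow image does not match the Spark version running on the executor containers. This quick check (my own diagnostic snippet, not part of the project) can be run inside each container to compare the client side:

[code]
# version_check.py -- run inside the Airflow container; the version
# printed here must exactly match the Spark version on the workers,
# otherwise the driver and executors deserialize Task with different
# serialVersionUIDs, producing the InvalidClassException above.
import pyspark
print("PySpark version:", pyspark.__version__)
[/code]

On the worker containers, `spark-submit --version` reports the cluster-side Spark version for comparison.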