I have already created a news table in Hive and stored it as Parquet.
+----------+----------------------------------------------------+----------------------------------------------------+-----------------+----------------------------------------------------+----------------+-------------+
| news.id | news.link | news.headline | news.category | news.short_description | news.authors | news.date |
+----------+----------------------------------------------------+----------------------------------------------------+-----------------+----------------------------------------------------+----------------+-------------+
| 5 | https://www.huffingtonpost.com/entry/7-amazing-name-generators-photos_us_5bacfb45e4b04234e85527ea | 7 Amazing Name Generators (PHOTOS) | COMEDY | Let's be honest: most of our names are pretty boring. Glenn Close? John Mayer? Barack Obama? C'mon, we can do better than | Seena Vali | 2012-01-28 |
| 4 | https://www.huffingtonpost.comhttp://www.refinery29.com/1-girl-4-looks-lily-kwong | 1 Girl, 4 Looks: Meet NYC's Hottest New Model, Lily Kwong | STYLE & BEAUTY | From Refinery29: When we tell you Lily Kwong is everything, we're not using the same inflection as when, say, we're talking | | 2012-01-28 |
| 3 | https://www.huffingtonpost.com/entry/girl-with-the-dragon-tattoo-india_us_5bb3e3f1e4b066f8d2513c8e | 'Girl With the Dragon Tattoo' India Release Canceled After Local Censor Board Deemed Film 'Unsuitable' | ENTERTAINMENT | "Sony Pictures will not be releasing The Girl with the Dragon Tattoo in India. The Censor Board has adjudged the film unsuitable | | 2012-01-28 |
| 2 | https://www.huffingtonpost.com/entry/dont-think-the-chemical-brothers-concert_us_5bb21e37e4b0171db69d5bf6 | 'Don't Think': A Look At The Chemical Brothers' Concert Film, Set To Hit Theaters | CULTURE & ARTS | Amid cheers and the occasional "Here we go!" from the theater's speakers, the duo danced alone for a few songs. Eventually | Kia Makarechi | 2012-01-28 |
| 1 | https://www.huffingtonpost.com/entry/black-smoker-vents-new-species_us_5bb10820e4b09bbe9a596463 | 'Black Smoker' Vents: New Species Discovered Near Deepest Undersea Hot Springs (PHOTOS) | ENVIRONMENT | Photos and captions courtesy of University of Southampton and NOC. Connelly's co-leader, marine biologist Dr. Jon Copley | | 2012-01-28 |
+----------+----------------------------------------------------+----------------------------------------------------+-----------------+----------------------------------------------------+----------------+-------------+
But when I use Spark to read from the Hive table, every row is just a copy of the column headers (I dropped the id and date columns):
+----+--------+--------+-----------------+-------+
|link|headline|category|short_description|authors|
+----+--------+--------+-----------------+-------+
|link|headline|category|short_description|authors|
|link|headline|category|short_description|authors|
|link|headline|category|short_description|authors|
|link|headline|category|short_description|authors|
|link|headline|category|short_description|authors|
+----+--------+--------+-----------------+-------+
This is my Spark code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType, DateType

spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("Read Hive Table 2") \
    .config("hive.metastore.warehouse.dir", "hdfs://namenode:9000/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# Expected schema of the news table (declared here but not passed to the reader)
schema = StructType([
    StructField("id", LongType(), True),
    StructField("link", StringType(), True),
    StructField("headline", StringType(), True),
    StructField("category", StringType(), True),
    StructField("short_description", StringType(), True),
    StructField("authors", StringType(), True),
    StructField("date", DateType(), True)
])

# Define the JDBC URL for HiveServer2
jdbc_url = "jdbc:hive2://hiveserver2:10000/vulm4_db"
username = "hive"  # Replace with your username
password = "password"  # Replace with your password

# Load the Hive table data over JDBC
# (header, encoding, multiline and nullValue are CSV-reader options, kept as in my attempt)
df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "news") \
    .option("user", username) \
    .option("password", password) \
    .option("header", "false") \
    .option("encoding", "UTF-8") \
    .option("multiline", "true") \
    .option("nullValue", "NA") \
    .load()

# Drop the id and date columns and show the first five rows
df.drop("id").drop("date").show(5)

By the way, I am using the News Category Dataset from Kaggle.
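A possible explanation for the output above (an assumption, nothing in the post confirms it): Spark's generic JDBC source wraps column names in double quotes when it builds its SELECT, and Hive reads a double-quoted token as a string literal, so each row comes back as the column names themselves. The sketch below only reproduces that symptom locally with Python's built-in sqlite3, using single quotes as a stand-in for "quoted as a literal" and backticks for "quoted as an identifier". The helper build_select and the two sample rows are hypothetical and are not part of Spark, Hive, or the dataset.

```python
import sqlite3


def build_select(columns, table, quote_char):
    """Build a SELECT that wraps every column name in quote_char (hypothetical helper)."""
    quoted = ", ".join(f"{quote_char}{name}{quote_char}" for name in columns)
    return f"SELECT {quoted} FROM {table}"


# Tiny in-memory stand-in for the news table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (link TEXT, headline TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?)",
    [
        ("https://example.com/a", "First headline"),
        ("https://example.com/b", "Second headline"),
    ],
)

columns = ["link", "headline"]

# Single quotes make SQLite treat the names as string literals, which mimics
# how Hive is assumed to treat the double-quoted names in this scenario.
literal_rows = conn.execute(build_select(columns, "news", "'")).fetchall()

# Backticks are accepted as identifier quotes by SQLite, so real data comes back.
identifier_rows = conn.execute(build_select(columns, "news", "`")).fetchall()

print(literal_rows)
print(identifier_rows)
```

If this is indeed the cause, one thing to try is reading the table through the metastore instead of JDBC, for example spark.table("vulm4_db.news"), since the session already calls enableHiveSupport(). That assumes Spark can actually reach the Hive metastore, which the configuration shown here does not guarantee.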
More details here: https://stackoverflow.com/questions/791 ... rnal-table