Slow rename phase when Spark SQL writes to Hive

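Background

Hive version: 1.2.1; Spark version: 2.3.0

A Spark job that deduplicates 200 million records took 12.5 h in total: 4 h for the deduplication itself, 2.5 h during which it was unclear what Spark was doing (no logs on the driver and none on the executors), and 6 h for the rename operations.

Excerpt from the rename log: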

2019-09-19 22:34:22,097 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00002-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00002-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,111 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00003-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00003-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,128 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00004-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00004-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,143 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00005-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00005-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,160 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00006-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00006-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,175 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00007-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00007-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,192 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00008-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00008-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,207 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00009-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00009-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,223 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00010-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00010-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,238 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00011-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00011-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,253 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00012-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00012-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,267 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00013-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00013-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,281 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00014-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00014-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,296 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00015-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00015-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,315 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00016-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00016-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,331 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00017-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00017-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,345 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00018-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00018-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,361 [Driver] INFO  hive.ql.metadata.Hive  - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00019-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00019-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
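
To put a number on the excerpt above, the following sketch (Python, standard library only) parses the timestamps of the 18 log lines and computes the average gap between consecutive renames. The timestamps are copied from the log. The helper names `mean_gap_ms` and `estimated_renames` are made up for illustration, and the extrapolation to the 6 h rename phase assumes every rename costs about as much as in this excerpt, which the log alone does not confirm.

```python
from datetime import datetime

# Timestamps copied from the rename log excerpt (all on 2019-09-19).
STAMPS = [
    "22:34:22,097", "22:34:22,111", "22:34:22,128", "22:34:22,143",
    "22:34:22,160", "22:34:22,175", "22:34:22,192", "22:34:22,207",
    "22:34:22,223", "22:34:22,238", "22:34:22,253", "22:34:22,267",
    "22:34:22,281", "22:34:22,296", "22:34:22,315", "22:34:22,331",
    "22:34:22,345", "22:34:22,361",
]


def mean_gap_ms(stamps):
    """Average time between consecutive log entries, in milliseconds."""
    times = [datetime.strptime(s, "%H:%M:%S,%f") for s in stamps]
    total_ms = (times[-1] - times[0]).total_seconds() * 1000.0
    return total_ms / (len(times) - 1)


def estimated_renames(hours, gap_ms):
    """Number of sequential renames that fit into `hours` at `gap_ms` each.

    Assumption: the per-rename cost stays constant for the whole phase.
    """
    return hours * 3600 * 1000.0 / gap_ms


gap = mean_gap_ms(STAMPS)
print(f"mean gap between renames: {gap:.1f} ms")
print(f"renames that fit into 6 h: {estimated_renames(6, gap):,.0f}")
```

Under these assumptions each rename takes roughly 15.5 ms, so a 6 h rename phase would correspond to something on the order of 1.4 million files renamed one after another on the driver.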

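When Spark SQL runs a Hive SQL job:

  1. It first creates a temporary folder whose name starts with .hive-staging inside the target table (since version 1.2.1 the default location is the target table's folder) and writes the results there.
  2. Once execution finishes, it renames the contents of the temporary folder into the corresponding target table folder.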

[Screenshot of the code: 企业微信截图 15689482919715.png]

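The code shows two strategies. If the source directory and the destination directory share the same root directory, a copy operation is performed for every file under the source directory. Otherwise, a rename operation is performed, which only touches NameNode metadata and involves no extra data operations.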
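
The difference in cost between the two branches can be illustrated on a local filesystem. The sketch below is not Hive's code: `move_staged_output` and `make_files` are hypothetical helpers, local directories stand in for HDFS, and "same root directory" is interpreted as "the staging folder lies under the table folder", which is an assumption. Whatever the per-file call is in Hive, the point here is only how many filesystem calls each branch issues, so the sketch uses a plain per-file move.

```python
import os
import tempfile
from pathlib import Path


def move_staged_output(src: Path, dest: Path, table_root: Path) -> int:
    """Move the files under src into dest and return the number of move calls."""
    if table_root in src.parents:
        # Staging folder shares the table's root: one call per file.
        dest.mkdir(parents=True, exist_ok=True)
        calls = 0
        for f in sorted(src.iterdir()):
            os.replace(f, dest / f.name)
            calls += 1
        return calls
    # Staging folder lives elsewhere: a single directory rename.
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.rename(src, dest)
    return 1


def make_files(folder: Path, n: int) -> None:
    """Create n small part files to play the role of the job output."""
    folder.mkdir(parents=True)
    for i in range(n):
        (folder / f"part-{i:05d}").write_text("row\n")


with tempfile.TemporaryDirectory() as root:
    table = Path(root) / "warehouse" / "partitioned"

    # Case 1: staging folder inside the table folder.
    inside = table / ".hive-staging_demo" / "-ext-10000" / "date=2018-02-24"
    make_files(inside, 18)
    dest_a = table / "date=2018-02-24"
    calls_inside = move_staged_output(inside, dest_a, table)
    files_a = len(list(dest_a.iterdir()))

    # Case 2: staging folder outside the table folder.
    outside = (Path(root) / "tmp" / "hive" / ".hive-staging_demo"
               / "-ext-10000" / "date=2018-02-25")
    make_files(outside, 18)
    dest_b = table / "date=2018-02-25"
    calls_outside = move_staged_output(outside, dest_b, table)
    files_b = len(list(dest_b.iterdir()))

print(calls_inside, files_a)   # prints: 18 18
print(calls_outside, files_b)  # prints: 1 18
```

Both cases end with 18 files in the partition folder, but the first needs one call per file while the second needs a single call regardless of how many files the job produced.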

Solution

Change the following parameters in hive-site.xml:

<property>
	<name>hive.exec.stagingdir</name>
	<value>/tmp/hive/.hive-staging</value>
	<description>Location of the temporary folder created by Hive jobs</description>
</property>
<property>
	<name>hive.insert.into.multilevel.dirs</name>
	<value>true</value>
	<description>When hive.insert.into.multilevel.dirs is false, the parent directory of the insert target directory must exist; when it is true, the parent directory may be absent</description>
</property>
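
Before rolling the change out, it can help to check what a given hive-site.xml actually contains. The sketch below uses Python's standard `xml.etree.ElementTree` to read the name/value pairs and to flag a staging directory that would still end up under the target table. The `<configuration>` wrapper is the usual root element of hive-site.xml; `load_properties` and `staging_is_outside_table` are names invented for this example, and treating any value that is neither an absolute path nor a full URI as "inside the table" is an assumption based on the default behaviour described earlier.

```python
import xml.etree.ElementTree as ET

# The two properties from the fragment above, wrapped in a root element.
HIVE_SITE = """<configuration>
<property>
	<name>hive.exec.stagingdir</name>
	<value>/tmp/hive/.hive-staging</value>
</property>
<property>
	<name>hive.insert.into.multilevel.dirs</name>
	<value>true</value>
</property>
</configuration>"""


def load_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


def staging_is_outside_table(props):
    """True if the staging dir is an absolute path or a full URI.

    Assumption: a relative value such as the default `.hive-staging`
    is resolved under the target table folder.
    """
    staging = props.get("hive.exec.stagingdir", ".hive-staging")
    return staging.startswith("/") or "://" in staging


props = load_properties(HIVE_SITE)
for name, value in props.items():
    print(f"{name} = {value}")
print("staging outside table:", staging_is_outside_table(props))
```

Running the same check against a file on disk only requires reading it into a string first and passing that to `load_properties`.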
