A Roundup of Common Hadoop Errors and Their Solutions [Updated Regularly]

Posted on 2014-4-29 14:59:45
Last edited by 锋云帮主 on 2014-4-29 15:15

This thread collects common errors (and their fixes) encountered in day-to-day development and operations. It will be kept up to date, and additions from everyone are welcome.

Error 1: java.io.IOException: Incompatible clusterIDs, which typically appears after the namenode is reformatted

2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
Cause: every namenode format generates a fresh clusterID, but the datanodes' data directories still carry the clusterID from the previous format. Formatting wipes the namenode's data while leaving the datanodes' data untouched, so the datanodes fail to start. The rule: before every format, clear out everything under the data directories.

Solution: stop the cluster and delete everything under the problem node's data directory (the dfs.data.dir directory configured in hdfs-site.xml), then reformat the namenode.

A less drastic alternative: stop the cluster first, then edit the clusterID in the datanode's dfs/data/current/VERSION file to match the namenode's.
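A minimal sketch of that VERSION-file fix, reusing the clusterID and the /data/dfs/data path from the log above; the namenode directory /data/dfs/name is an assumption, so substitute your own dfs.name.dir:

# On the namenode: read the authoritative clusterID (dfs.name.dir location assumed).
grep clusterID /data/dfs/name/current/VERSION
# On each affected datanode, with the cluster stopped, adopt that clusterID:
sed -i 's/^clusterID=.*/clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb/' /data/dfs/data/current/VERSION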
OP | Posted on 2014-4-29 17:18:44
Last edited by 锋云帮主 on 2014-4-29 17:28

Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: clock skew between the namenode and the datanodes.

Solution: synchronize every datanode with the namenode. Run ntpdate time.nist.gov on each server and confirm the clocks now agree.
Better still, add a line like this to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
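To verify the fix, a quick clock comparison across the nodes helps. A sketch assuming passwordless SSH; the hostnames follow the hadoop-master/hadoop-datanode1 naming seen in the logs and are otherwise placeholders:

# Print each node's epoch time; after syncing, offsets should be well under a second.
for host in hadoop-master hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
    echo -n "$host: "; ssh "$host" date +%s
done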
OP | Posted on 2014-5-6 15:14:57
Last edited by 锋云帮主 on 2014-5-6 15:55

Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.

Solution:
Edit hdfs-site.xml and add the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:
    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>
    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>
Note: both timeouts are in milliseconds; 0 means no limit.

Comment (2014-5-9 16:50):
Setting it to 0 still produces this error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel
OP | Posted on 2014-5-6 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: the file operation outlived its lease; in practice, the file was deleted while the data stream was still working on it.

Solution:
Edit hdfs-site.xml (this applies to 2.x; on 1.x the property is named dfs.datanode.max.xcievers):
<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>
Copy the file to every datanode and restart the datanodes.
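A sketch of that rollout, assuming a tarball install under /usr/local/hadoop and the hypothetical datanode hostnames used above; the paths and hostnames are placeholders for your own, and on 1.x the daemon script lives elsewhere:

# Push the updated hdfs-site.xml to each datanode and bounce its daemon.
for host in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
    scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml "$host:/usr/local/hadoop/etc/hadoop/"
    ssh "$host" '/usr/local/hadoop/sbin/hadoop-daemon.sh stop datanode; /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode'
done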
OP | Posted on 2014-5-7 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: writes could not proceed. My environment has 3 datanodes and the replication factor is set to 3, so every write builds a pipeline across all 3 machines. The default replace-datanode-on-failure policy is DEFAULT: when the cluster has 3 or more datanodes, the client looks for another datanode to substitute into the pipeline. With only 3 machines there is no spare, so as soon as any one datanode has a problem, writes can never succeed.

Solution: edit hdfs-site.xml, adding or changing these two properties:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies any replacement policy at all when a write fails; the default of true is fine.
As for dfs.client.block.write.replace-datanode-on-failure.policy: under DEFAULT, with 3 or more replicas the client tries to swap in a replacement datanode, while with 2 replicas it simply continues writing without replacing anything. On a 3-datanode cluster a single unresponsive node breaks every write, so it is reasonable to switch replacement off with NEVER.
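These are client-side settings, so they can also be passed per job rather than changed cluster-wide. A sketch, assuming the job driver runs through ToolRunner so that generic -D options are honored (job.jar and MyDriver are placeholder names):

# Disable datanode replacement for this one job only.
hadoop jar job.jar MyDriver \
    -Ddfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
    input_path output_path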
OP | Posted on 2014-5-8 12:36:37
Last edited by 锋云帮主 on 2014-5-8 14:08

Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.
Cause: one of two possibilities: either hadoop.tmp.dir or the data directory is out of space.

Solution: a look at my dfs status showed data usage under 40%, so I guessed that hadoop.tmp.dir was the one short on space, preventing the job's temporary files from being created. It turned out core-site.xml did not configure hadoop.tmp.dir at all, so the default /tmp directory was in use; anything stored there is lost whenever the server reboots, so it had to change. Add:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
Then reformat (hadoop namenode -format) and restart.
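Before settling on that diagnosis, it is worth checking both candidates directly. A quick sketch, assuming the default /tmp (or the /data/tmp configured above) as the local scratch space:

# Free space in the local directories used for spills and job temp files:
df -h /tmp /data/tmp
# HDFS-level usage, for comparison:
hdfs dfsadmin -report | grep -E 'DFS Used%|DFS Remaining%'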
OP | Posted on 2014-6-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

Cause: the local disk ran out of space, not HDFS (I was debugging the program in MyEclipse and the local tmp directory had filled up).
Solution: clean up or add disk space.
OP | Posted on 2014-7-4 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace as above]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace as above]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error is plain enough: out of disk space. The maddening part was that checking each node showed disk usage below 40%, with plenty of room left.

It took a long time to work out what was happening: one map task produced an unusually large amount of output while running, and right before the failure its node's disk usage climbed steadily until it hit 100% and there was nothing left. The task then failed, its space was released, and the attempt was handed to another node. Because the space had already been freed, the disks showed plenty of room by the time I looked, even though the error complained about space.

The lesson here: monitoring while a job runs really matters.
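A minimal way to catch a transient spike like that, assuming the task-local directories sit under /data/tmp as configured above (adjust to your own hadoop.tmp.dir or yarn.nodemanager.local-dirs):

# Sample local-disk usage every 5 seconds while the job runs;
# a climb to 100% just before the task failures confirms the diagnosis.
while true; do
    date '+%F %T'; df -h /data/tmp | tail -1
    sleep 5
done | tee disk-usage.log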
Posted on 2015-4-1 15:50:38
Brother, how did you end up solving that last problem? Is there a fix?
OP | Posted on 2015-4-8 13:39:15
jerryliu_306 posted on 2015-4-1 15:50:
Brother, how did you end up solving that last problem? Is there a fix?

Sorry, only just saw this.
As I explained above, the problem really was insufficient disk space, caused by heavy intermediate output. Those outputs are cleaned up automatically once the job fails, so a later check finds plenty of free space again, which is what produced the misdiagnosis.