Hadoop Common Errors and Fixes [Updated Regularly]

Posted 2014-4-29 14:59:45

A running collection of errors and fixes encountered in day-to-day development and operations. I'll keep it updated; additions are welcome.
Error 1: java.io.IOException: Incompatible clusterIDs (usually appears after the namenode has been reformatted)
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every namenode format generates a new clusterID, while the data directory still carries the ID from the previous format. Formatting clears the namenode's data but not the datanode's, so the datanode fails to start. The rule: before every format, clear out everything under the data directory.
Solution: stop the cluster and delete everything under the problem node's data directory (the dfs.data.dir directory configured in hdfs-site.xml), then reformat the namenode.
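A minimal sketch of that procedure, assuming dfs.data.dir is /data/dfs/data as in the log above (check your own hdfs-site.xml first):

stop-dfs.sh                  # stop HDFS first
rm -rf /data/dfs/data/*      # on each affected datanode
hadoop namenode -format      # on the namenode; this wipes HDFS metadata!
start-dfs.sh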

A less drastic alternative: stop the cluster, then edit the clusterID in the datanode's /dfs/data/current/VERSION file so it matches the namenode's.
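A sketch of that edit; the namenode path assumes dfs.name.dir is /data/dfs/name (an assumption; adjust both paths to your config):

grep clusterID /data/dfs/name/current/VERSION   # read the namenode's clusterID
# on the problem datanode, overwrite its clusterID with the value printed above
sed -i 's/^clusterID=.*/clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb/' /data/dfs/data/current/VERSION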

OP | Posted 2014-4-29 17:18:44
Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0

Cause: the clocks on the namenode and datanodes are out of sync.
Solution: synchronize every datanode with the namenode. Run on each server: ntpdate time.nist.gov, and confirm the sync succeeded.
Better still, add a line to /etc/crontab on every server:

0 2 * * * root ntpdate time.nist.gov && hwclock -w
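To spot skew quickly, something like this works (the hostnames are placeholders for your own nodes):

# print each node's clock as epoch seconds; large differences mean skew
for h in hadoop-master hadoop-datanode1 hadoop-datanode2; do
    echo -n "$h: "; ssh "$h" date +%s
done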

OP | Posted 2014-5-6 15:14:57
Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.

Solution:
Edit hdfs-site.xml and set the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:

    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>

    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>
Note: the timeout values are in milliseconds; 0 means no limit.
Comment (2014-5-9 16:50): Setting them to 0 still produces the error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel
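After distributing the new hdfs-site.xml, a quick sanity check that the values are actually picked up (hdfs getconf reads the configuration on the machine where you run it):

hdfs getconf -confKey dfs.datanode.socket.write.timeout   # should print 6000000
hdfs getconf -confKey dfs.socket.timeout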

OP | Posted 2014-5-6 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: the file operation outlived its lease; in practice, the file was deleted while the data stream was still working on it.

Solution:
Edit hdfs-site.xml (this is the 2.x property name; on 1.x it is dfs.datanode.max.xcievers):

<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>

Copy the file to every datanode and restart the datanodes.
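One way to roll that out, assuming Hadoop lives under /usr/local/hadoop and these datanode hostnames (both assumptions; substitute your own):

for h in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
    scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml "$h":/usr/local/hadoop/etc/hadoop/
    # bounce only the datanode process on each host
    ssh "$h" '/usr/local/hadoop/sbin/hadoop-daemon.sh stop datanode && /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode'
done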

OP | Posted 2014-5-7 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: writes could not proceed. My environment has 3 datanodes and the replication factor is 3, so every write puts all 3 machines in the pipeline. The default replace-datanode-on-failure policy is DEFAULT: with 3 or more datanodes in the cluster, a failed pipeline node is replaced by another datanode. With only 3 machines there is no spare to swap in, so as soon as one datanode has trouble, the write can never succeed.
Solution: edit hdfs-site.xml, adding or changing these two properties:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default true is fine.
As for dfs.client.block.write.replace-datanode-on-failure.policy: DEFAULT tries to swap in a new datanode when there are 3 or more replicas, whereas with 2 replicas it keeps writing without a replacement. On a cluster with only 3 datanodes, one unresponsive node breaks every write, so it is reasonable to switch replacement off (NEVER).

OP | Posted 2014-5-8 12:36:37
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.

Cause: one of two things: either hadoop.tmp.dir or the data directory has run out of space.
Solution: my dfs status showed data usage below 40%, so I guessed hadoop.tmp.dir had run out of space and the job's temporary files could not be created. Checking core-site.xml showed hadoop.tmp.dir was not configured at all, so the default /tmp was in use; anything there is lost whenever the server reboots, so it had to change. Add:

<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>

Then reformat: hadoop namenode -format
and restart.
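Before reformatting, it's worth confirming where the space went and preparing the new directory. A sketch, assuming the /data/tmp value above and that the daemons run as a 'hadoop' user (both assumptions):

df -h /tmp /data                  # compare free space at the old and new locations
mkdir -p /data/tmp                # create the new hadoop.tmp.dir on every node
chown -R hadoop:hadoop /data/tmp  # match whatever user runs the daemons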

OP | Posted 2014-6-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
Cause: the local disk is out of space, not HDFS. (I was debugging the program in MyEclipse and the local tmp directory had filled up.)
Solution: free up or add disk space, as sketched below.
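A quick way to see how full the local disk is and which Hadoop temp directories are the culprits (the /tmp/hadoop-* pattern matches the default hadoop.tmp.dir location; adjust if you relocated it):

df -h /tmp                                            # how full is the local disk?
du -sh /tmp/hadoop-*/* 2>/dev/null | sort -h | tail   # largest Hadoop temp dirs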

OP | Posted 2014-7-4 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
        [same Spill failed stack trace as above]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
        [same Spill failed stack trace as above]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error plainly says the disk is full, but frustratingly, logging into each node showed disk usage below 40%, with plenty of space left.

It took a long time to work out what was happening: one map task produced a lot of output while running, and before the failure, disk usage climbed steadily until it hit 100% and the error was thrown. The task then failed, its space was released, and the work was assigned to other nodes. Because the space had already been freed, the disk showed lots of room by the time I looked, even though the error complained it had run out.
The lesson from this one: monitoring while the job runs really matters.
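Even a crude watcher catches this pattern (usage spiking to 100%, then dropping after the task dies). A minimal sketch; the mount point and log path are placeholders:

# sample disk usage every 10 seconds while the job runs, then review the log
while true; do
    echo "$(date '+%F %T') $(df -h /data | tail -1)" >> /tmp/disk-watch.log
    sleep 10
done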

Posted 2015-4-1 15:50:38

Brother, how did you end up solving that last problem? Is there a fix?

OP | Posted 2015-4-8 13:39:15

jerryliu_306 posted on 2015-4-1 15:50:
Brother, how did you end up solving that last problem? Is there a fix?

Sorry, only just saw this.
As I explained: the problem really was insufficient disk space, caused by heavy intermediate output. Once the job fails, that output is cleaned up automatically, so a later check shows plenty of free space, which is what caused the misdiagnosis.