A Collection of Common Hadoop Errors and Their Solutions [updated from time to time]

Posted 2014-4-29 14:59:45 (OP)

This post collects common errors (and their fixes) encountered in day-to-day development and operations. It will be kept up to date; additions from everyone are welcome.

Error 1: java.io.IOException: Incompatible clusterIDs, typically seen right after the namenode has been reformatted

2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every namenode format creates a fresh ID, while the data directory still holds the ID from the previous format. Formatting clears the namenode's data but not the datanodes', so the datanode fails to start. The rule is: before every format, clear out all the directories under the data dir.

Solution: stop the cluster and delete everything under the problem node's data directory, i.e. the dfs.data.dir path configured in hdfs-site.xml. Then reformat the namenode.

A less drastic alternative: stop the cluster, then edit the clusterID in the datanode's /dfs/data/current/VERSION file to match the namenode's.
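
A minimal sketch of the second route (the namenode-side path is an assumption; use whatever dfs.name.dir and dfs.data.dir point to on your nodes):

    # on the namenode: read the current clusterID
    grep clusterID /data/dfs/name/current/VERSION
    # clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb

    # on each affected datanode (cluster stopped): make VERSION agree with the namenode
    sed -i 's/^clusterID=.*/clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb/' \
        /data/dfs/data/current/VERSION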

Posted 2014-4-29 17:18:44 (OP)

Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0

Cause: clock skew between the namenode and the datanodes (the container launch token had already expired according to one node's clock).

Solution: synchronize the datanodes' clocks with the namenode. On each server run ntpdate time.nist.gov and confirm the sync succeeds.
Better still, add a line like this to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
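
To spot skew before it bites, you can query each node's offset without stepping its clock (a sketch; the host names are examples and passwordless ssh is assumed):

    # print each node's clock offset from the NTP server
    for h in hadoop-master hadoop-datanode1 hadoop-datanode2; do
        printf '%s: ' "$h"
        ssh "$h" "ntpdate -q time.nist.gov | tail -1"
    done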

Posted 2014-5-6 15:14:57 (OP)

Error 3: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write

2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: I/O timeout on the datanode socket.

Solution:
Edit the Hadoop config file hdfs-site.xml and add settings for the two properties dfs.datanode.socket.write.timeout and dfs.socket.timeout:

    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>

    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>

Note: the timeout values are in milliseconds; 0 means no timeout.

Comment (posted 2014-5-9 16:50):
Even with the value set to 0, the error still occurs: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel
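
Whatever value you settle on, the datanodes only pick up hdfs-site.xml changes after a restart (a minimal sketch using the stock 2.x daemon scripts; assumes $HADOOP_HOME points at the install on each datanode):

    # run on each datanode after editing hdfs-site.xml
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode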

Posted 2014-5-6 15:43:15 (OP)

Error 4: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: the file operation outlived its lease; in effect, the file was deleted while the data stream was still operating on it.

Solution:
Edit hdfs-site.xml (this is the 2.x property name; in 1.x it should be dfs.datanode.max.xcievers):

<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>

Copy the file to every datanode and restart the datanodes.
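
One way to push the change out to all datanodes (a sketch; the host names beyond hadoop-datanode1 are examples, and the same $HADOOP_HOME path is assumed on every node):

    # copy the edited config to each datanode, then bounce its datanode daemon
    for h in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
        scp "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" "$h:$HADOOP_HOME/etc/hadoop/"
        ssh "$h" "$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode; $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
    done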

Posted 2014-5-7 14:41:35 (OP)

Error 5: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)

Cause: the write cannot proceed. My environment has 3 datanodes and the replication factor is set to 3, so a write goes through a pipeline of all 3 machines. The default replace-datanode-on-failure.policy is DEFAULT: when the cluster has 3 or more datanodes, the client looks for another datanode to copy to. With only 3 machines in total there is no spare, so as soon as one datanode has a problem the write can never succeed.

Solution: edit hdfs-site.xml, adding or changing the following two properties:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.
As for dfs.client.block.write.replace-datanode-on-failure.policy: under DEFAULT, with 3 or more replicas the client tries to swap in a new datanode and retry the write, while with 2 replicas it keeps writing without replacing. On a 3-datanode cluster a single unresponsive node breaks every write, so it is reasonable to switch replacement off (NEVER).
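
Because these are client-side settings, they can also be overridden per invocation instead of cluster-wide (a sketch; FsShell accepts -D generic options, and the local file and target path are just examples):

    hadoop fs -D dfs.client.block.write.replace-datanode-on-failure.enable=true \
              -D dfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
              -put localfile.dat /user/hadoop/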

Posted 2014-5-8 12:36:37 (OP)

Error 6: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for

14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.

Cause: one of two things; either hadoop.tmp.dir or the data directory has run out of space.

Solution: my dfs status showed data usage below 40%, so I concluded that hadoop.tmp.dir was out of space, which prevented the job from creating its temporary files. Checking core-site.xml revealed that hadoop.tmp.dir was not configured at all, so the default /tmp directory was in use; anything there is lost whenever the server reboots, so it had to change. Add:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>

Then reformat (hadoop namenode -format) and restart.
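
Before reaching for a reformat, it is worth confirming which volume is actually full (a quick sketch; the local paths are examples, and the exact dfsadmin field names can vary by version):

    # local capacity on the suspect directories
    df -h /tmp /data/tmp
    # per-datanode HDFS usage, for comparison
    hdfs dfsadmin -report | grep -E 'Name:|DFS Used%|DFS Remaining%'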

Posted 2014-6-19 10:27:52 (OP)

Error 7: java.io.IOException: Spill failed

2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

Cause: the local disk (not HDFS) is out of space. In my case I was debugging the program in MyEclipse and the local tmp directory had filled up.

Solution: clean the directory out, or add space.
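
To see what is filling the local tmp directory (a sketch; the path is an example):

    # largest entries last
    du -sh /tmp/* 2>/dev/null | sort -h | tail -20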

Posted 2014-7-4 08:42:16 (OP)

2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
[the diagnostics reports at 10:21:01,513 and 10:21:01,514 repeat the same Spill failed stack trace twice; omitted]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

The error plainly says the disk is full. Frustratingly, though, logging into each node showed disk usage below 40%, with plenty of space left.

It took a long while to work out: one map task produces a lot of output while it runs, and before the failure the disk usage climbs steadily until it hits 100% and the task errors out for lack of space. The failed attempt is then cleaned up, the space is released, and the task is handed to another node. Because the space had already been freed, the disks looked half-empty even though the "no space" error was genuine.

The lesson here: monitoring during the run matters (see the sketch below).
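
A crude live check along those lines (a sketch; run it on each worker while the job is in flight, with paths and interval adjusted to taste):

    # timestamped disk usage every 30 seconds
    while true; do date; df -h /tmp /data | tail -n +2; sleep 30; done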

Posted 2015-4-1 15:50:38 (jerryliu_306)

Mate, how did you end up solving that last problem? Is there a fix?

Posted 2015-4-8 13:39:15 (OP)

jerryliu_306 wrote on 2015-4-1 15:50:
    Mate, how did you end up solving that last problem? Is there a fix?

Sorry, only just saw this.
As explained above, the problem really was insufficient disk space, with a lot of intermediate output. Once the run failed, that output was cleaned up automatically, so a later check found plenty of free space, which is what caused the misdiagnosis.