Common Hadoop Errors and Solutions [Updated Regularly]

Posted 2014-4-29 14:59:45
Last edited by 锋云帮主 on 2014-4-29 15:15
A collection of common errors seen in day-to-day development and operations, together with their solutions. Kept updated over time; additions from everyone are welcome.
Error 1: java.io.IOException: Incompatible clusterIDs. This usually appears after the namenode has been reformatted.
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
Cause: every namenode format generates a new clusterID, while the datanode data directory still holds the clusterID from the previous format. Formatting clears the namenode's data but leaves the datanodes' data untouched, so the datanodes fail at startup. The rule: before every format, clear out everything under the data directories.

Solution: stop the cluster, delete everything under the problem node's data directory (the dfs.data.dir path configured in hdfs-site.xml), then reformat the namenode.

An alternative that saves more work: stop the cluster, then edit the clusterID in the datanode's /dfs/data/current/VERSION file to match the namenode's, and restart.
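For reference, both fixes as a minimal shell sketch. The paths are examples taken from the log above (substitute your own dfs.data.dir and dfs.name.dir), and note that option 1 destroys all existing HDFS data:

# Option 1: wipe the datanode storage directory and reformat
stop-dfs.sh
rm -rf /data/dfs/data/*                        # dfs.data.dir on the problem datanode
hdfs namenode -format
start-dfs.sh

# Option 2: make the datanode's clusterID match the namenode's
grep clusterID /data/dfs/name/current/VERSION  # on the namenode: note the CID value
vi /data/dfs/data/current/VERSION              # on the datanode: set clusterID to that value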

OP | Posted 2014-4-29 17:18:44
Last edited by 锋云帮主 on 2014-4-29 17:28
Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: clock skew between the namenode and the datanodes; the container token has already expired according to the node's clock.

Solution: synchronize the datanodes' clocks with the namenode. On every server run ntpdate time.nist.gov and confirm the clocks now agree.
Better still, add a line to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
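A quick way to spot-check skew across the cluster, assuming passwordless SSH is set up (the hostnames are examples):

for h in hadoop-master hadoop-datanode1 hadoop-datanode2; do
    printf '%s: ' "$h"; ssh "$h" date +%s   # epoch seconds; healthy nodes differ by at most a second or two
done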

OP | Posted 2014-5-6 15:14:57
Last edited by 锋云帮主 on 2014-5-6 15:55
Error 3: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: I/O timeout.

Solution:
Edit the Hadoop configuration file hdfs-site.xml and add the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:

    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>

    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>

Note: the timeout limits are in milliseconds; 0 means no limit.
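After pushing the file out and restarting, the values can be spot-checked with hdfs getconf. A sketch; note this reads the local configuration files rather than asking the live daemon:

hdfs getconf -confKey dfs.datanode.socket.write.timeout
hdfs getconf -confKey dfs.socket.timeout
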
Comment

Setting it to 0 still produces the error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel  (posted 2014-5-9 16:50)

OP | Posted 2014-5-6 15:43:15
Error 4: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: the file's lease expired during the operation; in effect, the file was deleted while the data stream was still writing it.

Solution:
Edit hdfs-site.xml (this is for 2.x; in 1.x the property is named dfs.datanode.max.xcievers):

<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>

Copy the file to every datanode and restart the datanodes.
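To judge whether the transfer-thread limit is actually being approached, one rough check is to count the datanode's active xceiver threads. A sketch under stated assumptions: jstack is on the PATH, the DataNode's worker threads carry "DataXceiver" in their names, and the pgrep pattern matches your DataNode JVM command line:

DN_PID=$(pgrep -f 'hdfs.server.datanode.DataNode' | head -1)   # find the DataNode JVM pid
jstack "$DN_PID" | grep -c DataXceiver                         # compare against dfs.datanode.max.transfer.threads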

OP | Posted 2014-5-7 14:41:35
Error 5: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: the write cannot proceed. My environment has 3 datanodes and the replication factor is set to 3, so a write puts all 3 machines in the pipeline. The default replace-datanode-on-failure policy is DEFAULT: with 3 or more datanodes in the cluster, a failed node in the pipeline is supposed to be replaced by another datanode. Since there are only 3 machines in total, there is no spare to swap in, so as soon as any one datanode has a problem, the write can never succeed.
Solution: edit hdfs-site.xml, adding or changing these two properties:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies any replacement policy at all after a write failure; the default of true is fine.
dfs.client.block.write.replace-datanode-on-failure.policy set to DEFAULT tries to swap in a new datanode when there are 3 or more replicas, but with 2 replicas it keeps writing without a replacement. On a cluster with only 3 datanodes, a single unresponsive node breaks every write, so it is reasonable to switch replacement off (NEVER).
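Since these are client-side settings, they can also be overridden per job instead of cluster-wide. A sketch assuming the job's driver goes through ToolRunner/GenericOptionsParser (the jar and class names here are made up):

hadoop jar my-job.jar com.example.MyDriver \
    -Ddfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
    /input /output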

OP | Posted 2014-5-8 12:36:37
Last edited by 锋云帮主 on 2014-5-8 14:08
Error 6: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.
Cause: two possibilities: either hadoop.tmp.dir or the data directory has run out of space.

Solution: my DFS status showed data usage below 40%, so I inferred that hadoop.tmp.dir had run out of space, leaving the job unable to create its temporary files. It turned out core-site.xml did not set hadoop.tmp.dir at all, so the default /tmp directory was in use, where data is lost whenever the server reboots, so it needed changing. Add:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp</value>
</property>

Then reformat: hadoop namenode -format
And restart.
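Before pointing the finger at /tmp, it is worth confirming where the space went and preparing the new directory. A small sketch with example paths and an assumed hadoop user/group:

df -h /tmp /data/tmp                # free space on the default and the new location
mkdir -p /data/tmp                  # create the new hadoop.tmp.dir
chown -R hadoop:hadoop /data/tmp    # owner must be the user that runs Hadoop (assumption)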

OP | Posted 2014-6-19 10:27:52
Error 7: java.io.IOException: Spill failed

2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
Cause: the local disk was out of space, not HDFS. (I was debugging the program in MyEclipse, and the local tmp directory had filled up.)
Solution: clean up the disk or add space.
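To find what is filling the local disk, something like the following helps (the path is an example):

du -sh /tmp/* 2>/dev/null | sort -h | tail -10   # the ten biggest entries under /tmp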

OP | Posted 2014-7-4 08:42:16
Error 8: java.io.IOException: Spill failed, even though the disks looked far from full

2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace repeated]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace repeated]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

The message clearly says the disk is full, yet, maddeningly, logging into each node showed disk usage under 40%, with plenty of space left.

It took a long time to work out what was happening: one map task produced a great deal of intermediate output while running. In the run-up to the failure, disk usage climbed steadily until it hit 100%, at which point the task errored out. The failed task then released its space and was reassigned to another node. Because that space had already been freed, the disks showed plenty of room by the time I checked, even though the out-of-space error was genuine.

The lesson: monitoring the cluster while jobs are running matters.
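A crude watcher that would have caught this in the act: a sketch that samples the spill directory's usage every few seconds while the job runs (the path is an example; point it at your hadoop.tmp.dir or yarn.nodemanager.local-dirs):

while true; do
    echo "$(date '+%F %T')  $(df -h /data/tmp | tail -1)"   # timestamped one-line usage snapshot
    sleep 5
done >> ~/disk-watch.log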

Posted 2015-4-1 15:50:38

How did you solve the last problem in the end, friend? Is there a fix?

OP | Posted 2015-4-8 13:39:15

Quoting jerryliu_306 (2015-4-1 15:50):
    How did you solve the last problem in the end, friend? Is there a fix?

Sorry, only just saw this.
As I explained above, the problem really was insufficient disk space: the job wrote a lot of intermediate output. Once the attempt failed, that output was cleaned up automatically, so a later check found plenty of free space, which is what caused the misjudgment.