Hadoop Common Errors and Fixes, a Running List [Updated Regularly]

Posted 2014-4-29 14:59:45

This is a collection of errors commonly encountered in day-to-day development and operations, together with their fixes. It will be kept up to date; additions from everyone are welcome.

Error 1: java.io.IOException: Incompatible clusterIDs, which typically shows up after the namenode has been reformatted

2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every namenode format creates a new namenode ID (clusterID), while the data directory still holds the ID from the previous format. Formatting wipes the data under the namenode but not the data under the datanodes, so the datanode fails at startup. What needs to be done is to clear all directories under the data dir before each format.

Solution: stop the cluster and delete everything under the data directory on the affected node, i.e. the dfs.data.dir directory configured in hdfs-site.xml. Then reformat the namenode.

A less drastic alternative: stop the cluster, then edit /dfs/data/current/VERSION on the datanode and change its clusterID to match the namenode's.
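
As a rough sketch of both fixes. The data dir /data/dfs/data comes from the log above; the namenode metadata dir /data/dfs/name is an assumption, as is having the Hadoop 2.x sbin scripts on the PATH, so adjust for your own layout:

# Option 1: wipe the datanode storage and reformat (this discards existing HDFS data)
stop-dfs.sh
rm -rf /data/dfs/data/*                                   # dfs.data.dir from hdfs-site.xml
hadoop namenode -format
start-dfs.sh

# Option 2: keep the data and just realign the datanode's clusterID
stop-dfs.sh
NN_CID=$(grep clusterID /data/dfs/name/current/VERSION | cut -d= -f2)   # namenode dir path is an assumption
sed -i "s/clusterID=.*/clusterID=${NN_CID}/" /data/dfs/data/current/VERSION
start-dfs.sh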

OP | Posted 2015-4-8 14:59:49
2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 41.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

Cause: the namenode metadata is corrupted and needs to be repaired.
Solution: recover the namenode:
        hadoop namenode -recover
Choose "c" at every prompt and it is usually fine.
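
A rough sequence for the recovery, assuming a Hadoop 2.x install whose sbin scripts are on the PATH and whose namenode metadata dir (dfs.namenode.name.dir) is /data/dfs/name; that path is an assumption. Backing the directory up first is worthwhile, since -recover may drop edits it cannot repair:

stop-dfs.sh
cp -r /data/dfs/name /data/dfs/name.bak-$(date +%F)    # back up the namenode metadata first
hadoop namenode -recover                               # answer "c" (continue) at the prompts
start-dfs.sh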

OP | Posted 2015-4-8 13:39:15
jerryliu_306 posted 2015-4-1 15:50:
Hey, how did you end up solving that last problem? Did you find a fix?

Sorry, I only just saw this.
As I explained above, the problem really was a lack of disk space; the job produced a lot of intermediate output. Once the attempt failed, that output was cleaned up automatically, so a later check showed plenty of free space, which is what caused the misdiagnosis.

jerryliu_306 | Posted 2015-4-1 15:50:38
Hey, how did you end up solving that last problem? Did you find a fix?

OP | Posted 2014-7-4 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace as above]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace as above]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

The error is clear enough: the disk ran out of space. The frustrating part was that when I checked each node, disk usage was under 40%, with plenty of room left.

It took quite a while to figure out that one map task produced a large amount of output while it ran: before the failure, disk usage climbed steadily until it hit 100% and the error was thrown. The attempt then failed, its output was released, and the task was handed to another node. Because the space had already been freed, the disk showed plenty of remaining room when I looked, even though the error complained about insufficient space.

The lesson here is that monitoring while the job is running really matters.
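
One simple way to catch this kind of transient disk pressure is to watch the local scratch directories on each node while the job runs. The hostnames and the /data/tmp path below are placeholders for whatever yarn.nodemanager.local-dirs / hadoop.tmp.dir point to on your cluster:

# poll every node once
for h in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
  ssh "$h" 'echo "$(hostname):"; df -h /data/tmp'
done

# or keep a live view on a single suspect node
watch -n 10 'df -h /data/tmp'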

OP | Posted 2014-6-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

Cause: the local disk is out of space, not HDFS. (I was debugging the program in MyEclipse and the local tmp directory filled up.)
Solution: clean the directory up or add more space.
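
For a local-mode run like this, a couple of quick checks may help; the paths assume the Hadoop 2.x default hadoop.tmp.dir of /tmp/hadoop-${user.name}, which is an assumption about the setup:

df -h /tmp                                      # how full is the volume holding hadoop.tmp.dir
du -sh /tmp/hadoop-*/mapred/local 2>/dev/null   # local-mode spill files normally land here
# if /tmp is too small, set hadoop.tmp.dir (e.g. to /data/tmp) in the job's core-site.xml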


OP | Posted 2014-5-8 12:36:37

Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.

Cause: two possibilities; either hadoop.tmp.dir or the data directory has run out of space.

Solution: I checked my DFS status and the data directories were under 40% used, so I concluded that hadoop.tmp.dir was out of space, which prevented the job's temporary files from being created. Looking at core-site.xml, hadoop.tmp.dir was not configured, so the default /tmp directory was being used; anything there is also lost whenever the server reboots, so it needs to be changed. Add:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>
Then reformat: hadoop namenode -format
and restart.
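
Keep in mind that reformatting the namenode wipes the existing HDFS namespace, so this is only painless on a new or disposable cluster. A rough command sequence, where the "hadoop" service user and having the sbin scripts on the PATH are assumptions:

stop-dfs.sh
mkdir -p /data/tmp && chown -R hadoop:hadoop /data/tmp   # "hadoop" user is an assumption
# add the hadoop.tmp.dir property above to core-site.xml on every node, then:
hadoop namenode -format
start-dfs.sh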


OP | Posted 2014-5-7 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)

Cause: the write cannot proceed. My environment has 3 datanodes and the replication factor is set to 3, so each write goes to all 3 machines in the pipeline. The default replace-datanode-on-failure policy is DEFAULT: with 3 or more datanodes in the cluster, the client tries to find another datanode to replace a failed one in the pipeline. Since there are only 3 machines in total, there is no spare node to switch to, so as soon as one datanode has a problem the write can never succeed.

Solution: edit hdfs-site.xml and add or modify the following two properties:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.
dfs.client.block.write.replace-datanode-on-failure.policy: with DEFAULT, the client tries to replace the failed datanode and keep writing when there are 3 or more replicas, while with 2 replicas it does not replace the datanode and simply continues writing. On a cluster with only 3 datanodes, a single unresponsive node is enough to block writes, so the replacement can be turned off (NEVER).
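
If in doubt about which values the client actually picks up, a quick check with hdfs getconf (a Hadoop 2.x command that reads the configuration on the client's classpath) can help:

hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.enable
hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy
hdfs getconf -confKey dfs.replication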

OP | Posted 2014-5-6 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: the file operation outlived its lease; in effect, the file was deleted while the data stream operation was still in progress.

Solution:
Edit hdfs-site.xml (this property name is for 2.x; on 1.x it should be dfs.datanode.max.xcievers):
<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>
Copy the file to every datanode and restart the datanodes.
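
One way to roll the change out, as a rough sketch: the hostnames are placeholders, and it assumes passwordless SSH plus the same $HADOOP_HOME layout on every node:

for h in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
  scp $HADOOP_HOME/etc/hadoop/hdfs-site.xml "$h:$HADOOP_HOME/etc/hadoop/"
  # restart only the datanode daemon on each node
  ssh "$h" "$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode; $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
done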

OP | Posted 2014-5-6 15:14:57

Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: I/O timeout.

Solution:
Edit the Hadoop configuration file hdfs-site.xml and add settings for the two properties dfs.datanode.socket.write.timeout and dfs.socket.timeout:
    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>

    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>

Note: the timeout values are in milliseconds; 0 means no limit.

Comment

Setting it to 0 still produces this error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel  (posted 2014-5-9 16:50)

OP | Posted 2014-4-29 17:18:44

Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0

Cause: the clocks on the namenode and datanodes are out of sync.

Solution: synchronize every datanode with the namenode. Run ntpdate time.nist.gov on each server and confirm the time sync succeeds.
It is best to also add a line like the following to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
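
A quick way to eyeball the remaining drift after syncing; the hostnames below are placeholders for your own nodes:

for h in hadoop-master hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
  ssh "$h" 'echo "$(hostname): $(date +%s)"'
done
# the epoch seconds printed for each node should agree to within a second or two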
