A Roundup of Common Hadoop Errors and Their Solutions (Updated Regularly)

Posted 2014-04-29 14:59:45
Last edited by 锋云帮主 on 2014-04-29 15:15

This thread collects common errors (and their fixes) encountered in day-to-day Hadoop development and operations. It is kept up to date, and additions are welcome.
Error 1: java.io.IOException: Incompatible clusterIDs — typically appears after the namenode has been reformatted
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every namenode format generates a new clusterID, while the datanode's data directory still holds the ID from the previous format. Formatting wipes the namenode's data but not the datanodes', so the datanodes fail to register at startup. The rule is: before every format, clear all directories under the data path.
Solution: stop the cluster, delete everything under the problem node's data directory (the dfs.data.dir directory configured in hdfs-site.xml), then reformat the namenode.
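A minimal sketch of that sequence, assuming the /data/dfs/data path from the log above and the standard Hadoop sbin scripts on PATH (substitute your own dfs.data.dir):

stop-dfs.sh                      # stop the cluster first
rm -rf /data/dfs/data/*          # clear the datanode storage dir on the problem node
hadoop namenode -format          # reformat the namenode
start-dfs.sh                     # restart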
A less disruptive alternative: stop the cluster first, then change the clusterID in the datanode's /dfs/data/current/VERSION file to match the namenode's, and restart.
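A sketch of that edit, using the clusterID from the log above (the namenode metadata path /data/dfs/name is an assumption; check your dfs.name.dir):

# find the namenode's clusterID
grep clusterID /data/dfs/name/current/VERSION
# clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb
# write the same value into the datanode's VERSION file
sed -i 's/^clusterID=.*/clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb/' /data/dfs/data/current/VERSION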

OP | Posted 2015-04-08 14:59:49
2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 41.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
Cause: the namenode metadata (edit log) is corrupted and needs repair.

Solution: recover the namenode:
        hadoop namenode -recover
Choosing 'c' (continue) at each prompt is usually enough.
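A sketch of the whole recovery sequence (back up the metadata directory first, since recovery can discard edit-log transactions; the /data/dfs/name path is an assumption, check your dfs.name.dir):

hadoop-daemon.sh stop namenode                 # make sure the namenode is down
cp -r /data/dfs/name /data/dfs/name.bak        # back up metadata before recovering
hadoop namenode -recover                       # answer 'c' (continue) at the prompts
hadoop-daemon.sh start namenode                # restart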

OP | Posted 2015-04-08 13:39:15
jerryliu_306 posted on 2015-04-01 15:50:
How did you end up solving that last problem? Did you find a fix?
Sorry, only just saw this. As I explained above, the problem was simply insufficient disk space: the job wrote a lot of intermediate output, but once the task failed that output was cleaned up automatically. So when I checked afterwards the disks showed plenty of free space, which sent me down the wrong track.

Posted 2015-04-01 15:50:38
How did you end up solving that last problem? Did you find a fix?

OP | Posted 2014-07-04 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
        [same stack trace as above]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
        [same stack trace as above]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error is plainly "not enough disk space". The frustrating part: logging in to each node showed disk usage under 40%, with plenty of space left.

It took a long while to find the real cause: one map task produced a lot of output while it ran. Before the failure, disk usage climbed steadily until it hit 100% and the task errored out. The failed attempt's output was then cleaned up, freeing the space, and the task was handed to another node. Because the space had already been released, the disks looked half empty whenever I checked, even though the error complained about space.

The lesson: monitoring the cluster while jobs are running really matters.
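One way to catch a transient spike like this is to sample disk usage on each node for the duration of the job; a minimal sketch (the /data mount, log path, and interval are arbitrary choices):

# run on each worker node while the job is active
while true; do
    echo "$(date '+%F %T') $(df -h /data | tail -1)" >> /tmp/disk-watch.log
    sleep 10
done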

OP | Posted 2014-06-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
Cause: the local disk was out of space, not HDFS. (I was debugging the program in MyEclipse, and the local tmp directory had filled up.)
Solution: clean up or add local disk space.
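A quick way to confirm and clean this up (a sketch; /tmp/hadoop-$USER is the default local scratch location when hadoop.tmp.dir is not configured, so adjust if yours differs):

df -h /tmp                                     # free space on the tmp filesystem
du -sh /tmp/hadoop-* 2>/dev/null | sort -h     # find the biggest offenders
rm -rf /tmp/hadoop-$USER/mapred/local/*        # clear stale spill files (only with no jobs running)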

OP | Posted 2014-05-08 12:36:37
Last edited by 锋云帮主 on 2014-05-08 14:08
" s: G% q- V7 }0 |% b  D; b: |; o$ u错误:org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for $ X7 Z& h1 X4 y; q, Z2 f
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
* D2 f" V7 {/ v, N( ]0 bError: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out  R1 \& ^1 `* n. P
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
  Z: Q" h- S3 l( N/ J  Z( s        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)# ~5 @7 Y2 o/ @, a# H
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)8 N  z9 ^' O: H" r* {  D" m
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)6 N$ K& b. B( ~2 F7 K- m
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)7 H! e! z$ ^7 I' f
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)+ o1 U0 B% s! [1 H  W6 `+ H
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
" `5 ?& Z4 _' @6 ^; v' }        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)4 n. e( U7 N1 Q) F/ c5 G
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
1 p+ c3 C+ W  ]$ z# q; R        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
# d. ^# H1 e7 ~) |6 c        at java.security.AccessController.doPrivileged(Native Method)1 h7 a3 q; D8 G$ s2 E* X) f
        at javax.security.auth.Subject.doAs(Subject.java:415)
, `# A4 J7 g* |+ I, f) g2 D        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)* a4 A2 H. R8 u' Z, G% G
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
! o9 A" n* _! M& M  W- B" M5 Z3 ]5 K1 d9 g6 i
Container killed by the ApplicationMaster.
7 J+ V" q: _, [4 `' c
Cause: two possibilities — either hadoop.tmp.dir or the data directory has run out of space.

Solution: my DFS status showed data usage under 40%, so I concluded hadoop.tmp.dir had run out of space and the job's temp files could not be created. Checking core-site.xml, hadoop.tmp.dir was not configured, so the default /tmp directory was in use; anything there is lost whenever the server reboots, so it needs changing anyway. Add:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>

Then reformat (hadoop namenode -format) and restart.
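To confirm which value is actually in effect after the change (a sketch; hdfs getconf ships with Hadoop 2.x):

hdfs getconf -confKey hadoop.tmp.dir
# should print /data/tmp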

OP | Posted 2014-05-07 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: writes could not proceed. My environment has three datanodes and the replication factor is set to 3, so every write pipeline uses all three machines. The default replace-datanode-on-failure policy is DEFAULT: when the cluster has three or more datanodes, a failed pipeline node is replaced with another datanode. With only three machines total there is no spare to swap in, so as soon as one datanode has a problem, writes keep failing.

Solution: modify hdfs-site.xml, adding or changing these two properties:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies any replacement policy at all when a write fails; the default true is fine. dfs.client.block.write.replace-datanode-on-failure.policy set to DEFAULT tries to swap in a new datanode when there are three or more replicas, but with two replicas it simply keeps writing without replacing anything. On a three-datanode cluster a single unresponsive node breaks every write, so it is reasonable to turn replacement off with NEVER.
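Since both properties are client-side, they can also be passed per job instead of editing the cluster-wide config; a sketch, assuming a ToolRunner-based job (myjob.jar and MyJob are placeholders):

hadoop jar myjob.jar MyJob \
    -D dfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
    -D dfs.client.block.write.replace-datanode-on-failure.enable=true \
    <input> <output>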

OP | Posted 2014-05-06 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: the file operation exceeded its lease: in effect, the file was deleted while the data stream was still operating on it.
Solution: modify hdfs-site.xml (for 2.x; in 1.x the property is named dfs.datanode.max.xcievers):

<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>

Copy the file to every datanode and restart the datanodes.
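A sketch of rolling that out (hostnames follow the log above; adjust the host list and HADOOP_HOME to your cluster):

for host in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
    scp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $host:$HADOOP_HOME/etc/hadoop/
    ssh $host "$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode; \
               $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
done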

OP | Posted 2014-05-06 15:14:57
Last edited by 锋云帮主 on 2014-05-06 15:55
Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.

Solution: modify hdfs-site.xml, adding the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:

    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>
    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>

Note: the timeout values are in milliseconds; 0 means no limit.

Comment (posted 2014-05-09 16:50):
Setting it to 0 still triggers this error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel

OP | Posted 2014-04-29 17:18:44
Last edited by 锋云帮主 on 2014-04-29 17:28
Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: clock skew between the namenode and the datanodes — the container launch token had already expired according to the receiving node's clock.

Solution: synchronize the datanodes' clocks with the namenode. Run on every server: ntpdate time.nist.gov, and confirm the sync succeeds.
Better still, add a line to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
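To check how far a node has drifted before forcing a sync (a sketch; ntpdate -q only queries the server without setting the clock, and the hostnames are examples):

ntpdate -q time.nist.gov                       # report the offset, change nothing
for host in hadoop-master hadoop-datanode1 hadoop-datanode2; do
    echo -n "$host: "; ssh $host date +%s      # compare epoch seconds across nodes
done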
