Summary of Common Hadoop Errors and Solutions [Updated Regularly]

Posted on 2014-4-29 14:59:45 | Last edited by 锋云帮主 on 2014-4-29 15:15
This thread collects errors commonly encountered in day-to-day Hadoop development and operations, together with their solutions. It is updated from time to time, and additions from everyone are welcome.
Error 1: java.io.IOException: Incompatible clusterIDs, which typically appears after the NameNode has been reformatted
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every time the NameNode is formatted it generates a new namenodeId (clusterID), while the DataNode's data directory still holds the ID from the previous format. Formatting clears the data under the NameNode but not under the DataNodes, so startup fails. The fix is to clear all directories under the DataNode data path before each format.
Solution: stop the cluster and delete everything under the problem node's data directory, i.e. the directory configured as dfs.data.dir in hdfs-site.xml. Then reformat the NameNode.

A less disruptive alternative: stop the cluster, then edit /dfs/data/current/VERSION under the DataNode's data directory and change its clusterID to match the NameNode's, then restart.
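For reference, a minimal sketch of that manual fix (the paths are examples and depend on your dfs.namenode.name.dir and dfs.datanode.data.dir settings; the clusterID value is the one from the log above):
    # On the NameNode: read the current clusterID
    grep clusterID /data/dfs/name/current/VERSION
    # On each affected DataNode: overwrite its clusterID with the NameNode's value
    sed -i 's/^clusterID=.*/clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb/' /data/dfs/data/current/VERSION
    # Restart HDFS
    sbin/stop-dfs.sh && sbin/start-dfs.sh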

OP | Posted on 2015-4-8 14:59:49
2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 41.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

Cause: the NameNode metadata (edit log) is corrupted and needs to be repaired.
Solution: recover the NameNode:
         hadoop namenode -recover
Choose 'c' (continue) at each prompt; that is usually enough.
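A cautious sketch of that recovery (the metadata path is an example; use whatever dfs.namenode.name.dir points to):
    # Back up the NameNode metadata before touching it
    cp -r /data/dfs/name /data/dfs/name.bak
    # Run the recovery tool, answering 'c' (continue) at the prompts
    hadoop namenode -recover
    # Bring the NameNode back up
    sbin/hadoop-daemon.sh start namenode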

OP | Posted on 2015-4-8 13:39:15
Quoting jerryliu_306 (2015-4-1 15:50): "Brother, how did you solve that last problem? Is there a solution?"

Sorry, I only just saw this.
As I explained, the problem really was insufficient disk space; the job produced a lot of intermediate output. After the job fails, that output is cleaned up automatically, so a later check shows plenty of free space again, which leads to a wrong diagnosis.

Posted on 2015-4-1 15:50:38
Brother, how did you solve that last problem? Is there a solution?

OP | Posted on 2014-7-4 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace as above]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same stack trace as above]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error clearly says the disk is out of space, but the frustrating part was that checking each node showed disk usage below 40%, with plenty of room left.

It took a long time to figure out: one map task produces a large amount of output while running. Before the failure, disk usage climbs steadily until it hits 100% and the error is thrown. The task then fails, its intermediate output is released, and the attempt is rescheduled on another node. Because the space has already been freed by the time you look, the disk shows plenty of free space even though the error complained about running out of it.

The lesson: monitoring while jobs are running really matters.
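A simple habit that catches this while it is happening (a sketch; point it at whatever disks hold yarn.nodemanager.local-dirs and the map spill files in your setup):
    # On each worker node, watch local disk usage while the job runs
    watch -n 10 'df -h /data /tmp'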

OP | Posted on 2014-6-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

Cause: the local disk (not HDFS) ran out of space. In my case I was debugging the program in MyEclipse and the local tmp directory filled up.
Solution: clean up the local disk or add space.
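A quick way to see where the local space went (a sketch; /tmp/hadoop-&lt;user&gt; is only the default location derived from hadoop.tmp.dir and may differ in your setup):
    # How full is the local disk?
    df -h /tmp
    # Which directories are taking the space?
    du -sh /tmp/* 2>/dev/null | sort -h | tail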

OP | Posted on 2014-5-8 12:36:37
Last edited by 锋云帮主 on 2014-5-8 14:08

Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.

Cause: two possibilities: either hadoop.tmp.dir or the data directory has run out of space.

Solution: my DFS status showed data usage under 40%, so the likely culprit was hadoop.tmp.dir running out of space, which prevents the job's temporary files from being created. core-site.xml had no hadoop.tmp.dir configured, so the default /tmp directory was in use; anything there is also lost whenever the server reboots, so it should be changed anyway. Add:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>
Then reformat the NameNode (hadoop namenode -format) and restart the cluster.
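After the restart, a couple of quick sanity checks (a sketch; /data/tmp is the value configured above):
    # Confirm the new temp dir is in effect and has room
    hdfs getconf -confKey hadoop.tmp.dir
    df -h /data/tmp
    # Confirm HDFS came back with the expected capacity
    hdfs dfsadmin -report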


OP | Posted on 2014-5-7 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: the write cannot proceed. My environment has 3 DataNodes and the replication factor is set to 3, so a write goes through a pipeline of 3 machines. The default replace-datanode-on-failure.policy is DEFAULT: when the cluster has 3 or more DataNodes, the client tries to find another DataNode to copy to. With only 3 machines there is no spare node, so as soon as one DataNode has a problem, the write can never succeed.
Solution: edit hdfs-site.xml and add or change the following two properties:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.
For dfs.client.block.write.replace-datanode-on-failure.policy, DEFAULT tries to swap in a new DataNode when there are 3 or more replicas, while with 2 replicas it simply continues writing without replacing the node. On a cluster of only 3 DataNodes, a single unresponsive node breaks every write, so it is reasonable to turn the replacement off (NEVER).
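Because this is a client-side setting, it can also be overridden for a single job instead of cluster-wide. A hedged example (the jar, class and paths are hypothetical, and -D only works if the job's driver goes through ToolRunner/GenericOptionsParser):
    hadoop jar my-job.jar com.example.MyJob \
        -D dfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
        -D dfs.client.block.write.replace-datanode-on-failure.enable=true \
        input output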

OP | Posted on 2014-5-6 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: the file operation exceeded its lease; in practice, the file was deleted while the data-stream operation was still in progress.
Solution:
Edit hdfs-site.xml (this property name is for 2.x; in 1.x it should be dfs.datanode.max.xcievers):
<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>
Copy the file to every DataNode and restart the DataNodes.
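A sketch of that rollout (the hostnames are examples):
    # Push the updated hdfs-site.xml to every DataNode
    for h in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
        scp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $h:$HADOOP_HOME/etc/hadoop/
    done
    # Restart the DataNode process on each node
    for h in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
        ssh $h "$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode; $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
    done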

OP | Posted on 2014-5-6 15:14:57
Last edited by 锋云帮主 on 2014-5-6 15:55

Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.
Solution:
Edit hdfs-site.xml and add the two properties dfs.datanode.socket.write.timeout and dfs.socket.timeout.
    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>

    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>
Note: the timeout values are in milliseconds; 0 means no limit.

Comment

Setting it to 0 still produces this error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel  (posted 2014-5-9 16:50)

OP | Posted on 2014-4-29 17:18:44
Last edited by 锋云帮主 on 2014-4-29 17:28
Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: the clocks on the NameNode and DataNodes are not synchronized.
Solution: synchronize the DataNodes with the NameNode. Run ntpdate time.nist.gov on every server and confirm that the time sync succeeds.
It is best to also add a line like this to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
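To confirm the clocks actually agree afterwards, something like this works (a sketch; the hostnames are examples):
    # Print the epoch time on every node so skew is easy to spot
    for h in hadoop-master hadoop-datanode1 hadoop-datanode2; do
        echo -n "$h: "; ssh $h date +%s
    done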