Hadoop Common Errors and Solutions Roundup [Updated Frequently]

Posted on 2014-4-29 14:59:45
Last edited by 锋云帮主 on 2014-4-29 15:15
This thread collects common errors and fixes from day-to-day development and operations. It is updated regularly, and additions from everyone are welcome.
Error 1: java.io.IOException: Incompatible clusterIDs — usually appears after the namenode has been reformatted.

2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
Cause: each namenode format generates a new namenode clusterID, while the data directories still hold the ID from the previous format. Formatting clears the namenode's data but leaves the datanodes' data untouched, so the datanodes fail at startup. The remedy is to clear out the data directories before every format.

Solution: stop the cluster, delete everything under the problem node's data directory (the dfs.data.dir directory configured in hdfs-site.xml), then reformat the namenode.

An even easier alternative: stop the cluster, then edit the VERSION file under the datanode's data directory (/dfs/data/current/VERSION) and change its clusterID to match the namenode's.
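A minimal shell sketch of both fixes (paths are examples; substitute your configured dfs.data.dir and namenode directory):

# Fix A: wipe the datanode data dir, then reformat (destroys the blocks stored there)
stop-all.sh                     # or stop-dfs.sh on 2.x
rm -rf /data/dfs/data/*
hadoop namenode -format

# Fix B: keep the blocks and just realign the clusterID
grep clusterID /data/dfs/name/current/VERSION   # read the namenode's clusterID
vi /data/dfs/data/current/VERSION               # set clusterID= to that same value
start-all.sh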
" u, M- l, E6 m/ I" q/ _0 a) s
学大数据 到大讲台
回复

使用道具 举报

OP | Posted on 2015-4-8 14:59:49
2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 41.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

Cause: the namenode metadata is damaged and needs to be repaired.

Solution: recover the namenode:
        hadoop namenode -recover
        Choose "c" at each prompt, which usually does the trick.
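Since recovery rewrites the metadata, it is worth snapshotting the name directory first. A sketch, assuming dfs.namenode.name.dir is /data/dfs/name:

# back up the namenode metadata before attempting recovery
cp -r /data/dfs/name /data/dfs/name.bak
hadoop namenode -recover   # answer c (continue) at the prompts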

OP | Posted on 2015-4-8 13:39:15
jerryliu_306 posted on 2015-4-1 15:50:
Brother, how did you end up solving that last problem? Is there a solution?

Sorry, only just saw this.
As I explained, the problem was simply insufficient disk space: the job wrote a lot of intermediate output. Once the job failed, that output was automatically cleaned up, so a later check showed plenty of free space and led to the wrong conclusion.

Posted on 2015-4-1 15:50:38
Brother, how did you end up solving that last problem? Is there a solution?

OP | Posted on 2014-7-4 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
[the same "Spill failed" stack trace is repeated twice more as diagnostics reports for attempt_1403488126955_0002_m_000000_0, logged at 10:21:01,513 and 10:21:01,514]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error looks obvious: not enough disk space. The frustrating part was that checking each node showed disk usage under 40%, with plenty of space still free.

It took a long stretch of head-scratching to find the cause: one map task produced a great deal of output while it ran, and disk usage climbed steadily until it hit 100% and the task errored out. After the task failed, its output was released and the task was handed to another node. Because the space had been freed, the disk showed ample room on inspection, even though the error complained about space.
What this problem teaches us: monitoring during the run is important.
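For instance, a minimal sketch of watching local-disk headroom on a node while a job runs (assumes the YARN local dirs sit under /data/tmp; substitute your yarn.nodemanager.local-dirs):

# refresh disk usage of the task-local directory every 5 seconds
watch -n 5 'df -h /data/tmp'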

OP | Posted on 2014-6-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
Cause: insufficient local disk space, not HDFS (I was debugging the program in MyEclipse and the local tmp directory filled up).
Solution: clean up or add space, e.g. as sketched below.
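A rough sketch of the check and cleanup (assumes the default local scratch location /tmp/hadoop-$USER; your hadoop.tmp.dir may differ):

df -h /tmp                   # check free space on the partition backing the local tmp dir
rm -rf /tmp/hadoop-$USER/*   # clear stale local-mode job data (only with no jobs running)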


OP | Posted on 2014-5-8 12:36:37
Last edited by 锋云帮主 on 2014-5-8 14:08

Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for …
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.

Cause: two possibilities — either hadoop.tmp.dir or the data directory has run out of space.
Solution: my DFS status showed data usage under 40%, so I inferred that hadoop.tmp.dir had run out of space and the job's temporary files could not be created. It turned out core-site.xml had no hadoop.tmp.dir configured, so the default /tmp directory was in use; anything there is lost whenever the server reboots, so it needs to be changed. Add:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>
Then reformat (hadoop namenode -format) and restart.
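A sketch of preparing the new directory on each node before the reformat (the path and the service user are assumptions about your setup):

mkdir -p /data/tmp
chown -R hadoop:hadoop /data/tmp   # assuming the daemons run as user "hadoop"
hadoop namenode -format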


OP | Posted on 2014-5-7 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: the write cannot proceed. My environment has three datanodes and the replication factor is set to 3, so a write goes to all three machines in the pipeline. The default replace-datanode-on-failure policy is DEFAULT: when the cluster has three or more datanodes, the client looks for another datanode to copy to. With only three machines there is no spare, so as soon as one datanode has a problem, the write can never succeed.

Solution: edit hdfs-site.xml and add or modify the following two properties:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine. As for dfs.client.block.write.replace-datanode-on-failure.policy, DEFAULT tries to swap in a new datanode when there are three or more replicas, while with two replicas it simply keeps writing without a replacement. On a three-datanode cluster there is no spare node to swap in, so a single unresponsive node breaks every write; it is therefore reasonable to turn replacement off with NEVER.
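Note these are client-side settings, so they must be in the configuration the writing client actually reads. A quick sanity check (a sketch using the stock hdfs getconf tool):

hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy   # expect NEVER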

OP | Posted on 2014-5-6 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
& i! ?2 J7 z. o! D) v8 c2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010ataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
# S$ }+ r+ d8 _9 @, \java.io.IOException: Premature EOF from inputStream
) B8 ^: }+ X, O% G        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
9 n  Y$ Y0 m7 Y  o        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)/ p1 ?8 m; p0 Y) [+ [+ V6 n
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)5 Q3 i) n( i3 I7 c/ K
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)4 _0 U$ L# ]  [) _) r
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
/ |3 ~4 x0 g( \1 J, ]& \0 d& g        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
6 }. V7 ^9 t' h% O+ l- j9 [        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
3 ~* I. S8 j; Y) J        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
1 l: k7 _. \* X, w. t+ r2 ?        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)7 y% @2 N. w5 p1 R8 F
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
) A- K: e6 i& N8 V        at java.lang.Thread.run(Thread.java:722)
$ `6 Z4 r1 |! l7 x
" K1 S5 d! \" `- {* g原因:文件操作超租期,实际上就是data stream操作过程中文件被删掉了。2 I" r# u3 d/ o9 q$ d5 @
# b) w9 o( m9 Y9 _, n
Solution:
Edit hdfs-site.xml (this is for 2.x; on 1.x the property name should be dfs.datanode.max.xcievers):
<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>
Copy the file to every datanode and restart the datanodes.
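To gauge whether the transfer-thread ceiling is actually being approached, a rough sketch is to count DataXceiver threads in a thread dump of the datanode process (the pgrep pattern is an assumption about your process naming):

jstack $(pgrep -f DataNode) | grep -c 'DataXceiver'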

OP | Posted on 2014-5-6 15:14:57
Last edited by 锋云帮主 on 2014-5-6 15:55
Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.

Solution:
Edit hdfs-site.xml to add settings for the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:
    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>
    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>
Note: the timeout values are in milliseconds; 0 means no limit.

Comment:
Setting them to 0 still produces this error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel — posted 2014-5-9 16:50

OP | Posted on 2014-4-29 17:18:44
Last edited by 锋云帮主 on 2014-4-29 17:28
Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: the clocks on the namenode and datanodes are out of sync.
Solution: synchronize the datanodes' clocks with the namenode. On each server run ntpdate time.nist.gov and confirm that the sync succeeds.
Better still, add a line like this to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
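A quick sketch for eyeballing clock skew across the cluster (hostnames are examples):

for h in hadoop-master hadoop-datanode1 hadoop-datanode2; do ssh $h date; done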
