Summary of Common Hadoop Errors and Solutions [Updated Regularly]

Posted 2014-4-29 14:59:45
Last edited by 锋云帮主 on 2014-4-29 15:15

This thread collects errors commonly seen in day-to-day development and operations, together with their fixes. It is updated regularly, and contributions are welcome.
Error 1: java.io.IOException: Incompatible clusterIDs — typically appears after the namenode has been reformatted
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every namenode format generates a new clusterID, while the datanode's data directory still holds the ID from the previous format. Formatting clears the namenode's metadata but not the datanode's data, so the datanode fails to start. The rule is to clear all data directories before every format.

Solution: stop the cluster, delete everything under the data directory on the problem node (the directory configured as dfs.data.dir in hdfs-site.xml), then reformat the namenode.

A less disruptive alternative: stop the cluster, then edit /dfs/data/current/VERSION on the datanode and change its clusterID to match the namenode's.
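For reference, a minimal sketch of that VERSION fix as shell commands. The clusterID value comes from the log above; the name/data directory paths are assumptions based on the /data/dfs/data path in the log, so substitute your own dfs.namenode.name.dir and dfs.data.dir:

# on the namenode: read the current clusterID (assumed name dir path)
grep clusterID /data/dfs/name/current/VERSION
# on the affected datanode: stop it, overwrite its clusterID, start it again
sbin/hadoop-daemon.sh stop datanode
sed -i 's/^clusterID=.*/clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb/' /data/dfs/data/current/VERSION
sbin/hadoop-daemon.sh start datanode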

OP | Posted 2015-4-8 14:59:49
2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 41.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

Cause: the namenode metadata (edit log) is corrupted and needs to be repaired.

Solution: recover the namenode:
        hadoop namenode -recover
Answer 'c' (continue) at each prompt; that is usually enough.
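For reference, a hedged sketch of the full recovery sequence; backing up the name directory first is an extra precaution not mentioned above, and /data/dfs/name is an assumed path (use your own dfs.namenode.name.dir):

sbin/stop-dfs.sh                          # make sure HDFS is down
cp -r /data/dfs/name /data/dfs/name.bak   # back up the metadata before recovery touches it (assumed path)
hadoop namenode -recover                  # answer 'c' (continue) at each prompt
sbin/start-dfs.sh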

OP | Posted 2015-4-8 13:39:15
Quoting jerryliu_306 (2015-4-1 15:50):
Brother, how did you solve the last problem? Is there a solution?

Sorry, only just saw this.
As I explained, the problem was simply insufficient disk space: the job produced a lot of intermediate output. Once the job failed, that output was cleaned up automatically, so when I checked afterwards there seemed to be plenty of free space, which is what misled me.

Posted 2015-4-1 15:50:38
Brother, how did you solve the last problem? Is there a solution?

OP | Posted 2014-7-4 08:42:16
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed (stack trace identical to the one above)
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed (stack trace identical to the one above)
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error itself is clear: insufficient disk space. The confusing part was that when I checked each node, disk usage was under 40%, with plenty of free space.

It took quite a while to work out what was happening: one map task produces a large amount of intermediate output. While it runs, disk usage climbs steadily until the disk hits 100% and the task fails. The failed attempt's output is then cleaned up and the task is rescheduled on another node. Because the space has already been released by the time you look, the disk shows plenty of free space even though the job failed for lack of it.

The lesson: monitoring while the job is running matters.
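As a rough illustration of that kind of monitoring, a loop like the one below, run on each worker node while the job executes, would have caught the spike; the local-dir path is a placeholder for whatever yarn.nodemanager.local-dirs points to on your nodes:

# sample local-dir usage every 10 seconds while the job runs (placeholder path)
while true; do
    date
    df -h /data/tmp/nm-local-dir
    sleep 10
done >> /tmp/disk-usage-during-job.log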

OP | Posted 2014-6-19 10:27:52
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

Cause: insufficient local disk space, not HDFS space. In my case I was debugging the program in MyEclipse and the local tmp directory filled up.
Solution: clean up or add disk space.
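A hedged sketch of the cleanup for the local-runner case; the paths assume the stock defaults (hadoop.tmp.dir under /tmp/hadoop-<user>, spill files under mapred/local), so adjust them to your setup:

df -h /tmp                                      # how full is the partition holding the local tmp dir?
du -sh /tmp/hadoop-*/mapred/local 2>/dev/null   # spill output from local-mode runs usually lands here (assumed default layout)
rm -rf /tmp/hadoop-*/mapred/local/*             # clear leftover spill files once no job is running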

OP | Posted 2014-5-8 12:36:37
Last edited by 锋云帮主 on 2014-5-8 14:08
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.
Cause: two possibilities — either hadoop.tmp.dir or the data directory has run out of space.

Solution: checking my DFS status showed data usage under 40%, so the likely culprit was hadoop.tmp.dir running out of space, leaving no room for the job's temporary files. core-site.xml had no hadoop.tmp.dir entry, so the default /tmp was being used; data there is also lost whenever the server reboots, so it needs to be changed anyway. Add:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>
Then reformat the namenode (hadoop namenode -format) and restart.
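A sketch of the steps around that config change, assuming the standard sbin scripts and a cluster running as user hadoop (both assumptions); note that reformatting wipes HDFS metadata, so this is only acceptable on a cluster whose data you can rebuild:

sbin/stop-yarn.sh && sbin/stop-dfs.sh
mkdir -p /data/tmp && chown -R hadoop:hadoop /data/tmp   # assumed service user
hadoop namenode -format
sbin/start-dfs.sh && sbin/start-yarn.sh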

OP | Posted 2014-5-7 14:41:35
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: the write cannot proceed. My environment has 3 datanodes and the replication factor is set to 3, so every write pipeline uses all 3 machines. The default replace-datanode-on-failure policy is DEFAULT: when there are 3 or more datanodes, the client tries to replace a failed pipeline node with another datanode. With only 3 machines there is no spare to swap in, so as soon as one datanode has a problem the write keeps failing.

Solution: add or modify the following two properties in hdfs-site.xml:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.
dfs.client.block.write.replace-datanode-on-failure.policy: with DEFAULT, the client tries to swap in a new datanode when there are 3 or more replicas, while with 2 replicas it keeps writing without replacement. On a cluster with only 3 datanodes, a single unresponsive node is enough to break every write, so it is reasonable to turn replacement off (NEVER).

OP | Posted 2014-5-6 15:43:15
Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: the file operation outlived its lease; in practice, the file was deleted while the data stream was still operating on it.

Solution: edit hdfs-site.xml (for 2.x; in 1.x the property is named dfs.datanode.max.xcievers):
<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
</property>
Copy the file to every datanode and restart the datanodes.

OP | Posted 2014-5-6 15:14:57
Last edited by 锋云帮主 on 2014-5-6 15:55
Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.

Solution: edit hdfs-site.xml and add the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:
    <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>6000000</value>
    </property>
    <property>
        <name>dfs.socket.timeout</name>
        <value>6000000</value>
    </property>

Note: the timeout values are in milliseconds; 0 means no limit.

Comment (2014-5-9 16:50): setting it to 0 still produces this error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel

OP | Posted 2014-4-29 17:18:44
Last edited by 锋云帮主 on 2014-4-29 17:28
Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: the clocks on the namenode and datanodes are out of sync.

Solution: synchronize every datanode's clock with the namenode's. Run on each server: ntpdate time.nist.gov, and confirm the sync succeeded.
Better still, add a line to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
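A quick hedged check that the clocks actually agree after the sync; the host list is a placeholder for your own nodes:

for host in hadoop-master hadoop-datanode1 hadoop-datanode2; do   # placeholder host list
    echo -n "$host: "; ssh $host date +%s    # epoch seconds should match to within a second or two
done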
