💡 Summary: the previous article introduced several memory-analysis tools, which have helped resolve many production issues. This article uses jstack to pin down a thread deadlock in the Spark driver that left a job hanging.
1. Background
A Spark job hung while executing SQL.
2. Analysis
When a job hangs, first check CPU and memory usage; here both looked normal.
The next step is to see whether the main thread is stuck, so we dump the thread states with jstack. The dump reveals a thread deadlock:
Found one Java-level deadlock:
=============================
"DataStreamer for file /tmp/spark-events/application_1622622430053_5709820.snappy.inprogress block BP-1460625454-10.90.128.66-1594265842561:blk_3130778014_2091994216":
waiting to lock monitor 0x00007fa4f4008cf8 (object 0x0000000081f1e548, a org.apache.hadoop.ipc.Client$Connection),
which is held by "LeaseRenewer:datadmonedata_bu@redpoll"
"LeaseRenewer:datadmonedata_bu@redpoll":
waiting to lock monitor 0x00007fa756a44878 (object 0x000000008041e3c8, a org.apache.hadoop.security.UserGroupInformation),
which is held by "main"
"main":
waiting to lock monitor 0x00007fa4fc003f08 (object 0x000000008041e3e0, a javax.security.auth.Subject),
which is held by "LeaseRenewer:datadmonedata_bu@redpoll"
Java stack information for the threads listed above:
===================================================
"DataStreamer for file /tmp/spark-events/application_1622622430053_5709820.snappy.inprogress block BP-1460625454-10.90.128.66-1594265842561:blk_3130778014_2091994216":
at org.apache.hadoop.ipc.Client$Connection.addCall(Client.java:458)
- waiting to lock <0x0000000081f1e548> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:370)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1513)
at org.apache.hadoop.ipc.Client.call(Client.java:1442)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getAdditionalDatanode(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:434)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy15.getAdditionalDatanode(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1221)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1375)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
"LeaseRenewer:datadmonedata_bu@redpoll":
at org.apache.hadoop.security.UserGroupInformation.getCredentialsInternal(UserGroupInformation.java:1583)
- waiting to lock <0x000000008041e3c8> (a org.apache.hadoop.security.UserGroupInformation)
at org.apache.hadoop.security.UserGroupInformation.getTokens(UserGroupInformation.java:1548)
- locked <0x000000008041e3e0> (a javax.security.auth.Subject)
at org.apache.hadoop.security.SaslRpcClient.getServerToken(SaslRpcClient.java:276)
at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:219)
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
- locked <0x0000000081f1e548> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
- locked <0x0000000081f1e548> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519)
at org.apache.hadoop.ipc.Client.call(Client.java:1442)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:581)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy15.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:919)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
at java.lang.Thread.run(Thread.java:748)
"main":
at org.apache.hadoop.security.UserGroupInformation.getTokens(UserGroupInformation.java:1548)
- waiting to lock <0x000000008041e3e0> (a javax.security.auth.Subject)
at org.apache.hadoop.security.SaslRpcClient.getServerToken(SaslRpcClient.java:276)
at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:219)
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
- locked <0x0000000081f1eb10> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
- locked <0x0000000081f1eb10> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519)
at org.apache.hadoop.ipc.Client.call(Client.java:1442)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1256)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1243)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1231)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:302)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:268)
- locked <0x0000000081f1edf8> (a java.lang.Object)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:260)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1562)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:308)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:304)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:775)
at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)
at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
at sun.net.www.protocol.jar.URLJarFile.retrieve(URLJarFile.java:214)
at sun.net.www.protocol.jar.URLJarFile.getJarFile(URLJarFile.java:71)
at sun.net.www.protocol.jar.JarFileFactory.get(JarFileFactory.java:84)
at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:122)
at sun.net.www.protocol.jar.JarURLConnection.getJarFile(JarURLConnection.java:89)
at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:934)
at sun.misc.URLClassPath$JarLoader.access$800(URLClassPath.java:791)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:876)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:869)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:868)
at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:819)
at sun.misc.URLClassPath$3.run(URLClassPath.java:565)
at sun.misc.URLClassPath$3.run(URLClassPath.java:555)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:554)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
- locked <0x00000000814a34a8> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:484)
- locked <0x00000000814a34a8> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.access$100(URLClassPath.java:65)
at sun.misc.URLClassPath$1.next(URLClassPath.java:266)
at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration$1.run(VersionHelper12.java:226)
at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration$1.run(VersionHelper12.java:224)
at java.security.AccessController.doPrivileged(Native Method)
at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration.getNextElement(VersionHelper12.java:223)
at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration.hasMore(VersionHelper12.java:243)
at com.sun.naming.internal.ResourceManager.getApplicationResources(ResourceManager.java:561)
- locked <0x0000000081f1f000> (a java.util.WeakHashMap)
at com.sun.naming.internal.ResourceManager.getInitialEnvironment(ResourceManager.java:244)
at javax.naming.InitialContext.init(InitialContext.java:240)
at javax.naming.InitialContext.<init>(InitialContext.java:216)
at javax.naming.directory.InitialDirContext.<init>(InitialDirContext.java:101)
at org.apache.hadoop.security.LdapGroupsMapping.getDirContext(LdapGroupsMapping.java:310)
at org.apache.hadoop.security.LdapGroupsMapping.doGetGroups(LdapGroupsMapping.java:240)
at org.apache.hadoop.security.LdapGroupsMapping.getGroups(LdapGroupsMapping.java:209)
- locked <0x00000000804b6840> (a org.apache.hadoop.security.LdapGroupsMapping)
at org.apache.hadoop.security.Groups$GroupCacheLoader.fetchGroupList(Groups.java:239)
at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:220)
at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:208)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
- locked <0x0000000081f1f150> (a com.google.common.cache.LocalCache$StrongWriteEntry)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:182)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1602)
- locked <0x000000008041e3c8> (a org.apache.hadoop.security.UserGroupInformation)
at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:64)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:439)
at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:715)
at org.apache.hadoop.hive.ql.session.SessionState.getAuthorizationMode(SessionState.java:1504)
at org.apache.hadoop.hive.ql.session.SessionState.isAuthorizationModeV2(SessionState.java:1515)
at org.apache.hadoop.hive.ql.processors.CommandUtil.authorizeCommand(CommandUtil.java:55)
at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:60)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:874)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:843)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
- locked <0x00000000814a2a58> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:843)
at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:833)
at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:991)
at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:115)
at org.apache.spark.sql.internal.SessionResourceLoader.loadResource(SessionState.scala:142)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:1169)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:1169)
// ... (intermediate frames omitted)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:58)
- locked <0x0000000081f203c0> (a org.apache.spark.sql.execution.QueryExecution)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:56)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:644)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:70)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:393)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:201)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:866)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:941)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:950)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Found 1 deadlock.
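For reference, a dump like the one above can be captured and triaged from the shell. The process name and file names below are illustrative, and the heredoc merely simulates a saved dump so the snippet is self-contained:

```shell
# Find the driver JVM (SparkSQLCLIDriver in this incident) and dump its threads:
#   jstack $(jps -lm | grep SparkSQLCLIDriver | awk '{print $1}') > driver.jstack
# jstack appends its deadlock analysis to the end of the dump, so grepping for
# the marker line tells you at once whether the hang is a deadlock.
# Simulated dump file standing in for real jstack output:
cat > driver.jstack <<'EOF'
"main": waiting to lock monitor 0x00007fa4fc003f08 ...
Found one Java-level deadlock:
EOF
grep -c "Found one Java-level deadlock" driver.jstack   # prints 1 when a deadlock was detected
```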
Reading the stack trace:
The main thread locked UserGroupInformation during startup, and then, while executing org.apache.hadoop.security.UserGroupInformation.getTokens, waited for the Subject lock to be released.
Meanwhile, the LeaseRenewer thread had already locked Subject, and while executing UserGroupInformation.getCredentialsInternal it waited for the UserGroupInformation lock.
These two objects form the deadlock cycle:
main --holds--> UserGroupInformation <--waits-- LeaseRenewer
main --waits--> Subject              <--holds-- LeaseRenewer
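The cycle can be reproduced in miniature. The sketch below is hypothetical (plain Objects stand in for UserGroupInformation and Subject); it forces the same opposite-order lock acquisition and then asks the JVM for the deadlock analysis that jstack prints:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical miniature of the cycle above: two monitors stand in for the
// UserGroupInformation and Subject objects, and two threads acquire them in
// opposite orders. The latch guarantees each thread holds its first monitor
// before either tries the second, so the deadlock always occurs.
public class DeadlockDemo {
    static final Object ugi = new Object();     // stands in for UserGroupInformation
    static final Object subject = new Object(); // stands in for Subject

    static void spawn(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try { latch.await(); } catch (InterruptedException ignored) { }
                synchronized (second) { } // blocks forever: the other thread holds it
            }
        }, name);
        t.setDaemon(true); // let the JVM exit even though the threads never finish
        t.start();
    }

    // Returns the number of deadlocked threads the JVM detects -- the same
    // analysis that produces jstack's "Found one Java-level deadlock" section.
    static int detectDeadlock() throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(2);
        spawn("main-like", ugi, subject, latch);
        spawn("renewer-like", subject, ugi, latch);
        Thread.sleep(500); // give both threads time to block on their second monitor
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(detectDeadlock()); // prints 2
    }
}
```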
One way to resolve a deadlock is to break one of its necessary conditions. Here, the idea is to optimize the UserGroupInformation.getGroupNames method so that it no longer has to hold the UserGroupInformation lock; with that lock out of the picture there is no circular wait, and the deadlock cannot occur.
Looking at UserGroupInformation.getGroupNames, it is declared synchronized to keep the groups.getGroups call thread-safe:
public synchronized String[] getGroupNames() {
  // Lazily create the shared Groups instance if it has not been initialized yet
  ensureInitialized();
  try {
    // Look up the groups this user belongs to via the Groups service
    Set<String> result = new LinkedHashSet<String>
        (groups.getGroups(getShortUserName()));
    return result.toArray(new String[result.size()]);
  } catch (IOException ie) {
    LOG.warn("No groups available for user " + getShortUserName());
    return new String[0];
  }
}
The lock is needed because staticUserToGroupsMap inside Groups is a plain HashMap, which is not thread-safe:
private final Map<String, List<String>> staticUserToGroupsMap =
    new HashMap<String, List<String>>();

public List<String> getGroups(final String user) throws IOException {
  // No need to look up groups of static users
  List<String> staticMapping = staticUserToGroupsMap.get(user);
  if (staticMapping != null) {
    return staticMapping;
  }
  ...
}
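To see why unsynchronized HashMap access is dangerous, here is a small standalone demonstration (not Hadoop code): several threads write to a shared map concurrently. With ConcurrentHashMap every write survives; with a plain HashMap, updates are routinely lost or the table is corrupted, which is why the original code guards the map with synchronized:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tiny demonstration of why unsynchronized multi-threaded HashMap access is
// unsafe: four threads insert distinct keys concurrently. A ConcurrentHashMap
// always ends up with all 40_000 entries; passing a plain java.util.HashMap
// to fill() instead typically loses entries or corrupts the table.
public class HashMapRaceDemo {
    static Map<String, Integer> fill(Map<String, Integer> map) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    map.put(id + ":" + i, i); // concurrent writes race on a plain HashMap
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        return map;
    }

    public static void main(String[] args) throws InterruptedException {
        // Only the thread-safe variant is run here; the HashMap variant's
        // failure mode is nondeterministic (lost updates, or worse).
        System.out.println(fill(new ConcurrentHashMap<>()).size()); // prints 40000
    }
}
```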
To drop the synchronized keyword while keeping thread safety, the HashMap can be held behind an AtomicReference: readers dereference the current map without locking, and writers publish a replacement map atomically.
3. Implementation
Add an AtomicReference field to the Groups class:
Initialize it:
With that in place, getGroupNames no longer needs synchronized; the member is read and updated through the AtomicReference instead:
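The actual patch appears as screenshots in the original post; the sketch below is a minimal, hypothetical rendering of the same copy-on-write idea (class and method names are invented, not the real Hadoop code). Readers call get() on the AtomicReference with no lock; writers copy the current map, modify the copy, and publish it with compareAndSet:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the copy-on-write pattern (illustrative names, not the
// actual Hadoop patch): the map itself is never mutated after publication,
// so readers need no lock at all.
public class StaticUserGroupStore {
    private final AtomicReference<Map<String, List<String>>> staticUserToGroupsMap =
        new AtomicReference<>(Collections.emptyMap());

    // Lock-free read path, replacing the synchronized method
    public List<String> getGroups(String user) {
        List<String> groups = staticUserToGroupsMap.get().get(user);
        return groups != null ? groups : Collections.emptyList();
    }

    // Writers copy the current map, mutate the copy, and publish it atomically;
    // compareAndSet retries if another writer published in the meantime
    public void putGroups(String user, List<String> groups) {
        Map<String, List<String>> current, updated;
        do {
            current = staticUserToGroupsMap.get();
            updated = new HashMap<>(current);
            updated.put(user, groups);
        } while (!staticUserToGroupsMap.compareAndSet(current, updated));
    }

    public static void main(String[] args) {
        StaticUserGroupStore store = new StaticUserGroupStore();
        store.putGroups("alice", Arrays.asList("grp1", "grp2"));
        System.out.println(store.getGroups("alice")); // prints [grp1, grp2]
    }
}
```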
4. Unit tests
Test the basic usage of UGI.getGroups:
@Test (timeout = 30000)
public void testGettingGroups() throws Exception {
  UserGroupInformation uugi =
      UserGroupInformation.createUserForTesting(USER_NAME, GROUP_NAMES);
  assertEquals(USER_NAME, uugi.getUserName());
  String[] expected = new String[]{GROUP1_NAME, GROUP2_NAME, GROUP3_NAME};
  assertArrayEquals(expected, uugi.getGroupNames());
  assertArrayEquals(expected, uugi.getGroups().toArray(new String[0]));
  assertEquals(GROUP1_NAME, uugi.getPrimaryGroupName());
}
Result: no exception is thrown, so the test passes.
Cache test:
@Test
public void testGroupsCaching() throws Exception {
  // Disable negative cache.
  conf.setLong(
      CommonConfigurationKeys.HADOOP_SECURITY_GROUPS_NEGATIVE_CACHE_SECS, 0);
  Groups groups = new Groups(conf);
  groups.cacheGroupsAdd(Arrays.asList(myGroups));
  groups.refresh();
  FakeGroupMapping.clearBlackList();
  FakeGroupMapping.addToBlackList("user1");

  // regular entry: groups.getGroups("me") loads the groups from
  // FakeGroupMapping into the cache
  assertTrue(groups.getGroups("me").size() == 2);

  // this must be cached. blacklisting should have no effect.
  FakeGroupMapping.addToBlackList("me");
  // the cache already holds groups for "me", so they are still served
  // after the user is blacklisted
  assertTrue(groups.getGroups("me").size() == 2);

  // ask for a negative entry
  try {
    // "user1" is already blacklisted, so its groups cannot be loaded from
    // FakeGroupMapping into the cache and the lookup throws "No groups found"
    LOG.error("We are not supposed to get here." + groups.getGroups("user1").toString());
    fail();
  } catch (IOException ioe) {
    if (!ioe.getMessage().startsWith("No groups found")) {
      LOG.error("Got unexpected exception: " + ioe.getMessage());
      fail();
    }
  }

  // this shouldn't be cached. remove from the black list and retry.
  FakeGroupMapping.clearBlackList();
  // "user1" can be cached again
  assertTrue(groups.getGroups("user1").size() == 2);
}
Result: the cache still works correctly after the change.
Cache deduplication test:
@Test
public void testGroupsCachingDedup() throws Exception {
  // Disable negative cache.
  conf.setLong(
      CommonConfigurationKeys.HADOOP_SECURITY_GROUPS_NEGATIVE_CACHE_SECS, 0);
  Groups groups = new Groups(conf);
  String[] myGroups = {"grp1", "grp2", "grp2", "grp1"};
  groups.cacheGroupsAdd(Arrays.asList(myGroups));
  groups.refresh();
  List<String> groupInfo = groups.getGroups("user1");
  LOG.info(groupInfo.toString());
  assertTrue(groups.getGroups("user1").size() == 2);
}
Result: the deduplication path works as well; with that, the whole change is verified.
5. Takeaways
When you hit a deadlock, consider reducing the granularity of the locks involved. In this case, a method-level synchronized lock was narrowed down to a lock on a single member variable.