Java Memory Analysis Tools in Practice: A Spark Client Thread Deadlock

💡 Overview: The previous article introduced several memory analysis tools, which have helped resolve many production issues in my day-to-day work. This article uses jstack to diagnose a thread deadlock in the Spark driver that caused a job to hang.

1. Background

While executing a SQL statement, a Spark job hung and made no progress.

2. Analysis

For a hung job, the first step is to check CPU and memory usage; both looked normal here.

The next step was to check whether the main thread was stuck, so I took a thread dump with jstack (`jstack <driver-pid>`) and found a thread deadlock:

Found one Java-level deadlock:
=============================
"DataStreamer for file /tmp/spark-events/application_1622622430053_5709820.snappy.inprogress block BP-1460625454-10.90.128.66-1594265842561:blk_3130778014_2091994216":
  waiting to lock monitor 0x00007fa4f4008cf8 (object 0x0000000081f1e548, a org.apache.hadoop.ipc.Client$Connection),
  which is held by "LeaseRenewer:datadmonedata_bu@redpoll"
"LeaseRenewer:datadmonedata_bu@redpoll":
  waiting to lock monitor 0x00007fa756a44878 (object 0x000000008041e3c8, a org.apache.hadoop.security.UserGroupInformation),
  which is held by "main"
"main":
  waiting to lock monitor 0x00007fa4fc003f08 (object 0x000000008041e3e0, a javax.security.auth.Subject),
  which is held by "LeaseRenewer:datadmonedata_bu@redpoll"

Java stack information for the threads listed above:
===================================================
"DataStreamer for file /tmp/spark-events/application_1622622430053_5709820.snappy.inprogress block BP-1460625454-10.90.128.66-1594265842561:blk_3130778014_2091994216":
    at org.apache.hadoop.ipc.Client$Connection.addCall(Client.java:458)
	- waiting to lock <0x0000000081f1e548> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:370)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1513)
    at org.apache.hadoop.ipc.Client.call(Client.java:1442)
    at org.apache.hadoop.ipc.Client.call(Client.java:1403)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at com.sun.proxy.$Proxy14.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:434)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy15.getAdditionalDatanode(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1221)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1375)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
"LeaseRenewer:datadmonedata_bu@redpoll":
	at org.apache.hadoop.security.UserGroupInformation.getCredentialsInternal(UserGroupInformation.java:1583)
	- waiting to lock <0x000000008041e3c8> (a org.apache.hadoop.security.UserGroupInformation)
	at org.apache.hadoop.security.UserGroupInformation.getTokens(UserGroupInformation.java:1548)
	- locked <0x000000008041e3e0> (a javax.security.auth.Subject)
	at org.apache.hadoop.security.SaslRpcClient.getServerToken(SaslRpcClient.java:276)
	at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:219)
	at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
    - locked <0x0000000081f1e548> (a org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
    - locked <0x0000000081f1e548> (a org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519)
	at org.apache.hadoop.ipc.Client.call(Client.java:1442)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy14.renewLease(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:581)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy15.renewLease(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:919)
    at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
    at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
    at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
	at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
    at java.lang.Thread.run(Thread.java:748)
"main":
    at org.apache.hadoop.security.UserGroupInformation.getTokens(UserGroupInformation.java:1548)
    - waiting to lock <0x000000008041e3e0> (a javax.security.auth.Subject)
    at org.apache.hadoop.security.SaslRpcClient.getServerToken(SaslRpcClient.java:276)
    at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:219)
    at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
	- locked <0x0000000081f1eb10> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
	- locked <0x0000000081f1eb10> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519)
    at org.apache.hadoop.ipc.Client.call(Client.java:1442)
    at org.apache.hadoop.ipc.Client.call(Client.java:1403)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1256)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1243)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1231)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:302)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:268)
	- locked <0x0000000081f1edf8> (a java.lang.Object)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:260)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1562)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:308)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:304)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:775)
	at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)
	at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
	at sun.net.www.protocol.jar.URLJarFile.retrieve(URLJarFile.java:214)
	at sun.net.www.protocol.jar.URLJarFile.getJarFile(URLJarFile.java:71)
	at sun.net.www.protocol.jar.JarFileFactory.get(JarFileFactory.java:84)
	at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:122)
	at sun.net.www.protocol.jar.JarURLConnection.getJarFile(JarURLConnection.java:89)
	at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:934)
    at sun.misc.URLClassPath$JarLoader.access$800(URLClassPath.java:791)
    at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:876)
    at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:869)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:868)
	at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:819)
    at sun.misc.URLClassPath$3.run(URLClassPath.java:565)
	at sun.misc.URLClassPath$3.run(URLClassPath.java:555)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.misc.URLClassPath.getLoader(URLClassPath.java:554)
    at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
    - locked <0x00000000814a34a8> (a sun.misc.URLClassPath)
    at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:484)
    - locked <0x00000000814a34a8> (a sun.misc.URLClassPath)
    at sun.misc.URLClassPath.access$100(URLClassPath.java:65)
	at sun.misc.URLClassPath$1.next(URLClassPath.java:266)
    at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
    at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
	at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
	at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
	at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration$1.run(VersionHelper12.java:226)
	at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration$1.run(VersionHelper12.java:224)
	at java.security.AccessController.doPrivileged(Native Method)
	at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration.getNextElement(VersionHelper12.java:223)
    at com.sun.naming.internal.VersionHelper12$InputStreamEnumeration.hasMore(VersionHelper12.java:243)
	at com.sun.naming.internal.ResourceManager.getApplicationResources(ResourceManager.java:561)
	- locked <0x0000000081f1f000> (a java.util.WeakHashMap)
	at com.sun.naming.internal.ResourceManager.getInitialEnvironment(ResourceManager.java:244)
	at javax.naming.InitialContext.init(InitialContext.java:240)
	at javax.naming.InitialContext.<init>(InitialContext.java:216)
	at javax.naming.directory.InitialDirContext.<init>(InitialDirContext.java:101)
	at org.apache.hadoop.security.LdapGroupsMapping.getDirContext(LdapGroupsMapping.java:310)
	at org.apache.hadoop.security.LdapGroupsMapping.doGetGroups(LdapGroupsMapping.java:240)
	at org.apache.hadoop.security.LdapGroupsMapping.getGroups(LdapGroupsMapping.java:209)
	- locked <0x00000000804b6840> (a org.apache.hadoop.security.LdapGroupsMapping)
	at org.apache.hadoop.security.Groups$GroupCacheLoader.fetchGroupList(Groups.java:239)
    at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:220)
	at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:208)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	- locked <0x0000000081f1f150> (a com.google.common.cache.LocalCache$StrongWriteEntry)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at org.apache.hadoop.security.Groups.getGroups(Groups.java:182)
    at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1602)
    - locked <0x000000008041e3c8> (a org.apache.hadoop.security.UserGroupInformation)
    at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:64)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:439)
    at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:715)
    at org.apache.hadoop.hive.ql.session.SessionState.getAuthorizationMode(SessionState.java:1504)
    at org.apache.hadoop.hive.ql.session.SessionState.isAuthorizationModeV2(SessionState.java:1515)
    at org.apache.hadoop.hive.ql.processors.CommandUtil.authorizeCommand(CommandUtil.java:55)
    at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:60)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:874)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:843)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
	- locked <0x00000000814a2a58> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
	at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:843)
	at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:833)
	at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:991)
	at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:115)
	at org.apache.spark.sql.internal.SessionResourceLoader.loadResource(SessionState.scala:142)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:1169)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:1169)
  // ... intermediate frames omitted
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:58)
    - locked <0x0000000081f203c0> (a org.apache.spark.sql.execution.QueryExecution)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:56)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:644)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:70)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:393)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
	at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409)
	at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:201)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:866)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:941)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:950)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Found 1 deadlock.
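
jstack detects and prints this cycle automatically. The same check can also be done programmatically through the JDK's ThreadMXBean, which is handy when a process needs to monitor itself (a minimal sketch; the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // findDeadlockedThreads() returns null when no threads are deadlocked
    long[] ids = mx.findDeadlockedThreads();
    if (ids == null) {
      System.out.println("No deadlocked threads.");
      return;
    }
    // Print name, state and lock-owner information for each deadlocked thread
    for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
      System.out.print(info);
    }
  }
}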

Interpreting the stack traces:

During startup, the main thread locked the UserGroupInformation monitor (inside getGroupNames) and then blocked in org.apache.hadoop.security.UserGroupInformation.getTokens, waiting for the Subject monitor to be released.

Meanwhile, the LeaseRenewer thread had locked the Subject first (inside getTokens) and then blocked in UserGroupInformation.getCredentialsInternal, waiting for the UserGroupInformation monitor. (The DataStreamer thread is also blocked, on a Client$Connection held by LeaseRenewer, but it is a downstream victim rather than part of the cycle.)

These two monitors, acquired in opposite orders, form the deadlock, illustrated below:

(Figure: main holds the UserGroupInformation monitor and waits for the Subject; LeaseRenewer holds the Subject and waits for the UserGroupInformation monitor, closing the cycle.)
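
This is the classic lock-order inversion. The following minimal, self-contained Java sketch (all names are made up, not taken from the Hadoop code) reproduces the same pattern; running it and taking a jstack dump produces the same "Found one Java-level deadlock" report:

public class DeadlockDemo {
  private static final Object ugiLock = new Object();     // plays the role of UserGroupInformation
  private static final Object subjectLock = new Object(); // plays the role of javax.security.auth.Subject

  public static void main(String[] args) {
    // Like "main": lock UGI first, then wait for the Subject
    Thread a = new Thread(() -> {
      synchronized (ugiLock) {
        sleepQuietly(100);
        synchronized (subjectLock) {
          System.out.println("main-like finished");
        }
      }
    }, "main-like");
    // Like "LeaseRenewer": lock the Subject first, then wait for UGI
    Thread b = new Thread(() -> {
      synchronized (subjectLock) {
        sleepQuietly(100);
        synchronized (ugiLock) {
          System.out.println("leaserenewer-like finished");
        }
      }
    }, "leaserenewer-like");
    a.start();
    b.start();
    // With the sleeps, each thread acquires its first lock before trying the
    // second, so both block forever and neither message is printed.
  }

  private static void sleepQuietly(long millis) {
    try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
  }
}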

One way to resolve a deadlock is to break one of its necessary conditions. As shown below, if UserGroupInformation.getGroupNames is changed so that it no longer has to acquire the UserGroupInformation monitor, the circular wait disappears and the deadlock cannot occur:

(Figure: with getGroupNames no longer contending for the UserGroupInformation lock, the wait cycle is broken.)

Looking at UserGroupInformation.getGroupNames, its synchronized modifier exists to make the call to groups.getGroups thread-safe:

public synchronized String[] getGroupNames() {
    // Lazily initialize the groups field if it has not been created yet
    ensureInitialized();
    try {
      // Look up the set of groups that the current user belongs to
      Set<String> result = new LinkedHashSet<String>
        (groups.getGroups(getShortUserName()));
      return result.toArray(new String[result.size()]);
    } catch (IOException ie) {
      LOG.warn("No groups available for user " + getShortUserName());
      return new String[0];
    }
  }

The reason is that staticUserToGroupsMap inside Groups is a plain HashMap, which is not thread-safe:

private final Map<String, List<String>> staticUserToGroupsMap =
      new HashMap<String, List<String>>();

public List<String> getGroups(final String user) throws IOException {
    // No need to lookup for groups of static users
    List<String> staticMapping = staticUserToGroupsMap.get(user);
    if (staticMapping != null) {
      return staticMapping;
    }
    ...
}

To remove the synchronized lock while still keeping the call thread-safe, the HashMap can be held behind an AtomicReference: writers build a new map and publish it atomically through the reference, so readers always observe a consistent map without any locking.
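
The pattern is copy-on-write publication, sketched below (a simplified illustration, not the actual patch; class and method names are hypothetical). Readers call get() on the reference with no lock, while writers copy the current map, modify the copy, and publish it atomically with set():

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class CopyOnWriteGroupsMap {
  // The reference always points at a fully constructed map, so readers
  // can never observe a partially updated HashMap.
  private final AtomicReference<Map<String, List<String>>> mapRef =
      new AtomicReference<>(new HashMap<>());

  // Lock-free read: sees either the old map or the new one, never a mix.
  public List<String> get(String user) {
    return mapRef.get().get(user);
  }

  // Write: copy the current map, modify the copy, publish it atomically.
  public void put(String user, List<String> groups) {
    Map<String, List<String>> updated = new HashMap<>(mapRef.get());
    updated.put(user, groups);
    mapRef.set(updated);
  }
}

One caveat: if several writers can race, a plain set() may drop a concurrent update; a compareAndSet retry loop would be needed in that case. Here the static mapping is only rewritten during refresh, so set() suffices.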

3. Implementation

Add an AtomicReference field to the Groups class:

(Figure: the new AtomicReference field in Groups.)

Initialize it:

(Figure: initialization of the AtomicReference.)

With that in place, getGroupNames can drop synchronized, and the member variable is updated through the AtomicReference instead:

(Figure: getGroupNames with the synchronized modifier removed.)
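
Putting the steps together, the end state of getGroupNames is simply the method quoted in section 2 without the synchronized modifier, since Groups is now internally thread-safe (a sketch of the result):

public String[] getGroupNames() {
    ensureInitialized();
    try {
      // Safe to call without holding the UGI monitor: Groups reads its
      // static mapping through the AtomicReference shown above
      Set<String> result = new LinkedHashSet<String>
        (groups.getGroups(getShortUserName()));
      return result.toArray(new String[result.size()]);
    } catch (IOException ie) {
      LOG.warn("No groups available for user " + getShortUserName());
      return new String[0];
    }
  }

Because getGroupNames no longer acquires the UserGroupInformation monitor, the main thread in the dump above would never hold that lock while waiting for the Subject, so the cycle cannot form.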

4. Unit Tests

First, test the basic usage of UGI.getGroups:

  @Test (timeout = 30000)
  public void testGettingGroups() throws Exception {
    UserGroupInformation uugi =
      UserGroupInformation.createUserForTesting(USER_NAME, GROUP_NAMES);
    assertEquals(USER_NAME, uugi.getUserName());
    String[] expected = new String[]{GROUP1_NAME, GROUP2_NAME, GROUP3_NAME};
    assertArrayEquals(expected, uugi.getGroupNames());
    assertArrayEquals(expected, uugi.getGroups().toArray(new String[0]));
    assertEquals(GROUP1_NAME, uugi.getPrimaryGroupName());
  }

Result:

(Figure: test run output.)

No exception was thrown, so the test passes.

Cache test:

  @Test
  public void testGroupsCaching() throws Exception {
    // Disable negative cache.
    conf.setLong(
        CommonConfigurationKeys.HADOOP_SECURITY_GROUPS_NEGATIVE_CACHE_SECS, 0);
    Groups groups = new Groups(conf);
    groups.cacheGroupsAdd(Arrays.asList(myGroups));
    groups.refresh();
    FakeGroupMapping.clearBlackList();
    FakeGroupMapping.addToBlackList("user1");

    // regular entry
    // groups.getGroups("me") loads the groups from FakeGroupMapping into the cache
    assertTrue(groups.getGroups("me").size() == 2);

    // this must be cached. blacklisting should have no effect.
    FakeGroupMapping.addToBlackList("me");
    // the groups for "me" are already cached, so blacklisting now has no effect
    assertTrue(groups.getGroups("me").size() == 2);

    // ask for a negative entry
    try {
      // user1 is blacklisted and not yet cached, so its groups cannot be loaded
      // from FakeGroupMapping and the lookup throws "No groups found"
      LOG.error("We are not supposed to get here." + groups.getGroups("user1").toString());
      fail();
    } catch (IOException ioe) {
      if(!ioe.getMessage().startsWith("No groups found")) {
        LOG.error("Got unexpected exception: " + ioe.getMessage());
        fail();
      }
    }

    // this shouldn't be cached. remove from the black list and retry.
    // remove user1 from the blacklist
    FakeGroupMapping.clearBlackList();
    // user1's groups can be looked up and cached again
    assertTrue(groups.getGroups("user1").size() == 2);
  }

Result:

(Figure: cache test output.)

This shows that the cache still works correctly after the code change.

Cache deduplication test:

  @Test
  public void testGroupsCachingDedup() throws Exception {
    // Disable negative cache.
    conf.setLong(
            CommonConfigurationKeys.HADOOP_SECURITY_GROUPS_NEGATIVE_CACHE_SECS, 0);
    Groups groups = new Groups(conf);
    String[] myGroups = {"grp1", "grp2", "grp2", "grp1"};
    groups.cacheGroupsAdd(Arrays.asList(myGroups));
    groups.refresh();
    List<String> groupInfo = groups.getGroups("user1");
    LOG.info(groupInfo.toString());
    assertTrue(groups.getGroups("user1").size() == 2);
  }

Result:

(Figure: deduplication test output.)

The cache deduplication logic works as expected; with this, the whole code change is verified.

5. Lessons Learned

When a deadlock appears, consider reducing the lock granularity. In this case, a method-level synchronized lock was replaced with an atomic update of a single member variable, which broke the lock-ordering cycle.
