Kubernetes Scheduler: Node Affinity

The scheduler

So far everything we have covered has revolved around pods.

In Kubernetes, all resource objects fall into categories, and one category that Kubernetes itself defines is the workload resources: the resource objects that actually run our programs. Put plainly, that means pods and the various pod controllers.

Workload resource objects: pods and the various pod controllers.

Today's topic, the scheduler, performs workload scheduling. Ultimately that boils down to scheduling pods: getting a pod to run on the node you intend.

[Figure: Kubernetes cluster architecture and components]

On the left of the figure is the controller manager (controller-manager), which we lean on heavily: each of the resource objects we have covered is in fact managed by a controller inside it. Besides that there are:

kube-proxy

Both the master and the worker nodes run a component called kube-proxy. As we saw when covering Services, kube-proxy's main job is to generate the rules behind a Service: when we create a Service (a resource object we can create and query like any other), what actually gets produced underneath is a set of forwarding rules.

Those rules are IPVS rules:

[root@k8s-master1 ~]# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.8.0:30447 rr
  -> 192.26.131.133:8443          Masq    1      0          0         
TCP  192.26.159.128:30447 rr
  -> 192.26.131.133:8443          Masq    1      0          0         
TCP  10.96.0.1:443 rr
  -> 172.17.8.0:6443              Masq    1      0          0         
TCP  10.96.0.10:53 rr
  -> 192.28.252.196:53            Masq    1      0          0         
  -> 192.28.252.198:53            Masq    1      0          0         
TCP  10.96.0.10:9153 rr
  -> 192.28.252.196:9153          Masq    1      0          0         
  -> 192.28.252.198:9153          Masq    1      0          0         
TCP  10.97.38.57:443 rr
  -> 192.26.131.132:4443          Masq    1      3          0         
TCP  10.99.149.237:8000 rr
  -> 192.26.131.134:8000          Masq    1      0          0         
TCP  10.102.107.62:443 rr
  -> 192.26.131.133:8443          Masq    1      0          0         
TCP  10.103.202.101:5473 rr
  -> 172.17.8.2:5473              Masq    1      0          0         
UDP  10.96.0.10:53 rr
  -> 192.28.252.196:53            Masq    1      0          0         
  -> 192.28.252.198:53            Masq    1      0          0

What we use are IPVS rules (a.k.a. LVS rules); when we installed the cluster we switched the kube-proxy mode to ipvs.
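On a kubeadm-installed cluster you can confirm the mode from the kube-proxy ConfigMap; a quick check (output illustrative, assuming the standard kubeadm layout):

[root@k8s-master1 ~]# kubectl -n kube-system get configmap kube-proxy -o yaml | grep -w mode
    mode: ipvs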

Creating a Service essentially creates a VIP: it gets an IP address, called the cluster IP or service IP. Normally a Service has one, except for the special headless Services.

How does traffic to this IP reach the backend pods? When we request the Service, why does the request get dispatched to a backend pod?

Because the rules behind it are IPVS rules.

The service IP is analogous to the LVS VIP, and the corresponding pod IPs are the real servers.

kube-proxy exists to produce these LVS rules for us.

Generating rules like these is ultimately what the kube-proxy component is for.

To recap: we create a Service; how does a request get from the Service to the backend?
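You can see the mapping for yourself by comparing a Service's cluster IP with its Endpoints. For example, the kubernetes API Service corresponds to the 10.96.0.1:443 entry in the ipvsadm output above (output illustrative):

[root@k8s-master1 ~]# kubectl get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   12h
[root@k8s-master1 ~]# kubectl get endpoints kubernetes
NAME         ENDPOINTS         AGE
kubernetes   172.17.8.0:6443   12h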

controller-manager

controller-manager is what manages all the various resource controllers; hence the name controller manager.

API Server

Every component in a Kubernetes cluster communicates with the API Server.

Think of the API Server as the cluster's gateway: any component in the cluster talks through it, so it acts as the central hub or entry point, much like the router you must go through to get online (a forward proxy on the way out, a reverse proxy when others access you).

etcd

The etcd cluster serves as the storage backend, and only the API Server talks to it.

kubelet

The worker nodes also run the kubelet.

The kubelet acts as the agent on each node.

Today's main topic is the scheduler.

scheduler

Its job: bind a workload (a pod) to a particular node in the cluster. Only once that binding succeeds does the kubelet call the container runtime (docker/containerd) to actually create the pod.

Let's look at how the scheduler works in practice and how it binds pods onto nodes.

[Figure: scheduler workflow, pod scheduling queue and node list]

In the figure, the scheduler bridges the cluster's master and its nodes. Pods about to be created wait in a scheduling queue (many client programs may be creating large numbers of pods at once), and the scheduler pops them off one by one, fetching each pod's detailed parameters, which come out of etcd by way of the API Server. The scheduler also maintains a node list covering every node in the cluster; besides IPs and hostnames it holds plenty of node metadata, such as how much memory and CPU a node has, what labels it carries, what taints it has, and so on. That is the scheduler's working logic in broad strokes.

Before Kubernetes 1.15, the scheduler followed the classic scheduling model.

Classic scheduling:

Two steps:
1. Predicates (filtering); think of each as a method or function.
2. Priorities (scoring).
Predicates: before the pod is placed, the scheduler first filters the nodes, checking which ones fail the requirements. Any node that fails is excluded from the rest of the process. Say the cluster has 10 nodes and filtering rules out two or three of them; the pod will not be scheduled there. After filtering comes scoring: if, say, seven nodes remain that meet the requirements, the scoring phase computes a score for each of them according to the various priority rules, and the node with the highest score is bound to the pod being created.

There is a great deal going on underneath: assorted queues, assorted algorithms, plus the extension plugins we'll cover later. From 1.15 onward, the scheduler is no longer limited to these two fixed policies; it is composed of scheduling plugins, and as administrators we can even write custom scheduler plugins that match our own requirements. (That is not something an everyday ops engineer does; it means writing code against the APIs the project exposes.)

What we cover today:

The mechanisms Kubernetes gives us, on top of its scheduling policies, for getting pods created on the nodes we deem suitable: node affinity, pod affinity, and node taints with pod tolerations.

1. Node affinity scheduling

Node selectors

A pod resource can use spec.nodeName to name the target node directly, or use the label selector in spec.nodeSelector to filter for qualifying nodes as targets.

Either way, we can get our pod created on the node we want.

Node affinity is the set of rules the scheduler uses to determine where a pod object may be placed (which node, or which class of nodes). The rules are defined in terms of custom labels on the nodes and label selectors specified on the pod object (remember, labels can go on any resource object): in the end the pod's label selector is matched against the nodes' labels. Two fields support this style of selection: spec.nodeSelector and spec.nodeName. spec.nodeName is simple and blunt: you directly name the node. Its drawback is that it names exactly one node, so if you run multiple pods through a controller, say a Deployment with three replicas, all three replicas can only land on that one node; nodeName pins everything to a single machine.

Then there is NodeAffinity, node affinity (nginx has CPU affinity; here we have node affinity). The node affinity mechanism lets a pod resource express its own preference for the class of node it wants to run on (i.e., as administrators, when we define a resource object we state which nodes we'd like it to land on).

A preference to run on certain kinds of nodes is an "affinity"; the opposite is an "anti-affinity".

(With many nodes in the cluster, I can say which nodes I want the pod on, and equally which nodes I do not want it on: anti-affinity.)

Affinity comes in two forms: required affinity and preferred affinity.

Required: run only on the nodes I specify; if no node satisfies the policy I defined, the pod does not run at all, and the whole thing fails.

Preferred affinity: comparatively flexible, like a soft limit versus a hard limit. If no node satisfies what I asked for, the scheduler still picks some node in the cluster and gets the pod running.

That is the difference between required and preferred.

These two concepts, required and preferred, recur throughout: node affinity, pod affinity, and node taints with pod tolerations all have them.

Preferred affinity is the softer of the two; it never leaves your pod unable to run.

[root@k8s-master1 pod]# kubectl explain pod.spec
KIND:     Pod
VERSION:  v1

RESOURCE: spec <Object>

DESCRIPTION:
     Specification of the desired behavior of the pod. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

     PodSpec is a description of a pod.

FIELDS:
   activeDeadlineSeconds    <integer>
     Optional duration in seconds the pod may be active on the node relative to
     StartTime before the system will actively try to mark it failed and kill
     associated containers. Value must be a positive integer.

   affinity <Object>
     If specified, the pod's scheduling constraints
[root@k8s-master1 pod]# kubectl explain pod.spec.affinity
KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
   #node affinity
     Describes node affinity scheduling rules for the pod.

   podAffinity  <Object>
   #pod affinity
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity  <Object>
   #pod anti-affinity
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).

Let's go through them one at a time.

Node affinity

[root@k8s-master1 pod]# kubectl explain pod.spec.affinity.nodeAffinity
KIND:     Pod
VERSION:  v1

RESOURCE: nodeAffinity <Object>

DESCRIPTION:
     Describes node affinity scheduling rules for the pod.

     Node affinity is a group of node affinity scheduling rules.

FIELDS:
   preferredDuringSchedulingIgnoredDuringExecution  <[]Object>
   #preferred affinity
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node matches
     the corresponding matchExpressions; the node(s) with the highest sum are
     the most preferred.

   requiredDuringSchedulingIgnoredDuringExecution   <Object>
   #required affinity
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to an update), the system may or may not try
     to eventually evict the pod from its node.

Pod affinity has the same two fields:

[root@k8s-master1 pod]# kubectl explain pod.spec.affinity.podAffinity
KIND:     Pod
VERSION:  v1

RESOURCE: podAffinity <Object>

DESCRIPTION:
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

     Pod affinity is a group of inter pod affinity scheduling rules.

FIELDS:
   preferredDuringSchedulingIgnoredDuringExecution  <[]Object>
   #preferred affinity
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node has pods
     which matches the corresponding podAffinityTerm; the node(s) with the
     highest sum are the most preferred.

   requiredDuringSchedulingIgnoredDuringExecution   <[]Object>
   #required affinity
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to a pod label update), the system may or
     may not try to eventually evict the pod from its node. When there are
     multiple elements, the lists of nodes corresponding to each podAffinityTerm
     are intersected, i.e. all terms must be satisfied.

[Figure: required node affinity examples]

Both cases in the figure are required affinity.

As shown, required affinity: say I want to run a pod (bottom left) on nodes carrying the label zone=foo. In the figure, the first and second nodes qualify; the rightmost does not.

The bottom-right pod is required affinity too, just with more conditions. Its first requirement is a zone label with a value in (foo, bar), which all three nodes satisfy; its second requires the gpu label, which only nodes 1 and 3 satisfy.

That is required affinity.

Node selectors:

spec.nodeName and spec.nodeSelector

nodeName directly names the node to run on.

nodeSelector specifies labels; any node carrying those labels qualifies.

Let's look at nodeName first:

[root@k8s-master1 pod]# kubectl explain pod.spec.nodeName
KIND:     Pod
VERSION:  v1

FIELD:    nodeName <string>

DESCRIPTION:
     NodeName is a request to schedule this pod onto a specific node. If it is
     non-empty, the scheduler simply schedules this pod onto that node, assuming
     that it fits resource requirements.

nodeSelector

[root@k8s-master1 pod]# kubectl explain pod.spec.nodeSelector
KIND:     Pod
VERSION:  v1

FIELD:    nodeSelector <map[string]string>

DESCRIPTION:
     NodeSelector is a selector which must be true for the pod to fit on a node.
     Selector which must match a node's labels for the pod to be scheduled on
     that node. More info:
     https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

Let's try it out.

nodeSelector

Following the figure above, we give the three nodes labels:

node1: gpu=true zone=foo

node2: zone=foo

node3: gpu=true zone=bar
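Applied with kubectl label, that would look something like this (a sketch; the gpu/zone values follow the figure):

kubectl label nodes k8s-node1.guoguo.com gpu=true zone=foo
kubectl label nodes k8s-node2.guoguo.com zone=foo
kubectl label nodes k8s-node3.guoguo.com gpu=true zone=bar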

A label can also be just a key with no value, for example:

[root@k8s-master1 ~]# kubectl label nodes k8s-node1.guoguo.com gpu=
node/k8s-node1.guoguo.com labeled

Where such labels are useful: there are two kinds of workloads that need special resources.

One is I/O-intensive, e.g. a pod running a database, which needs fast disks; you might give those nodes a disk-type label such as ssd.

The other is compute-intensive, e.g. AI or deep-learning workloads that need lots of compute backed by GPU chips; I give those pods a selector that hunts for nodes carrying a gpu label, as sketched below.
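For the I/O-intensive case, for instance, you might label the SSD-equipped nodes and select on that label (disktype is just an illustrative key name):

kubectl label nodes k8s-node3.guoguo.com disktype=ssd

# then, in the pod spec:
spec:
  nodeSelector:
    disktype: ssd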

[root@k8s-master1 ~]# kubectl get node k8s-node1.guoguo.com --show-labels
NAME                   STATUS   ROLES    AGE   VERSION   LABELS
k8s-node1.guoguo.com   Ready    <none>   12h   v1.26.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,gpu=,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1.guoguo.com,kubernetes.io/os=linux
[root@k8s-master1 ~]# kubectl get nodes -l gpu=
NAME                   STATUS   ROLES    AGE   VERSION
k8s-node1.guoguo.com   Ready    <none>   12h   v1.26.3

Let's write a pod:

apiVersion: v1
kind: Pod
metadata:
  name: pod-gpu-test-1
spec:
  containers:
  - name: nginx
    image: images.guoguo.com/apps/nginx:1.22.1
  nodeSelector: #node selector
    gpu: ""
    #schedule onto nodes whose gpu label exists with an empty value
[root@k8s-master1 9-5]# kubectl get pods -owide
NAME             READY   STATUS    RESTARTS      AGE   IP               NODE                   NOMINATED NODE   READINESS GATES
pod-gpu-test-1   1/1     Running   0             16s   192.26.131.133   k8s-node1.guoguo.com   <none>           <none>
#it lands on node1, as expected

Look at the describe output:

[root@k8s-master1 9-5]# kubectl describe pods pod-gpu-test-1 

 kube-api-access-lxlj8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              gpu= #the selector shows up here
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m44s  default-scheduler  Successfully assigned default/pod-gpu-test-1 to k8s-node1.guoguo.com
  Normal  Pulled     2m43s  kubelet            Container image "images.guoguo.com/apps/nginx:1.22.1" already present on machine
  Normal  Created    2m43s  kubelet            Created container nginx
  Normal  Started    2m43s  kubelet            Started container nginx
[root@k8s-master1 9-5]# 

That is a pod placed by way of the node label selector.

nodeName

apiVersion: v1
kind: Pod
metadata:
  name: nodename-test-1
spec:
  containers:
  - name: nginx
    image: images.guoguo.com/apps/nginx:1.22.1
  nodeName: k8s-node2.guoguo.com #pick node2
[root@k8s-master1 9-5]# kubectl get pod -owide
NAME              READY   STATUS    RESTARTS      AGE     IP               NODE                   NOMINATED NODE   READINESS GATES
nodename-test-1   1/1     Running   0             32s     192.28.252.199   k8s-node2.guoguo.com   <none>           <none> 

Note that the name given to nodeName must be the node's full registered name, for example:

[root@k8s-master1 9-5]# kubectl get nodes
NAME                   STATUS   ROLES           AGE   VERSION
k8s-master1            Ready    control-plane   12h   v1.26.3
k8s-node1.guoguo.com   Ready    <none>          12h   v1.26.3
k8s-node2.guoguo.com   Ready    <none>          12h   v1.26.3
k8s-node3.guoguo.com   Ready    <none>          12h   v1.26.3
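If you used a short name instead, say nodeName: k8s-node2 rather than k8s-node2.guoguo.com, the pod would be created but sit Pending forever: the scheduler skips pods that already have nodeName set, and no kubelet is registered under that name to claim it. An illustrative result:

[root@k8s-master1 9-5]# kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
nodename-test-1   0/1     Pending   0          1m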

Required node affinity

pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution <Object>

Its value can hold one or more nodeSelectorTerms objects (node selector terms), which are ORed with one another. nodeSelectorTerms defines node selectors as a list of objects, each supporting two expressive matching mechanisms, matchExpressions and matchFields. This resembles the label selectors we saw in the Deployment pod controller: matchExpressions is the same idea, while the other half differs (Deployments use matchLabels rather than matchFields).

matchExpressions: label selector expressions

That is, operators such as In (contained in), NotIn (not contained in), Exists / DoesNotExist, and so on.

matchFields

Matches on node fields rather than labels, again written as key/operator/values entries.

Those are the two matching mechanisms.

If you list multiple terms, the terms themselves are ORed, as the sketch below shows.
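A minimal sketch of that logic, reusing the labels from this article: the two terms below are ORed, while the two expressions inside the first term are ANDed:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:      # term 1: zone in (foo) AND gpu exists
        - key: zone
          operator: In
          values: ["foo"]
        - key: gpu
          operator: Exists
      - matchExpressions:      # term 2, ORed with term 1: region exists
        - key: region
          operator: Exists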

Let's write a required affinity.

First inspect the fields:

[root@k8s-master1 9-5]# kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1

RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>

DESCRIPTION:
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to an update), the system may or may not try
     to eventually evict the pod from its node.

     A node selector represents the union of the results of one or more label
     queries over a set of nodes; that is, it represents the OR of the selectors
     represented by the node selector terms.

FIELDS:
   nodeSelectorTerms    <[]Object> -required-
     Required. A list of node selector terms. The terms are ORed.
     #This field is required. It holds an array of node selector terms; the terms in the array are ORed together.
     #Each term holds matchExpressions and/or matchFields, whose requirements are ANDed together.

#If no node satisfies these affinity requirements at scheduling time, the pod is not scheduled.

#If a later change (e.g. a node label update) means the affinity is no longer satisfied while the pod is running, the system may or may not evict the pod.

#So this affinity guarantees node selection at scheduling time; it is not continuously enforced at runtime.

Continuing:

[root@k8s-master1 9-5]# kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms
KIND:     Pod
VERSION:  v1

RESOURCE: nodeSelectorTerms <[]Object>

DESCRIPTION:
     Required. A list of node selector terms. The terms are ORed.

     A null or empty node selector term matches no objects. The requirements of
     them are ANDed. The TopologySelectorTerm type implements a subset of the
     NodeSelectorTerm.

FIELDS:
   matchExpressions <[]Object>
   #expression matching; more below
   #match conditions based on node labels, written as key/operator/values entries
     A list of node selector requirements by node's labels.

   matchFields  <[]Object>
   #match conditions based on node fields rather than labels (in practice the node's metadata.name), written the same way
     A list of node selector requirements by node's fields.
[root@k8s-master1 9-5]# kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.matchExpressions
KIND:     Pod
VERSION:  v1

RESOURCE: matchExpressions <[]Object>

DESCRIPTION:
     A list of node selector requirements by node's labels.

     A node selector requirement is a selector that contains values, a key, and
     an operator that relates the key and values.

FIELDS:
   key  <string> -required-
   #the node label key the selector applies to
     The label key that the selector applies to.

   operator <string> -required-
   #operator relating the key to the values; one of In, NotIn, Exists, DoesNotExist, Gt, Lt
     Represents a key's relationship to a set of values. Valid operators are In,
     NotIn, Exists, DoesNotExist. Gt, and Lt.

     Possible enum values:
      - `"DoesNotExist"`  #the given label key must not exist on the node
      - `"Exists"`  #the given label key must exist on the node
      - `"Gt"` #the label value is greater than the given value
      - `"In"` #the label value is in the given list of values
      - `"Lt"` #the label value is less than the given value
      - `"NotIn"` #the label value is not in the given list of values

   values   <[]string>
   #the list of label values to match against, depending on the operator
     An array of string values. If the operator is In or NotIn, the values array
     must be non-empty. If the operator is Exists or DoesNotExist, the values
     array must be empty. If the operator is Gt or Lt, the values array must
     have a single element, which will be interpreted as an integer. This array
     is replaced during a strategic merge patch.

Let's write one:

[root@k8s-master1 9-5]# cat node-affinity-required-1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-required   #the Deployment name; the ReplicaSet name is nginx-required plus a hash, and pod names are the RS name plus another hash
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demoapp   #label
      ctlr: node-affinity-required  #label
  template:
    metadata:
      name: nginx-pod-required
      labels:
        app: demoapp   #label
        ctlr: node-affinity-required  #label
    spec:
      containers:
      - name: nginx
        image: images.guoguo.com/apps/nginx:1.22.1
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet: 
            port: 80
            path: /index.html  #probe path should be absolute
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
      affinity:
        nodeAffinity: #node affinity
          requiredDuringSchedulingIgnoredDuringExecution: #required affinity
            nodeSelectorTerms: #node label selector terms
            - matchExpressions: #expressions; fields detailed above
              - key: gpu #combined with the operator below: the node must carry the gpu label, or the pod is not scheduled there
                operator: Exists #the key must exist; see explanation above
              - key: node-role.kubernetes.io/master
              #this label is unique to master nodes; with DoesNotExist it means: do not schedule this pod onto masters
              #from 1.24 on the master label is no longer used (renamed for inclusive-language reasons);
              #use node-role.kubernetes.io/control-plane= instead
                operator: DoesNotExist  #the key must not exist; see above

Because only node1 carries the gpu label, all three replicas get scheduled onto node1:

[root@k8s-master1 9-5]# kubectl get pods -owide
NAME                              READY   STATUS    RESTARTS   AGE     IP               NODE                   NOMINATED NODE   READINESS GATES
nginx-required-75fc78879f-2lzvw   1/1     Running   0          2m22s   192.26.131.135   k8s-node1.guoguo.com   <none>           <none>
nginx-required-75fc78879f-4wgvj   1/1     Running   0          2m22s   192.26.131.134   k8s-node1.guoguo.com   <none>           <none>
nginx-required-75fc78879f-brd99   1/1     Running   0          2m22s   192.26.131.136   k8s-node1.guoguo.com   <none>           <none>

That is required node affinity: the pods run only on qualifying nodes, and if no node qualifies, they simply don't run.

For example, change the key to one that no node carries, and the pods can't start.

Edit the deployment with the changed key and apply it again:
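For instance, swapping the key in the deployment above for one no node carries (the key name here is made up):

            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-none          #hypothetical key that exists on no node
                operator: Exists
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist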

NAME                              READY   STATUS    RESTARTS   AGE
nginx-required-75fc78879f-fzpq2   1/1     Running   0          7m
nginx-required-75fc78879f-gkkc5   1/1     Running   0          7m
nginx-required-75fc78879f-j5hgz   1/1     Running   0          7m
nginx-required-7cf745dbfb-p74v4   0/1     Pending   0          3m23s
#when the new pods have no qualifying node, the previously created ones are not torn down
[root@k8s-master1 9-5]# kubectl describe pods nginx-required-7cf745dbfb-p74v4 
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  5m22s  default-scheduler  0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match  Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  #3 nodes didn't match the pod's node affinity/selector, so scheduling fails

So much for plain required node affinity.

Beyond required and preferred affinity, there is also resource fit: if no node can satisfy a resource object's resource requirements, creation likewise fails.

Check how much of each node's resources is already in use:

[root@k8s-master1 9-5]# kubectl top nodes
NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
k8s-master1            161m         8%     1218Mi          65%       
k8s-node1.guoguo.com   49m          2%     585Mi           31%       
k8s-node2.guoguo.com   57m          2%     739Mi           39%       
k8s-node3.guoguo.com   63m          3%     535Mi           28%

Each of the three worker nodes has 2 CPU cores and 2 GiB of memory.

Right now none of them has a full 2 cores / 2 GiB free.

So a pod demanding 2 cores and 2 GiB definitely won't fit. Let's write one:

[root@k8s-master1 9-5]# cat pod-resourcefits-1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-resource
spec:
  replicas: 3
  selector:
    matchLabels:
      apps: demoapp
      ctlr: node-affinity-and-resourcefits
  template:
    metadata:
      name: resource-nginx
      labels:
        apps: demoapp
        ctlr: node-affinity-and-resourcefits
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: images.guoguo.com/apps/nginx:1.22.1
        ports:
        - containerPort: 80
        startupProbe:
          exec:
            command:
              - cat
              - /etc/hosts
          failureThreshold: 2
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        livenessProbe:
          tcpSocket:
            port: 80
          failureThreshold: 2
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        readinessProbe:
          httpGet: 
            path: /index.html  #probe path should be absolute
            port: 80
            scheme: HTTP
          failureThreshold: 2
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        imagePullPolicy: Always
        resources:  #resources
          limits: #limits; with no explicit requests, requests default to these values, and the scheduler filters on requests
            cpu: 2  #2 CPU cores
            memory: 2Gi  #2 GiB of memory
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:  #node selector terms
            - matchExpressions:   #match expressions; here we match on labels, with the rules detailed above
              - key: gpu  #label key
                operator: Exists  #only nodes carrying the gpu label are eligible
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist

Let's give node2 the gpu label as well:

[root@k8s-master1 9-5]# kubectl label nodes k8s-node2.guoguo.com gpu=
node/k8s-node2.guoguo.com labeled
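Both worker nodes should now carry the label (output illustrative):

[root@k8s-master1 9-5]# kubectl get nodes -l gpu=
NAME                   STATUS   ROLES    AGE   VERSION
k8s-node1.guoguo.com   Ready    <none>   12h   v1.26.3
k8s-node2.guoguo.com   Ready    <none>   12h   v1.26.3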

Apply it and check:

[root@k8s-master1 9-5]# kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
pod-resource-577d98454f-5fcjc   0/1     Pending   0          4s
pod-resource-577d98454f-7t5hb   0/1     Pending   0          4s
pod-resource-577d98454f-qd89s   0/1     Pending   0          4s

All of them are Pending, because no eligible node currently has the resources.

Let's give node1 more CPU and memory (in the VM settings):

[Figure: VM settings, adding CPU and memory to node1]

Power it back on and look again; one pod has now been scheduled onto node1:

[root@k8s-master1 9-5]# kubectl get pods -owide
NAME                            READY   STATUS    RESTARTS   AGE   IP               NODE                   NOMINATED NODE   READINESS GATES
pod-resource-577d98454f-4t7pt   0/1     Pending   0          10m   <none>           <none>                 <none>   
pod-resource-577d98454f-mbfzr   1/1     Running   0          10m   192.26.131.145   k8s-node1.guoguo.com   <none>   
pod-resource-577d98454f-pkfl5   0/1     Pending   0          10m   <none>           <none>                 <none>   

That covers required affinity.

Next up:

Preferred node affinity

If no node matches the labels you asked for, the scheduler still finds some node to run the pod on.

You can have multiple conditions and assign each condition a weight, i.e. a priority. When nodes satisfy different conditions, the node matching the higher total weight is chosen; if no node satisfies any condition, one gets picked anyway, e.g. whichever node currently has the most idle resources.

Preferred affinity is, comparatively speaking, the soft option: it never leaves your pod with nowhere to run.

Take a look:
[root@k8s-master1 9-5]# kubectl explain pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1

RESOURCE: preferredDuringSchedulingIgnoredDuringExecution <[]Object>

DESCRIPTION:
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node matches
     the corresponding matchExpressions; the node(s) with the highest sum are
     the most preferred.

     An empty preferred scheduling term matches all objects with implicit weight
     0 (i.e. it's a no-op). A null preferred scheduling term matches no objects
     (i.e. is also a no-op).

FIELDS:
   preference   <Object> -required-
   #the condition; details below
     A node selector term, associated with the corresponding weight.

   weight   <integer> -required-
   #weight, an integer from 1 to 100; higher value, higher priority
     Weight associated with matching the corresponding nodeSelectorTerm, in the
     range 1-100.
[root@k8s-master1 9-5]# kubectl explain pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution.preference
KIND:     Pod
VERSION:  v1

RESOURCE: preference <Object>

DESCRIPTION:
     A node selector term, associated with the corresponding weight.

     A null or empty node selector term matches no objects. The requirements of
     them are ANDed. The TopologySelectorTerm type implements a subset of the
     NodeSelectorTerm.

FIELDS:
   matchExpressions <[]Object>
   #same as required affinity: node selection, label-based
   #match expressions over node labels, specifying the preferred node labels
     A list of node selector requirements by node's labels.

   matchFields  <[]Object>
   #match expressions over node fields, to pick preferred nodes by field
     A list of node selector requirements by node's fields.

Set the labels up like this (commands sketched below):

node1: gpu=
node2: region=bar
node3: no labels
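Assuming the labels from the earlier exercises are still in place, that means stripping gpu from node2 and adding region=bar (a trailing dash on a key removes that label); node1 keeps gpu= and node3 stays unlabeled:

kubectl label nodes k8s-node2.guoguo.com gpu-
kubectl label nodes k8s-node2.guoguo.com region=bar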

Now write the manifest:

[root@k8s-master1 9-5]# cat pod-preferred-node-1.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: preferred-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
      version: v1
  template:
    metadata:
      name: nginx-temp
      labels: 
        app: nginx
        version: v1
    spec:
      containers:
      - name: nginx
        image: images.guoguo.com/apps/nginx:1.22.1
        ports:
        - containerPort: 80
      affinity: #affinity
        nodeAffinity:  #node affinity
          preferredDuringSchedulingIgnoredDuringExecution:  #preferred node affinity
          - weight: 60 #weight 60
            preference: #condition
              matchExpressions: #label selector expressions
              - key: region  #label key
                values: ["bar","foo"]  #value bar or foo
                operator: In  #contained in the list
          - weight: 30  #weight 30
            preference:
              matchExpressions:
              - key: gpu
                operator: Exists #the key must exist

First it looks for a node whose region label is bar or foo (either works); the other condition looks for the gpu label.

The region condition has weight 60, the gpu condition weight 30.

If no node carries both labels, nodes with the region label win first, i.e. node2 is preferred, then node1, and node3 only as a last resort.

[root@k8s-master1 9-5]# kubectl get pods -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE                 
preferred-nginx-694f8bf55-dtv6b   1/1     Running   0          30s   192.28.252.211   k8s-node2.guoguo.com 
preferred-nginx-694f8bf55-k7kwj   1/1     Running   0          30s   192.26.131.151   k8s-node1.guoguo.com 
preferred-nginx-694f8bf55-nx7qx   1/1     Running   0          30s   192.26.131.150   k8s-node1.guoguo.com 
preferred-nginx-694f8bf55-s99jm   1/1     Running   0          30s   192.28.252.213   k8s-node2.guoguo.com 
preferred-nginx-694f8bf55-wctbn   1/1     Running   0          30s   192.28.252.212   k8s-node2.guoguo.com

node1 runs 2 pods, node2 runs 3, node3 runs none (the deployment has evidently been scaled to 5 replicas here).

No node satisfies both of the conditions above, i.e. none carries both region and gpu.

node1 has gpu: weight 30

node2 has region: weight 60

node3 has nothing: weight 0

With preferred affinity, even when no node fully satisfies the requirements, the scheduler still picks the higher-weighted nodes to run the pod (workload).

That wraps up node affinity: in each case we created pods and steered them onto designated nodes, with the labels applied to the nodes themselves. That is node affinity.
