k8s pod unable to connect to clusterip service

12/28/2019

I am using the Airflow Helm chart to run Airflow on k8s. However, the web pod can't seem to connect to PostgreSQL. The odd thing is that other pods can.

I've cobbled together small scripts to check, and this is what I found:

[root@ip-10-56-173-248 bin]# cat checkpostgres.sh
docker exec -u root $1 /bin/nc -zvw2 airflow-postgresql 5432
[root@ip-10-56-173-248 bin]# docker ps --format '{{.Names}}\t{{.ID}}'|grep k8s_airflow|grep default|awk '{printf("%s ",$1); system("checkpostgres.sh " $2)}'
k8s_airflow-web_airflow-web-57c6dcd77b-dvjmv_default_67d74586-284b-11ea-8021-0249931777ef_74 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) : Connection timed out
k8s_airflow-worker_airflow-worker-0_default_67e1703a-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
k8s_airflow-scheduler_airflow-scheduler-5d9b688ccf-zdjdl_default_67d3fab4-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
k8s_airflow-postgresql_airflow-postgresql-76c954bb7f-gwq68_default_67d1cf3d-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
k8s_airflow-redis_airflow-redis-master-0_default_67d9aa36-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (?) open
k8s_airflow-flower_airflow-flower-79c999764d-d4q58_default_67d267e2-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
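
For reference, roughly the same probe can be run through kubectl instead of docker exec. This is a minimal sketch: the release=airflow label selector is an assumption about the labels the chart applies, and it assumes nc is present in each image:

# Probe the airflow-postgresql Service from every airflow pod in the default namespace.
# NOTE: the label selector below is an assumption; adjust it to match the actual pod labels.
for pod in $(kubectl get pods -n default -l release=airflow -o jsonpath='{.items[*].metadata.name}'); do
  printf '%s ' "$pod"
  kubectl exec -n default "$pod" -- nc -zvw2 airflow-postgresql 5432
done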

And this is my k8s version info:

➜  ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:11:03Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

When I do an nslookup on the service name, it resolves fine:

# nslookup airflow-postgresql
Server:     172.20.0.10
Address:    172.20.0.10#53

Non-authoritative answer:
Name:   airflow-postgresql.default.svc.cluster.local
Address: 172.20.166.209
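
Since DNS resolves to the expected ClusterIP, the Service itself can also be checked for endpoints (a quick sketch, assuming the default namespace where the chart is deployed):

# Confirm the Service exists and that it is backed by the postgresql pod's endpoint.
kubectl get svc airflow-postgresql -n default
kubectl get endpoints airflow-postgresql -n default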

EDIT: As requested, here is the EKS setup:

amazon-eks-nodegroup.yaml:

---
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Amazon EKS - Node Group'

Parameters:

  KeyName:
    Description: The EC2 Key Pair to allow SSH access to the instances
    Type: AWS::EC2::KeyPair::KeyName

  NodeImageId:
    Type: AWS::EC2::Image::Id
    Description: AMI id for the node instances.

  NodeInstanceType:
    Description: EC2 instance type for the node instances
    Type: String
    Default: t3.medium
    AllowedValues:
    - t2.small
    - t2.medium
    - t2.large
    - t2.xlarge
    - t2.2xlarge
    - t3.nano
    - t3.micro
    - t3.small
    - t3.medium
    - t3.large
    - t3.xlarge
    - t3.2xlarge
    - m3.medium
    - m3.large
    - m3.xlarge
    - m3.2xlarge
    - m4.large
    - m4.xlarge
    - m4.2xlarge
    - m4.4xlarge
    - m4.10xlarge
    - m5.large
    - m5.xlarge
    - m5.2xlarge
    - m5.4xlarge
    - m5.12xlarge
    - m5.24xlarge
    - c4.large
    - c4.xlarge
    - c4.2xlarge
    - c4.4xlarge
    - c4.8xlarge
    - c5.large
    - c5.xlarge
    - c5.2xlarge
    - c5.4xlarge
    - c5.9xlarge
    - c5.18xlarge
    - i3.large
    - i3.xlarge
    - i3.2xlarge
    - i3.4xlarge
    - i3.8xlarge
    - i3.16xlarge
    - r3.xlarge
    - r3.2xlarge
    - r3.4xlarge
    - r3.8xlarge
    - r4.large
    - r4.xlarge
    - r4.2xlarge
    - r4.4xlarge
    - r4.8xlarge
    - r4.16xlarge
    - x1.16xlarge
    - x1.32xlarge
    - p2.xlarge
    - p2.8xlarge
    - p2.16xlarge
    - p3.2xlarge
    - p3.8xlarge
    - p3.16xlarge
    - r5.large
    - r5.xlarge
    - r5.2xlarge
    - r5.4xlarge
    - r5.12xlarge
    - r5.24xlarge
    - r5d.large
    - r5d.xlarge
    - r5d.2xlarge
    - r5d.4xlarge
    - r5d.12xlarge
    - r5d.24xlarge
    - z1d.large
    - z1d.xlarge
    - z1d.2xlarge
    - z1d.3xlarge
    - z1d.6xlarge
    - z1d.12xlarge
    ConstraintDescription: Must be a valid EC2 instance type

  NodeAutoScalingGroupMinSize:
    Type: Number
    Description: Minimum size of Node Group ASG.
    Default: 1

  NodeAutoScalingGroupMaxSize:
    Type: Number
    Description: Maximum size of Node Group ASG. Set to at least 1 greater than NodeAutoScalingGroupDesiredCapacity.
    Default: 4

  NodeAutoScalingGroupDesiredCapacity:
    Type: Number
    Description: Desired capacity of Node Group ASG.
    Default: 3

  NodeVolumeSize:
    Type: Number
    Description: Node volume size
    Default: 20

  ClusterName:
    Description: The cluster name provided when the cluster was created. If it is incorrect, nodes will not be able to join the cluster. i.e. "eks"
    Type: String

  Environment:
    Description: the Environment value provided when the cluster was created. i.e. "dev"
    Type: String

  BootstrapArguments:
    Description: Arguments to pass to the bootstrap script. See files/bootstrap.sh in https://github.com/awslabs/amazon-eks-ami
    Default: ""
    Type: String

  VpcId:
    Description: The VPC of the worker instances stack reference
    Type: String

  Subnets:
    Description: The subnets where workers can be created.
    Type: String

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      -
        Label:
          default: "EKS Cluster"
        Parameters:
          - ClusterName
      -
        Label:
          default: "dev"
        Parameters:
          - Environment
      -
        Label:
          default: "Worker Node Configuration"
        Parameters:
          - NodeAutoScalingGroupMinSize
          - NodeAutoScalingGroupDesiredCapacity
          - NodeAutoScalingGroupMaxSize
          - NodeInstanceType
          - NodeImageId
          - NodeVolumeSize
          - KeyName
          - BootstrapArguments
      -
        Label:
          default: "Worker Network Configuration"
        Parameters:
          - VpcId
          - Subnets

Resources:

  NodeInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: !Sub "${ClusterName}-${Environment}-cluster-node-instance-profile"
      Path: "/"
      Roles:
      - !Ref NodeInstanceRole

  NodeInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "${ClusterName}-${Environment}-cluster-node-instance-role"
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - ec2.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      Policies:
        -
          PolicyName: "change-r53-recordsets"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: Allow
                Action: route53:ChangeResourceRecordSets
                Resource: !Sub
                  - "arn:aws:route53:::hostedzone/${ZoneId}"
                  - {ZoneId: !ImportValue DNS-AccountZoneID}
        -
          PolicyName: "list-r53-resources"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: Allow
                Action:
                  - route53:ListHostedZones
                  - route53:ListResourceRecordSets
                Resource: "*"

  NodeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for all nodes in the cluster
      GroupName: !Sub "${ClusterName}-${Environment}-cluster-security-group"
      VpcId:
        Fn::ImportValue:
          !Sub ${VpcId}-vpcid
      Tags:
      - Key: !Sub "kubernetes.io/cluster/${ClusterName}-${Environment}-cluster"
        Value: 'owned'

  NodeSecurityGroupIngress:
    Type: AWS::EC2::SecurityGroupIngress
    DependsOn: NodeSecurityGroup
    Properties:
      Description: Allow node to communicate with each other
      GroupId: !Ref NodeSecurityGroup
      SourceSecurityGroupId: !Ref NodeSecurityGroup
      IpProtocol: '-1'
      FromPort: 0
      ToPort: 65535

  NodeSecurityGroupFromControlPlaneIngress:
    Type: AWS::EC2::SecurityGroupIngress
    DependsOn: NodeSecurityGroup
    Properties:
      Description: Allow worker Kubelets and pods to receive communication from the cluster control plane
      GroupId: !Ref NodeSecurityGroup
      SourceSecurityGroupId:
        Fn::ImportValue:
          !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
      IpProtocol: tcp
      FromPort: 1025
      ToPort: 65535

  ControlPlaneEgressToNodeSecurityGroup:
    Type: AWS::EC2::SecurityGroupEgress
    DependsOn: NodeSecurityGroup
    Properties:
      Description: Allow the cluster control plane to communicate with worker Kubelet and pods
      GroupId:
        Fn::ImportValue:
          !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
      DestinationSecurityGroupId: !Ref NodeSecurityGroup
      IpProtocol: tcp
      FromPort: 1025
      ToPort: 65535

  NodeSecurityGroupFromControlPlaneOn443Ingress:
    Type: AWS::EC2::SecurityGroupIngress
    DependsOn: NodeSecurityGroup
    Properties:
      Description: Allow pods running extension API servers on port 443 to receive communication from cluster control plane
      GroupId: !Ref NodeSecurityGroup
      SourceSecurityGroupId:
        Fn::ImportValue:
          !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
      IpProtocol: tcp
      FromPort: 443
      ToPort: 443

  ControlPlaneEgressToNodeSecurityGroupOn443:
    Type: AWS::EC2::SecurityGroupEgress
    DependsOn: NodeSecurityGroup
    Properties:
      Description: Allow the cluster control plane to communicate with pods running extension API servers on port 443
      GroupId:
        Fn::ImportValue:
          !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
      DestinationSecurityGroupId: !Ref NodeSecurityGroup
      IpProtocol: tcp
      FromPort: 443
      ToPort: 443

  ClusterControlPlaneSecurityGroupIngress:
    Type: AWS::EC2::SecurityGroupIngress
    DependsOn: NodeSecurityGroup
    Properties:
      Description: Allow pods to communicate with the cluster API Server
      GroupId:
        Fn::ImportValue:
          !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
      SourceSecurityGroupId: !Ref NodeSecurityGroup
      IpProtocol: tcp
      ToPort: 443
      FromPort: 443

  NodeGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: !Sub "${ClusterName}-${Environment}-cluster-nodegroup"
      DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity
      LaunchConfigurationName: !Ref NodeLaunchConfig
      MinSize: !Ref NodeAutoScalingGroupMinSize
      MaxSize: !Ref NodeAutoScalingGroupMaxSize
      VPCZoneIdentifier:
        - Fn::Select:
          - 0
          - Fn::Split:
            - ","
            - Fn::ImportValue:
                !Sub ${Subnets}
        - Fn::Select:
          - 1
          - Fn::Split:
            - ","
            - Fn::ImportValue:
                !Sub ${Subnets}
        - Fn::Select:
          - 2
          - Fn::Split:
            - ","
            - Fn::ImportValue:
                !Sub ${Subnets}
      Tags:
      - Key: Name
        Value: !Sub "${ClusterName}-${Environment}-cluster-nodegroup"
        PropagateAtLaunch: 'true'
      - Key: !Sub 'kubernetes.io/cluster/${ClusterName}-${Environment}-cluster'
        Value: 'owned'
        PropagateAtLaunch: 'true'
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: '1'
        MinInstancesInService: !Ref NodeAutoScalingGroupDesiredCapacity
        PauseTime: 'PT5M'

  NodeLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      LaunchConfigurationName: !Sub "${ClusterName}-${Environment}-cluster-node-launch-config"
      AssociatePublicIpAddress: 'true'
      IamInstanceProfile: !Ref NodeInstanceProfile
      ImageId: !Ref NodeImageId
      InstanceType: !Ref NodeInstanceType
      KeyName: !Ref KeyName
      SecurityGroups:
      - !Ref NodeSecurityGroup
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: !Ref NodeVolumeSize
            VolumeType: gp2
            DeleteOnTermination: true
      UserData:
        Fn::Base64:
          !Sub |
            #!/bin/bash
            set -o xtrace
            /etc/eks/bootstrap.sh ${BootstrapArguments} ${ClusterName}-${Environment}-cluster
            sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
            sudo start amazon-ssm-agent
            sudo sysctl -w vm.max_map_count=262144
            /opt/aws/bin/cfn-signal --exit-code $? \
                     --stack  ${AWS::StackName} \
                     --resource NodeGroup  \
                     --region ${AWS::Region}

Outputs:

  NodeInstanceRole:
    Description: The node instance role
    Value: !GetAtt NodeInstanceRole.Arn
    Export:
      Name: !Sub "${ClusterName}-${Environment}-cluster-nodegroup-rolearn"

  NodeSecurityGroup:
    Description: The security group for the node group
    Value: !Ref NodeSecurityGroup

amazon-eks-cluster.yaml:

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Amazon EKS - Cluster'

Parameters:

  VPCStack:
    Type: String
    Description: VPC Stack Name

  ClusterName:
    Type: String
    Description: EKS Cluster Name (i.e. "eks")

  Environment:
    Type: String
    Description: Environment for this Cluster (i.e. "dev") which will be appended to the ClusterName (i.e. "eks-dev")

Resources:

  ClusterRole:
    Description: Allows EKS to manage clusters on your behalf.
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "${ClusterName}-${Environment}-cluster-role"
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
            Effect: Allow
            Principal:
              Service:
                - eks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
      Policies:
        -
          PolicyName: "change-r53-recordsets"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: Allow
                Action: route53:ChangeResourceRecordSets
                Resource: !Sub
                  - "arn:aws:route53:::hostedzone/${ZoneId}"
                  - {ZoneId: !ImportValue DNS-AccountZoneID}
        -
          PolicyName: "list-r53-resources"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: Allow
                Action:
                  - route53:ListHostedZones
                  - route53:ListResourceRecordSets
                Resource: "*"

  ClusterControlPlaneSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub "${ClusterName}-${Environment}-cluster-control-plane-sg"
      GroupDescription: Cluster communication with worker nodes
      VpcId:
        Fn::ImportValue:
          !Sub "${VPCStack}-vpcid"

  Cluster:
    Type: "AWS::EKS::Cluster"
    Properties:
      Version: "1.14"
      Name: !Sub "${ClusterName}-${Environment}-cluster"
      RoleArn: !GetAtt ClusterRole.Arn
      ResourcesVpcConfig:
        SecurityGroupIds:
          - !Ref ClusterControlPlaneSecurityGroup
        SubnetIds:
          - Fn::Select:
            - 0
            - Fn::Split:
              - ","
              - Fn::ImportValue:
                  !Sub "${VPCStack}-privatesubnets"
          - Fn::Select:
            - 1
            - Fn::Split:
              - ","
              - Fn::ImportValue:
                  !Sub "${VPCStack}-privatesubnets"
          - Fn::Select:
            - 2
            - Fn::Split:
              - ","
              - Fn::ImportValue:
                  !Sub "${VPCStack}-privatesubnets"

  Route53Cname:
    Type: "AWS::Route53::RecordSet"
    Properties:
      HostedZoneId: !ImportValue DNS-AccountZoneID
      Comment: CNAME for Control Plane Endpoint
      Name: !Sub
        - "k8s.${Environment}.${Zone}"
        - { Zone: !ImportValue Main-zone-name}
      Type: CNAME
      TTL: '900'
      ResourceRecords:
        - !GetAtt Cluster.Endpoint

Outputs:
  ClusterName:
    Value: !Ref Cluster
    Description: Cluster Name
    Export:
      Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterName"

  ClusterArn:
    Value: !GetAtt Cluster.Arn
    Description: Cluster Arn
    Export:
      Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterArn"

  ClusterEndpoint:
    Value: !GetAtt Cluster.Endpoint
    Description: Cluster Endpoint
    Export:
      Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterEndpoint"

  ClusterControlPlaneSecurityGroup:
    Value: !Ref ClusterControlPlaneSecurityGroup
    Description: ClusterControlPlaneSecurityGroup
    Export:
      Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"

cluster-parameters.json:

[
  {
    "ParameterKey": "VPCStack",
    "ParameterValue": "Main"
  },
  {
    "ParameterKey": "ClusterName",
    "ParameterValue": "amundsen-eks"
  },
  {
    "ParameterKey": "Environment",
    "ParameterValue": "dev"
  }
]

nodegroup-parameters.json:

[
  {
    "ParameterKey": "KeyName",
    "ParameterValue": "data-warehouse-dev"
  },
  {
    "ParameterKey": "NodeImageId",
    "ParameterValue": "ami-08739803f18dcc019"
  },
  {
    "ParameterKey": "NodeInstanceType",
    "ParameterValue": "r5.2xlarge"
  },
  {
    "ParameterKey": "NodeAutoScalingGroupMinSize",
    "ParameterValue": "1"
  },
  {
    "ParameterKey": "NodeAutoScalingGroupMaxSize",
    "ParameterValue": "3"
  },
  {
    "ParameterKey": "NodeAutoScalingGroupDesiredCapacity",
    "ParameterValue": "2"
  },
  {
    "ParameterKey": "NodeVolumeSize",
    "ParameterValue": "20"
  },
  {
    "ParameterKey": "ClusterName",
    "ParameterValue": "amundsen-eks"
  },
  {
    "ParameterKey": "Environment",
    "ParameterValue": "dev"
  },
  {
    "ParameterKey": "BootstrapArguments",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "VpcId",
    "ParameterValue": "Main"
  },
  {
    "ParameterKey": "Subnets",
    "ParameterValue": "Main-privatesubnets"
  }
]

And the creation scripts:

cluster:

aws cloudformation create-stack \
  --stack-name amundsen-eks-cluster \
  --parameters file://./cluster-parameters.json \
  --template-body file://../../../../templates/cloud-formation/eks/amazon-eks-cluster.yaml \
  --capabilities CAPABILITY_NAMED_IAM --profile myprofile

nodegroup:

aws cloudformation create-stack \
  --stack-name amundsen-eks-cluster-nodegroup \
  --parameters file://./nodegroup-parameters.json \
  --template-body file://../../../../templates/cloud-formation/eks/amazon-eks-nodegroup.yaml \
  --capabilities CAPABILITY_NAMED_IAM --profile myprofile
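
For completeness, two further checks that can be run alongside the templates above (a sketch, commands only, assuming the default namespace): which node each pod landed on, and whether any NetworkPolicy is in play:

# See whether the failing web pod is scheduled on a different node than the pods that can connect.
kubectl get pods -n default -o wide

# Rule out a NetworkPolicy selecting the web pod and blocking its traffic.
kubectl get networkpolicy --all-namespaces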

What would cause this behavior, or what else could I check to narrow this down?

-- javamonkey79
airflow
amazon-web-services
kubernetes
kubernetes-helm
