I am using the Airflow Helm chart to run Airflow on k8s. However, the web pod can't seem to connect to PostgreSQL. The odd thing is that the other pods can.
I've cobbled together a small check script, and this is what I found:
[root@ip-10-56-173-248 bin]# cat checkpostgres.sh
docker exec -u root $1 /bin/nc -zvw2 airflow-postgresql 5432
[root@ip-10-56-173-248 bin]# docker ps --format '{{.Names}}\t{{.ID}}'|grep k8s_airflow|grep default|awk '{printf("%s ",$1); system("checkpostgres.sh " $2)}'
k8s_airflow-web_airflow-web-57c6dcd77b-dvjmv_default_67d74586-284b-11ea-8021-0249931777ef_74 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) : Connection timed out
k8s_airflow-worker_airflow-worker-0_default_67e1703a-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
k8s_airflow-scheduler_airflow-scheduler-5d9b688ccf-zdjdl_default_67d3fab4-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
k8s_airflow-postgresql_airflow-postgresql-76c954bb7f-gwq68_default_67d1cf3d-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
k8s_airflow-redis_airflow-redis-master-0_default_67d9aa36-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (?) open
k8s_airflow-flower_airflow-flower-79c999764d-d4q58_default_67d267e2-284b-11ea-8021-0249931777ef_0 airflow-postgresql.default.svc.cluster.local [172.20.166.209] 5432 (postgresql) open
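For completeness, the same probe can also be run through kubectl instead of docker exec on the node; a rough sketch, assuming the pods live in the default namespace and the images ship /bin/nc (which the output above suggests):
#!/bin/bash
# checkpostgres-k8s.sh - same nc probe as above, but via kubectl exec
for pod in $(kubectl get pods -n default -o name | grep airflow); do
  printf '%s ' "${pod#pod/}"
  kubectl exec -n default "${pod#pod/}" -- /bin/nc -zvw2 airflow-postgresql 5432
done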
And this is my k8s version info:
➜ ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:11:03Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
When I do an nslookup on the service name, it seems to work fine:
# nslookup airflow-postgresql
Server: 172.20.0.10
Address: 172.20.0.10#53
Non-authoritative answer:
Name: airflow-postgresql.default.svc.cluster.local
Address: 172.20.166.209
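So name resolution works, and this looks like a TCP-level problem that only affects the web pod. The follow-up checks I can think of are sketched below, using the pod name from the output above and a placeholder for the postgres pod IP:
# Which node is each pod scheduled on? (maybe the web pod landed on a different node)
kubectl get pods -o wide -n default | grep airflow
# The service's ClusterIP and the endpoints behind it
kubectl get svc airflow-postgresql -n default
kubectl get endpoints airflow-postgresql -n default
# From inside the failing web pod, probe the service IP and the backing pod IP
# directly, bypassing DNS (<postgres-pod-ip> comes from the endpoints output)
kubectl exec -n default airflow-web-57c6dcd77b-dvjmv -- /bin/nc -zvw2 172.20.166.209 5432
kubectl exec -n default airflow-web-57c6dcd77b-dvjmv -- /bin/nc -zvw2 <postgres-pod-ip> 5432
# Node-local networking pods on whichever node hosts airflow-web
kubectl get pods -n kube-system -o wide | grep -E 'kube-proxy|aws-node|coredns'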
EDIT: As requested, here is the EKS setup:
amazon-eks-nodegroup.yaml:
---
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Amazon EKS - Node Group'
Parameters:
KeyName:
Description: The EC2 Key Pair to allow SSH access to the instances
Type: AWS::EC2::KeyPair::KeyName
NodeImageId:
Type: AWS::EC2::Image::Id
Description: AMI id for the node instances.
NodeInstanceType:
Description: EC2 instance type for the node instances
Type: String
Default: t3.medium
AllowedValues:
- t2.small
- t2.medium
- t2.large
- t2.xlarge
- t2.2xlarge
- t3.nano
- t3.micro
- t3.small
- t3.medium
- t3.large
- t3.xlarge
- t3.2xlarge
- m3.medium
- m3.large
- m3.xlarge
- m3.2xlarge
- m4.large
- m4.xlarge
- m4.2xlarge
- m4.4xlarge
- m4.10xlarge
- m5.large
- m5.xlarge
- m5.2xlarge
- m5.4xlarge
- m5.12xlarge
- m5.24xlarge
- c4.large
- c4.xlarge
- c4.2xlarge
- c4.4xlarge
- c4.8xlarge
- c5.large
- c5.xlarge
- c5.2xlarge
- c5.4xlarge
- c5.9xlarge
- c5.18xlarge
- i3.large
- i3.xlarge
- i3.2xlarge
- i3.4xlarge
- i3.8xlarge
- i3.16xlarge
- r3.xlarge
- r3.2xlarge
- r3.4xlarge
- r3.8xlarge
- r4.large
- r4.xlarge
- r4.2xlarge
- r4.4xlarge
- r4.8xlarge
- r4.16xlarge
- x1.16xlarge
- x1.32xlarge
- p2.xlarge
- p2.8xlarge
- p2.16xlarge
- p3.2xlarge
- p3.8xlarge
- p3.16xlarge
- r5.large
- r5.xlarge
- r5.2xlarge
- r5.4xlarge
- r5.12xlarge
- r5.24xlarge
- r5d.large
- r5d.xlarge
- r5d.2xlarge
- r5d.4xlarge
- r5d.12xlarge
- r5d.24xlarge
- z1d.large
- z1d.xlarge
- z1d.2xlarge
- z1d.3xlarge
- z1d.6xlarge
- z1d.12xlarge
ConstraintDescription: Must be a valid EC2 instance type
NodeAutoScalingGroupMinSize:
Type: Number
Description: Minimum size of Node Group ASG.
Default: 1
NodeAutoScalingGroupMaxSize:
Type: Number
Description: Maximum size of Node Group ASG. Set to at least 1 greater than NodeAutoScalingGroupDesiredCapacity.
Default: 4
NodeAutoScalingGroupDesiredCapacity:
Type: Number
Description: Desired capacity of Node Group ASG.
Default: 3
NodeVolumeSize:
Type: Number
Description: Node volume size
Default: 20
ClusterName:
Description: The cluster name provided when the cluster was created. If it is incorrect, nodes will not be able to join the cluster. i.e. "eks"
Type: String
Environment:
Description: the Environment value provided when the cluster was created. i.e. "dev"
Type: String
BootstrapArguments:
Description: Arguments to pass to the bootstrap script. See files/bootstrap.sh in https://github.com/awslabs/amazon-eks-ami
Default: ""
Type: String
VpcId:
Description: The VPC of the worker instances stack reference
Type: String
Subnets:
Description: The subnets where workers can be created.
Type: String
Metadata:
AWS::CloudFormation::Interface:
ParameterGroups:
-
Label:
default: "EKS Cluster"
Parameters:
- ClusterName
-
Label:
default: "dev"
Parameters:
- Environment
-
Label:
default: "Worker Node Configuration"
Parameters:
- NodeAutoScalingGroupMinSize
- NodeAutoScalingGroupDesiredCapacity
- NodeAutoScalingGroupMaxSize
- NodeInstanceType
- NodeImageId
- NodeVolumeSize
- KeyName
- BootstrapArguments
-
Label:
default: "Worker Network Configuration"
Parameters:
- VpcId
- Subnets
Resources:
NodeInstanceProfile:
Type: AWS::IAM::InstanceProfile
Properties:
InstanceProfileName: !Sub "${ClusterName}-${Environment}-cluster-node-instance-profile"
Path: "/"
Roles:
- !Ref NodeInstanceRole
NodeInstanceRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub "${ClusterName}-${Environment}-cluster-node-instance-role"
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service:
- ec2.amazonaws.com
Action:
- sts:AssumeRole
Path: "/"
ManagedPolicyArns:
- arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
- arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
- arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess
- arn:aws:iam::aws:policy/AmazonS3FullAccess
- arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
Policies:
-
PolicyName: "change-r53-recordsets"
PolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: Allow
Action: route53:ChangeResourceRecordSets
Resource: !Sub
- "arn:aws:route53:::hostedzone/${ZoneId}"
- {ZoneId: !ImportValue DNS-AccountZoneID}
-
PolicyName: "list-r53-resources"
PolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: Allow
Action:
- route53:ListHostedZones
- route53:ListResourceRecordSets
Resource: "*"
NodeSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for all nodes in the cluster
GroupName: !Sub "${ClusterName}-${Environment}-cluster-security-group"
VpcId:
Fn::ImportValue:
!Sub ${VpcId}-vpcid
Tags:
- Key: !Sub "kubernetes.io/cluster/${ClusterName}-${Environment}-cluster"
Value: 'owned'
NodeSecurityGroupIngress:
Type: AWS::EC2::SecurityGroupIngress
DependsOn: NodeSecurityGroup
Properties:
Description: Allow node to communicate with each other
GroupId: !Ref NodeSecurityGroup
SourceSecurityGroupId: !Ref NodeSecurityGroup
IpProtocol: '-1'
FromPort: 0
ToPort: 65535
NodeSecurityGroupFromControlPlaneIngress:
Type: AWS::EC2::SecurityGroupIngress
DependsOn: NodeSecurityGroup
Properties:
Description: Allow worker Kubelets and pods to receive communication from the cluster control plane
GroupId: !Ref NodeSecurityGroup
SourceSecurityGroupId:
Fn::ImportValue:
!Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
IpProtocol: tcp
FromPort: 1025
ToPort: 65535
ControlPlaneEgressToNodeSecurityGroup:
Type: AWS::EC2::SecurityGroupEgress
DependsOn: NodeSecurityGroup
Properties:
Description: Allow the cluster control plane to communicate with worker Kubelet and pods
GroupId:
Fn::ImportValue:
!Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
DestinationSecurityGroupId: !Ref NodeSecurityGroup
IpProtocol: tcp
FromPort: 1025
ToPort: 65535
NodeSecurityGroupFromControlPlaneOn443Ingress:
Type: AWS::EC2::SecurityGroupIngress
DependsOn: NodeSecurityGroup
Properties:
Description: Allow pods running extension API servers on port 443 to receive communication from cluster control plane
GroupId: !Ref NodeSecurityGroup
SourceSecurityGroupId:
Fn::ImportValue:
!Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
IpProtocol: tcp
FromPort: 443
ToPort: 443
ControlPlaneEgressToNodeSecurityGroupOn443:
Type: AWS::EC2::SecurityGroupEgress
DependsOn: NodeSecurityGroup
Properties:
Description: Allow the cluster control plane to communicate with pods running extension API servers on port 443
GroupId:
Fn::ImportValue:
!Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
DestinationSecurityGroupId: !Ref NodeSecurityGroup
IpProtocol: tcp
FromPort: 443
ToPort: 443
ClusterControlPlaneSecurityGroupIngress:
Type: AWS::EC2::SecurityGroupIngress
DependsOn: NodeSecurityGroup
Properties:
Description: Allow pods to communicate with the cluster API Server
GroupId:
Fn::ImportValue:
!Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
SourceSecurityGroupId: !Ref NodeSecurityGroup
IpProtocol: tcp
ToPort: 443
FromPort: 443
NodeGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
AutoScalingGroupName: !Sub "${ClusterName}-${Environment}-cluster-nodegroup"
DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity
LaunchConfigurationName: !Ref NodeLaunchConfig
MinSize: !Ref NodeAutoScalingGroupMinSize
MaxSize: !Ref NodeAutoScalingGroupMaxSize
VPCZoneIdentifier:
- Fn::Select:
- 0
- Fn::Split:
- ","
- Fn::ImportValue:
!Sub ${Subnets}
- Fn::Select:
- 1
- Fn::Split:
- ","
- Fn::ImportValue:
!Sub ${Subnets}
- Fn::Select:
- 2
- Fn::Split:
- ","
- Fn::ImportValue:
!Sub ${Subnets}
Tags:
- Key: Name
Value: !Sub "${ClusterName}-${Environment}-cluster-nodegroup"
PropagateAtLaunch: 'true'
- Key: !Sub 'kubernetes.io/cluster/${ClusterName}-${Environment}-cluster'
Value: 'owned'
PropagateAtLaunch: 'true'
UpdatePolicy:
AutoScalingRollingUpdate:
MaxBatchSize: '1'
MinInstancesInService: !Ref NodeAutoScalingGroupDesiredCapacity
PauseTime: 'PT5M'
NodeLaunchConfig:
Type: AWS::AutoScaling::LaunchConfiguration
Properties:
LaunchConfigurationName: !Sub "${ClusterName}-${Environment}-cluster-node-launch-config"
AssociatePublicIpAddress: 'true'
IamInstanceProfile: !Ref NodeInstanceProfile
ImageId: !Ref NodeImageId
InstanceType: !Ref NodeInstanceType
KeyName: !Ref KeyName
SecurityGroups:
- !Ref NodeSecurityGroup
BlockDeviceMappings:
- DeviceName: /dev/xvda
Ebs:
VolumeSize: !Ref NodeVolumeSize
VolumeType: gp2
DeleteOnTermination: true
UserData:
Fn::Base64:
!Sub |
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${BootstrapArguments} ${ClusterName}-${Environment}-cluster
sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
sudo start amazon-ssm-agent
sudo sysctl -w vm.max_map_count=262144
/opt/aws/bin/cfn-signal --exit-code $? \
--stack ${AWS::StackName} \
--resource NodeGroup \
--region ${AWS::Region}
Outputs:
NodeInstanceRole:
Description: The node instance role
Value: !GetAtt NodeInstanceRole.Arn
Export:
Name: !Sub "${ClusterName}-${Environment}-cluster-nodegroup-rolearn"
NodeSecurityGroup:
Description: The security group for the node group
Value: !Ref NodeSecurityGroup
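Since the NodeSecurityGroupIngress rule in the template above is what allows node-to-node (and therefore cross-node pod-to-pod) traffic, one thing I can verify is that the rule actually exists on the deployed group. A sketch, with the group name derived from the template's naming convention and my parameter values:
# Inspect the ingress rules on the node security group created by the template above
aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=amundsen-eks-dev-cluster-security-group" \
  --query 'SecurityGroups[].IpPermissions' \
  --profile myprofile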
amazon-eks-cluster.yaml:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Amazon EKS - Cluster'
Parameters:
VPCStack:
Type: String
Description: VPC Stack Name
ClusterName:
Type: String
Description: EKS Cluster Name (i.e. "eks")
Environment:
Type: String
Description: Environment for this Cluster (i.e. "dev") which will be appended to the ClusterName (i.e. "eks-dev")
Resources:
ClusterRole:
Description: Allows EKS to manage clusters on your behalf.
Type: AWS::IAM::Role
Properties:
RoleName: !Sub "${ClusterName}-${Environment}-cluster-role"
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
Effect: Allow
Principal:
Service:
- eks.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
- arn:aws:iam::aws:policy/AmazonEKSServicePolicy
Policies:
-
PolicyName: "change-r53-recordsets"
PolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: Allow
Action: route53:ChangeResourceRecordSets
Resource: !Sub
- "arn:aws:route53:::hostedzone/${ZoneId}"
- {ZoneId: !ImportValue DNS-AccountZoneID}
-
PolicyName: "list-r53-resources"
PolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: Allow
Action:
- route53:ListHostedZones
- route53:ListResourceRecordSets
Resource: "*"
ClusterControlPlaneSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupName: !Sub "${ClusterName}-${Environment}-cluster-control-plane-sg"
GroupDescription: Cluster communication with worker nodes
VpcId:
Fn::ImportValue:
!Sub "${VPCStack}-vpcid"
Cluster:
Type: "AWS::EKS::Cluster"
Properties:
Version: "1.14"
Name: !Sub "${ClusterName}-${Environment}-cluster"
RoleArn: !GetAtt ClusterRole.Arn
ResourcesVpcConfig:
SecurityGroupIds:
- !Ref ClusterControlPlaneSecurityGroup
SubnetIds:
- Fn::Select:
- 0
- Fn::Split:
- ","
- Fn::ImportValue:
!Sub "${VPCStack}-privatesubnets"
- Fn::Select:
- 1
- Fn::Split:
- ","
- Fn::ImportValue:
!Sub "${VPCStack}-privatesubnets"
- Fn::Select:
- 2
- Fn::Split:
- ","
- Fn::ImportValue:
!Sub "${VPCStack}-privatesubnets"
Route53Cname:
Type: "AWS::Route53::RecordSet"
Properties:
HostedZoneId: !ImportValue DNS-AccountZoneID
Comment: CNAME for Control Plane Endpoint
Name: !Sub
- "k8s.${Environment}.${Zone}"
- { Zone: !ImportValue Main-zone-name}
Type: CNAME
TTL: '900'
ResourceRecords:
- !GetAtt Cluster.Endpoint
Outputs:
ClusterName:
Value: !Ref Cluster
Description: Cluster Name
Export:
Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterName"
ClusterArn:
Value: !GetAtt Cluster.Arn
Description: Cluster Arn
Export:
Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterArn"
ClusterEndpoint:
Value: !GetAtt Cluster.Endpoint
Description: Cluster Endpoint
Export:
Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterEndpoint"
ClusterControlPlaneSecurityGroup:
Value: !Ref ClusterControlPlaneSecurityGroup
Description: ClusterControlPlaneSecurityGroup
Export:
Name: !Sub "${ClusterName}-${Environment}-cluster-ClusterControlPlaneSecurityGroup"
cluster-parameters.json:
[
{
"ParameterKey": "VPCStack",
"ParameterValue": "Main"
},
{
"ParameterKey": "ClusterName",
"ParameterValue": "amundsen-eks"
},
{
"ParameterKey": "Environment",
"ParameterValue": "dev"
}
]
nodegroup-parameters.json:
[
{
"ParameterKey": "KeyName",
"ParameterValue": "data-warehouse-dev"
},
{
"ParameterKey": "NodeImageId",
"ParameterValue": "ami-08739803f18dcc019"
},
{
"ParameterKey": "NodeInstanceType",
"ParameterValue": "r5.2xlarge"
},
{
"ParameterKey": "NodeAutoScalingGroupMinSize",
"ParameterValue": "1"
},
{
"ParameterKey": "NodeAutoScalingGroupMaxSize",
"ParameterValue": "3"
},
{
"ParameterKey": "NodeAutoScalingGroupDesiredCapacity",
"ParameterValue": "2"
},
{
"ParameterKey": "NodeVolumeSize",
"ParameterValue": "20"
},
{
"ParameterKey": "ClusterName",
"ParameterValue": "amundsen-eks"
},
{
"ParameterKey": "Environment",
"ParameterValue": "dev"
},
{
"ParameterKey": "BootstrapArguments",
"ParameterValue": ""
},
{
"ParameterKey": "VpcId",
"ParameterValue": "Main"
},
{
"ParameterKey": "Subnets",
"ParameterValue": "Main-privatesubnets"
}
]
And the creation scripts:
cluster:
aws cloudformation create-stack \
  --stack-name amundsen-eks-cluster \
  --parameters file://./cluster-parameters.json \
  --template-body file://../../../../templates/cloud-formation/eks/amazon-eks-cluster.yaml \
  --capabilities CAPABILITY_NAMED_IAM --profile myprofile
nodegroup:
aws cloudformation create-stack \
  --stack-name amundsen-eks-cluster-nodegroup \
  --parameters file://./nodegroup-parameters.json \
  --template-body file://../../../../templates/cloud-formation/eks/amazon-eks-nodegroup.yaml \
  --capabilities CAPABILITY_NAMED_IAM --profile myprofile
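For completeness, a node-level variant of the probe that I could run on whichever node hosts the web pod (via SSH or SSM, assuming nc is installed on the node), to separate a problem in the pod's network namespace from a problem on the node itself:
# On the node running airflow-web (found via kubectl get pods -o wide):
nc -zvw2 172.20.166.209 5432      # the service ClusterIP from the nslookup output
nc -zvw2 <postgres-pod-ip> 5432   # the backing pod IP (placeholder)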
What would cause this behavior, and what else could I check to narrow it down?