Istio mTLS works only between some services even though tls-check reports STATUS OK for all of them

8/29/2019

I am trying to enable mTLS in a mesh that already works with Istio sidecars. The problem is that connections succeed only up to a certain point in the call chain, and then they fail.

This is how the services are set up right now with my failing implementation of mTLS (simplified):

Istio IngressGateway -> NGINX pod -> API Gateway -> Service A -> [ Database ] -> Service B

The first thing to note is that I was using an NGINX pod as a load balancer to proxy_pass requests to either my API Gateway or my frontend page. I tried keeping that setup without the Istio IngressGateway, but I wasn't able to make it work. Then I tried using the Istio IngressGateway to connect directly to the API Gateway with a VirtualService, but that also fails for me. So I'm leaving it like this for the moment, because it is the only way my requests reach the API Gateway successfully.

Another thing to note is that Service A first connects to a Database outside the mesh and then makes a request to Service B, which is inside the mesh with mTLS enabled.

NGINX, API Gateway, Service A and Service B are within the mesh with mTLS enabled and "istioctl authn tls-check" shows that status is OK.

NGINX and API Gateway are in a namespace called "gateway", Database is in "auth" and Service A and Service B are in another one called "api".

Istio IngressGateway is in namespace "istio-system" right now.

So the problem is that everything works if I set STRICT mode on the gateway namespace and PERMISSIVE on api, but once I set STRICT on api, I see the request getting into Service A, and then Service A fails to send its request to Service B, returning a 500.

This is the output I see in the istio-proxy container of the Service A pod when it fails:

api/serviceA[istio-proxy]: [2019-09-02T12:59:55.366Z] "- - -" 0 - "-" "-" 1939 0 2 - "-" "-" "-" "-" "10.20.208.248:4567" outbound|4567||database.auth.svc.cluster.local 10.20.128.44:35366 10.20.208.248:4567 10.20.128.44:35364 -
api/serviceA[istio-proxy]: [2019-09-02T12:59:55.326Z] "POST /api/my-call HTTP/1.1" 500 - "-" "-" 74 90 60 24 "10.90.0.22, 127.0.0.1, 127.0.0.1" "PostmanRuntime/7.15.0" "14d93a85-192d-4aa7-aa45-1501a71d4924" "serviceA.api.svc.cluster.local:9090" "127.0.0.1:9090" inbound|9090|http-serviceA|serviceA.api.svc.cluster.local - 10.20.128.44:9090 127.0.0.1:0 outbound_.9090_._.serviceA.api.svc.cluster.local

There are no log messages in Service B, though.

Currently, I do not have a global MeshPolicy; I am setting a Policy and a DestinationRule per namespace.

Policy:

apiVersion: "authentication.istio.io/v1alpha1"
kind: "Policy"
metadata:
  name: "default"
  namespace: gateway
spec:
  peers:
    - mtls:
        mode: STRICT

---
apiVersion: "authentication.istio.io/v1alpha1"
kind: "Policy"
metadata:
  name: "default"
  namespace: auth
spec:
  peers:
    - mtls:
        mode: STRICT


---
apiVersion: "authentication.istio.io/v1alpha1"
kind: "Policy"
metadata:
  name: "default"
  namespace: api
spec:
  peers:
    - mtls:
        mode: STRICT
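
For contrast with the per-namespace policies above, a mesh-wide policy (which is not applied in this setup) would be a single cluster-scoped MeshPolicy. The following is only a minimal sketch, assuming the same authentication.istio.io/v1alpha1 API as the policies above:

```yaml
# Sketch only: mesh-wide counterpart of the namespace-level policies above.
# MeshPolicy is cluster-scoped, so it takes no namespace, and the
# mesh-wide policy has to be named "default".
apiVersion: "authentication.istio.io/v1alpha1"
kind: "MeshPolicy"
metadata:
  name: "default"
spec:
  peers:
    - mtls:
        mode: STRICT
```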

DestinationRule:

apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "mutual-gateway"
  namespace: "gateway"
spec:
  host: "*.gateway.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

---
apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "mutual-api"
  namespace: "api"
spec:
  host: "*.api.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

---
apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "mutual-auth"
  namespace: "auth"
spec:
  host: "*.auth.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

Then I have some DestinationRules to disable mTLS for the Database (there are other services in the same namespace for which I do want mTLS enabled) and for the Kubernetes API:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: "myDatabase"
  namespace: "auth"
spec:
  host: "database.auth.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: "k8s-api-server"
  namespace: default
spec:
  host: "kubernetes.default.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: DISABLE

Then I have my IngressGateway like so:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: ingress-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway # use istio default ingress gateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - my-api.example.com
      tls:
        httpsRedirect: true # sends 301 redirect for http requests
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
        privateKey: /etc/istio/ingressgateway-certs/tls.key
      hosts:
        - my-api.example.com

And lastly, my VirtualServices:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ingress-nginx
  namespace: gateway
spec:
  hosts:
    - my-api.example.com
  gateways:
    - ingress-gateway.istio-system
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            port:
              number: 80
            host: ingress.gateway.svc.cluster.local      # this is NGINX pod
      corsPolicy:
        allowOrigin:
          - my-api.example.com
        allowMethods:
          - POST
          - GET
          - DELETE
          - PATCH
          - OPTIONS
        allowCredentials: true
        allowHeaders:
          - "*"
        maxAge: "24h"

---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-gateway
  namespace: gateway
spec:
  hosts:
    - my-api.example.com
    - api-gateway.gateway.svc.cluster.local
  gateways:
    - mesh
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            port:
              number: 80
            host: api-gateway.gateway.svc.cluster.local
      corsPolicy:
        allowOrigin:
          - my-api.example.com
        allowMethods:
          - POST
          - GET
          - DELETE
          - PATCH
          - OPTIONS
        allowCredentials: true
        allowHeaders:
          - "*"
        maxAge: "24h"

One thing I don't understand is why I have to create a VirtualService for my API Gateway, and why I have to use "mesh" in the gateways block. If I remove this block, requests do not reach the API Gateway; if I keep it, they do, and they even reach the next service (Service A), but not the one after that (Service B).

Thanks for the help. I am really stuck with this.

Dump of listeners of ServiceA:

ADDRESS           PORT      TYPE
10.20.128.44      9090      HTTP
10.20.253.21      443       TCP
10.20.255.77      80        TCP
10.20.240.26      443       TCP
0.0.0.0           7199      TCP
10.20.213.65      15011     TCP
0.0.0.0           7000      TCP
10.20.192.1       443       TCP
0.0.0.0           4568      TCP
0.0.0.0           4444      TCP
10.20.255.245     3306      TCP
0.0.0.0           7001      TCP
0.0.0.0           9160      TCP
10.20.218.226     443       TCP
10.20.239.14      42422     TCP
10.20.192.10      53        TCP
0.0.0.0           4567      TCP
10.20.225.206     443       TCP
10.20.225.166     443       TCP
10.20.207.244     5473      TCP
10.20.202.47      44134     TCP
10.20.227.251     3306      TCP
0.0.0.0           9042      TCP
10.20.207.141     3306      TCP
0.0.0.0           15014     TCP
0.0.0.0           9090      TCP
0.0.0.0           9091      TCP
0.0.0.0           9901      TCP
0.0.0.0           15010     TCP
0.0.0.0           15004     TCP
0.0.0.0           8060      TCP
0.0.0.0           8080      TCP
0.0.0.0           20001     TCP
0.0.0.0           80        TCP
0.0.0.0           10589     TCP
10.20.128.44      15020     TCP
0.0.0.0           15001     TCP
0.0.0.0           9000      TCP
10.20.219.237     9090      TCP
10.20.233.60      80        TCP
10.20.200.156     9100      TCP
10.20.204.239     9093      TCP
0.0.0.0           10055     TCP
0.0.0.0           10054     TCP
0.0.0.0           10251     TCP
0.0.0.0           10252     TCP
0.0.0.0           9093      TCP
0.0.0.0           6783      TCP
0.0.0.0           10250     TCP
10.20.217.136     443       TCP
0.0.0.0           15090     HTTP

Dump clusters in json format: https://pastebin.com/73zmAPWg

Dump listeners in json format: https://pastebin.com/Pk7ddPJ2

Curl command from serviceA container to serviceB:

/opt/app # curl -X POST -v "http://serviceB.api.svc.cluster.local:4567/session/xxxxxxxx=?parameters=hi"
*   Trying 10.20.228.217...
* TCP_NODELAY set
* Connected to serviceB.api.svc.cluster.local (10.20.228.217) port 4567 (#0)
> POST /session/xxxxxxxx=?parameters=hi HTTP/1.1
> Host: serviceB.api.svc.cluster.local:4567
> User-Agent: curl/7.61.1
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host serviceB.api.svc.cluster.local left intact
curl: (52) Empty reply from server

If I disable mTLS, the curl request from serviceA reaches serviceB.

-- codiaf
istio
kubernetes
mtls

1 Answer

9/8/2019

General tips for debugging Istio service mesh:

  1. Check Istio's requirements for services and pods.
  2. Pick the task from the list of Istio tasks that is closest to what you are trying to do. Check whether that task works, then look for the differences between it and your setup.
  3. Follow the instructions in the Istio troubleshooting section.
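
As an illustration of the first tip: Istio's requirements for pods and services at the time of this question included naming every Service port as <protocol>[-<suffix>]. A minimal sketch of what that could look like for Service B follows. The question does not show Service B's manifest, so the lowercase Service name, the app label, the http prefix, and the targetPort are all assumptions; 4567 is the port used in the curl command.

```yaml
# Hypothetical Service for Service B; names and labels are illustrative only.
apiVersion: v1
kind: Service
metadata:
  name: serviceb            # assumed; Kubernetes Service names must be lowercase
  namespace: api
  labels:
    app: serviceb           # assumed label
spec:
  selector:
    app: serviceb           # assumed to match the pod's app label
  ports:
    - name: http-serviceb   # <protocol>[-<suffix>]; "http" is an assumption about the protocol
      port: 4567            # port taken from the curl command in the question
      targetPort: 4567      # assumed to equal the container port
```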
-- Vadim Eisenberg
Source: StackOverflow