Parse a nested variable from YAML file in bash

3/1/2019

A complex .yaml file from this link needs to be fed into a bash script that runs as part of an automation program running on an EC2 instance of Amazon Linux 2. Note that the .yaml file in the link above contains many objects, and that I need to extract one of the environment variables defined inside one of the many objects that are defined in the file.

Specifically, how can I extract the 192.168.0.0/16 value of the CALICO_IPV4POOL_CIDR variable into a bash variable?

        - name: CALICO_IPV4POOL_CIDR
          value: "192.168.0.0/16"

I have read a lot of other postings and blog entries about parsing flatter, simpler .yaml files, but none of those other examples show how to extract a nested value like the value of CALICO_IPV4POOL_CIDR in this question.

-- CodeMed
bash
kubernetes
yaml

5 Answers

3/1/2019

You have two problems there:

  • How to read a YAML document from a file with multiple documents
  • How to select the key you want from that YAML document

I have guessed, from reading Gregory Nisbet's answer, that you need the YAML document of kind 'DaemonSet'.

I will try to use only tools that are likely to be installed on your system already, because you mentioned you want to do this in a Bash script. I assume you have jq, because it is hard to do much in Bash without it!

For parsing the YAML I tend to use Ruby because:

  • Most systems have a Ruby
  • Ruby's Psych library has been bundled since Ruby 1.9
  • The PyYAML library in Python is a bit inflexible and sometimes broken compared to Ruby's in my experience
  • The YAML library in Perl is often not installed by default

It was suggested to use yq, but that won't help so much in this case because you still need a tool that can extract the YAML document.

Having extracted the document I am going to again use Ruby to save the file as JSON. Then we can use jq.

Extracting the YAML document

To get the YAML document using Ruby and save it as JSON:

url=...
curl -s "$url" | \
  ruby -ryaml -rjson -e \
    "puts YAML.load_stream(ARGF.read)
      .select{|doc| doc['kind']=='DaemonSet'}[0].to_json" \
  | jq . > calico.json

Further explanation:

  • The YAML.load_stream reads the YAML documents and returns them all as an Array
  • ARGF.read reads all of the input, which here is the data piped in on STDIN
  • Ruby's select allows easy selection of the YAML document according to its kind key
  • Then we take the first matching element (index 0) and convert it to JSON.

I pass that response through jq . so that it's formatted for human readability but that step isn't really necessary. I could do the same in Ruby but I'm guessing you want Ruby code kept to a minimum.
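The extraction step can be tried in isolation on a small inline sample. This is only a sketch: it assumes ruby is on the PATH, and the two-document YAML below is made up for illustration, not taken from the real calico.yaml.

```shell
#!/usr/bin/env bash
# Hypothetical two-document YAML stream standing in for calico.yaml.
yaml='kind: ConfigMap
---
kind: DaemonSet
metadata:
  name: calico-node'

if command -v ruby >/dev/null 2>&1; then
  # load_stream returns every document; select keeps the DaemonSet ones;
  # [0] takes the first of those; to_json serialises it.
  doc_json=$(printf '%s\n' "$yaml" |
    ruby -ryaml -rjson -e \
      "puts YAML.load_stream(ARGF.read).select { |doc| doc['kind'] == 'DaemonSet' }[0].to_json")
  echo "$doc_json"
else
  echo "ruby is not installed" >&2
fi
```

Note that if no document matches, [0] is nil and the command prints null, so it may be worth checking for that before going on to jq.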

Selecting the key you want

To select the key you want, the following jq query can be used:

jq -r \
  '.spec.template.spec.containers[].env[] | select(.name=="CALICO_IPV4POOL_CIDR") | .value' \
  calico.json

Further explanation:

  • The first part, .spec.template.spec.containers[].env[], iterates over all containers and over every env entry inside them
  • Then we select the Hash whose name key equals CALICO_IPV4POOL_CIDR and return its value
  • The -r prints the raw string, without the surrounding JSON quotes
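To see the query work without downloading anything, here is a small sketch that runs it against an inline JSON sample. The JSON is a made-up, cut-down stand-in for the converted DaemonSet document (the container names and the other env entries are only illustrative), and it assumes jq is installed.

```shell
#!/usr/bin/env bash
# Hypothetical, cut-down stand-in for calico.json.
json='{"spec":{"template":{"spec":{"containers":[
  {"name":"calico-node","env":[
    {"name":"DATASTORE_TYPE","value":"kubernetes"},
    {"name":"CALICO_IPV4POOL_CIDR","value":"192.168.0.0/16"}]},
  {"name":"install-cni","env":[
    {"name":"CNI_CONF_NAME","value":"10-calico.conflist"}]}
]}}}}'

if command -v jq >/dev/null 2>&1; then
  # Same query as above: every env entry of every container,
  # filtered by name, then reduced to its value.
  cidr=$(printf '%s\n' "$json" | jq -r \
    '.spec.template.spec.containers[].env[] | select(.name=="CALICO_IPV4POOL_CIDR") | .value')
  echo "$cidr"
else
  echo "jq is not installed" >&2
fi
```

Capturing the result with $(...) is also how you would get the value into a Bash variable, which is what the question asks for.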

Putting it all together:

#!/usr/bin/env bash

url='https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml'

curl -s "$url" | \
  ruby -ryaml -rjson -e \
    "puts YAML.load_stream(ARGF.read)
      .select{|doc| doc['kind']=='DaemonSet'}[0].to_json" \
  | jq . > calico.json

jq -r \
  '.spec.template.spec.containers[].env[] | select(.name=="CALICO_IPV4POOL_CIDR") | .value' \
  calico.json

Testing:

▶ bash test.sh
192.168.0.0/16

-- Alex Harvey
Source: StackOverflow

3/1/2019

If you're able to install new dependencies, and are planning on dealing with lots of YAML files, yq is a wrapper around jq that can handle YAML. It'd allow a safe (non-grep) way of accessing nested YAML values.

Usage would look something like MY_VALUE=$(yq '.myValue.nested.value' < config-file.yaml)

Alternatively, How can I parse a YAML file from a Linux shell script? has a bash-only parser that you could use to get your value.

-- willis

3/1/2019

MYVAR=$(\
curl https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml | \
grep -A 1 CALICO_IPV4POOL_CIDR | \
grep value | \
cut -d ':' -f2 | \
tr -d ' "')

Replace the curl command with however you're sourcing the file. Its output is piped to grep -A 1 CALICO_IPV4POOL_CIDR, which gives you two lines of text: the name line and the value line. Those are piped to grep value, which leaves just the value line. That is piped to cut -d ':' -f2, which uses the colon as a delimiter and returns the second field. Finally, tr -d ' "' deletes the spaces and double quotes. $(...) runs the enclosed pipeline, and its output is assigned to MYVAR. After this script, echo "$MYVAR" should print 192.168.0.0/16.
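Here is a sketch of the same pipeline run against an inline sample instead of the downloaded file, so the stages can be checked without network access. The sample text is a made-up excerpt in the same shape as the env list in calico.yaml; the neighbouring entries are only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical excerpt standing in for the relevant part of calico.yaml.
sample='          env:
            - name: DATASTORE_TYPE
              value: "kubernetes"
            - name: CALICO_IPV4POOL_CIDR
              value: "192.168.0.0/16"
            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"'

# Stage 1: grep -A 1 prints the matching name line plus the line after it.
# Stage 2: grep value keeps only the value line.
# Stage 3: cut returns the second colon-separated field.
# Stage 4: tr deletes spaces and double quotes.
MYVAR=$(printf '%s\n' "$sample" |
  grep -A 1 CALICO_IPV4POOL_CIDR |
  grep value |
  cut -d ':' -f2 |
  tr -d ' "')

echo "$MYVAR"
```

One limitation to be aware of: because cut returns only the second colon-separated field, a value that itself contains a colon (an IPv6 CIDR, for example) would be truncated.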

-- P Ackerman

3/1/2019

The right way to do this is to use a scripting language and a YAML parsing library to extract the field you're interested in.

Here's an example of how to do it in Python. If you were doing this for real you'd probably split it out into multiple functions and have better error reporting. This is literally just to illustrate some of the difficulties caused by the format of calico.yaml, which is several YAML documents concatenated together, not just one. You also have to loop over some of the lists internal to the document in order to extract the field you're interested in.

#!/usr/bin/env python3

import yaml

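def foo():
    with open('/tmp/calico.yaml', 'r') as fil:
        # calico.yaml is a stream of several YAML documents, so load them all
        # and pick the first one of kind DaemonSet.
        for candidate in yaml.safe_load_all(fil):
            # An empty document loads as None, so check the type first.
            if isinstance(candidate, dict) and candidate.get("kind") == "DaemonSet":
                doc = candidate
                break
        else:
            raise ValueError("no YAML document of kind DaemonSet")
    # Walk spec.template.spec.containers[].env[] looking for the variable.
    containers = doc["spec"]["template"]["spec"]["containers"]
    for container in containers:
        # A container may have no env list at all.
        for entry in container.get("env", []):
            if entry["name"] == "CALICO_IPV4POOL_CIDR":
                return entry["value"]
    raise ValueError("no CALICO_IPV4POOL_CIDR entry")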

print(foo())

However, sometimes you need a solution right now and shell scripts are very good at that.

If you're hitting an API endpoint, then the YAML will usually be pretty-printed so you can get away with extracting text in ways that won't work on arbitrary YAML.

Something like the following should be fairly robust:

grep -A1 CALICO_IPV4POOL_CIDR /tmp/calico.yaml | grep value: | cut -d: -f2 | tr -d ' "'

It's still worth checking at the end, with a regex, that the extracted value really is valid IPv4 CIDR notation.
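One way such a check might look in Bash is sketched below. The function name is_ipv4_cidr is made up for this example, and the regex is only a syntax check: four octets in the range 0-255 without leading zeros, then a prefix length of 0-32. It does not verify that the host bits of the address are zero.

```shell
#!/usr/bin/env bash
# is_ipv4_cidr is a hypothetical helper name, not part of any tool.
# Succeeds if $1 looks like a.b.c.d/n with 0<=a,b,c,d<=255 and 0<=n<=32.
is_ipv4_cidr() {
  local octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
  local prefix='(3[0-2]|[12]?[0-9])'
  local re="^${octet}\.${octet}\.${octet}\.${octet}/${prefix}\$"
  [[ $1 =~ $re ]]
}

cidr='192.168.0.0/16'
if is_ipv4_cidr "$cidr"; then
  echo "valid: $cidr"
else
  echo "invalid: $cidr" >&2
fi
```

Keeping the regex in a variable and leaving it unquoted on the right of =~ makes Bash treat it as a pattern rather than a literal string.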

The key thing here is grep -A1 CALICO_IPV4POOL_CIDR .

The two-element dictionary you mentioned (shown below) will always appear as one chunk since it's a subtree of the YAML document.

    - name: CALICO_IPV4POOL_CIDR
      value: "192.168.0.0/16"

The keys in calico.yaml are not sorted alphabetically in general, but in {"name": <something>, "value": <something else>} constructions, name does consistently appear before value.
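Because calico.yaml is several YAML documents concatenated together, a text-only approach can also narrow the input to the DaemonSet document before grepping. The following is a rough sketch, not a YAML parser: it assumes documents are separated by lines consisting of --- and that the kind appears as an unindented "kind: DaemonSet" line. The sample text and the emit_doc function name are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical two-document sample standing in for calico.yaml.
sample='kind: ConfigMap
metadata:
  name: calico-config
---
kind: DaemonSet
spec:
  template:
    spec:
      containers:
        - name: calico-node
          env:
            - name: CALICO_IPV4POOL_CIDR
              value: "192.168.0.0/16"'

# Buffer each document and print it at its end only if a top-level
# "kind: DaemonSet" line was seen inside it.
daemonset_doc=$(printf '%s\n' "$sample" | awk '
  function emit_doc() { if (wanted) printf "%s", buf; buf = ""; wanted = 0 }
  /^---[ \t]*$/ { emit_doc(); next }
  /^kind:[ \t]*DaemonSet[ \t]*$/ { wanted = 1 }
  { buf = buf $0 "\n" }
  END { emit_doc() }
')

# Then apply the same text extraction to that one document only.
cidr=$(printf '%s\n' "$daemonset_doc" |
  grep -A1 CALICO_IPV4POOL_CIDR | grep value: | cut -d: -f2 | tr -d ' "')
echo "$cidr"
```

This guards against the same variable name turning up in some other document of the file, at the cost of a few more lines.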

-- Gregory Nisbet

3/1/2019

As others have commented, yq (together with jq) is recommended if it is available.
If so, please try the following:

value=$(yq -r 'recurse | select(.name? == "CALICO_IPV4POOL_CIDR") | .value' "calico.yaml")
echo "$value"

Output:

192.168.0.0/16

-- tshiono