#TECH

Error handling strategies for effortless execution of tasks in Ansible

Ansible is a tool that helps you configure one or more computers on a network with a single command. Its modules (small programs that Ansible pushes to managed machines) are easy to understand and implement.

Ansible is used for application deployment, workstation and server updates, intra-service orchestration, configuration management, cloud provisioning, and more — basically any task a system administrator needs to perform on a daily basis.

How Ansible works

Ansible works by connecting to your nodes and pushing out small programs, called “Ansible modules”, to them. Ansible then executes these modules (over SSH by default) and removes them when finished.

The module library can be kept on any machine; no daemons, servers, or databases are required.
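As a concrete illustration of this push-and-execute model, a minimal playbook might target every host in the inventory with the built-in ping module (a hypothetical sketch, not taken from the article's own playbooks):

```yaml
# Minimal hypothetical playbook: Ansible pushes the built-in 'ping'
# module to each inventory host over SSH, runs it, then removes it.
- name: Verify connectivity to all hosts
  hosts: all
  tasks:
    - name: Push and execute the ping module
      ping:
```

Running it with `ansible-playbook -i <inventory> <playbook>.yml` would report each reachable host as successful.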

Let’s have a look at how Ansible works through the picture below:

The node that controls the entire execution of the playbook is called the management node.

The inventory file provides the list of hosts where the modules need to run. The management node opens SSH connections to the host machines, executes the small modules on them, and installs the required product or software.

Ansible only needs to be installed on the control machine; a single control machine can manage many remote machines.

Control Machine Requirements

Ansible can be run from any machine with Python 2 (version 2.6 or 2.7) or Python 3 (version 3.5 or higher) installed.

Consider this example: I have a cluster of a few nodes and I want to deploy the various modules required by my application. I use Ansible to do a hassle-free automated deployment on all the nodes simultaneously. But if there is an issue on any node, I don’t want to install the modules on the other nodes, then bring the failed node up and repeat the whole procedure for that single node. That breaks the consistency of the cluster installation and takes more effort. So I use the following two error handling strategies to prevent this.

What are we doing here?

  1. We will ensure the successful execution of tasks in a play through block and rescue (similar to try and except in Python).
  2. We will ensure that all the remote machines in a cluster (group of nodes) are up and running before a uniform deployment on all the nodes. If they are not up and running, the tasks meant for that case should be executed instead.
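Since block and rescue mirror Python's try and except, the first strategy can be sketched in plain Python (the function names here are hypothetical, purely to illustrate the control flow):

```python
def risky_task():
    """Stand-in for a task inside a block that may fail."""
    raise RuntimeError("task failed")

def run_with_rescue():
    try:
        # "block": the tasks we attempt first
        risky_task()
        return "Block executed successfully"
    except RuntimeError:
        # "rescue": runs only if a task in the block fails
        return "Executing tasks after block failed"
```

Just as in Ansible, the rescue branch runs only when something in the block raises an error; otherwise it is skipped entirely.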

Block and Rescue

With this approach we put the tasks to be performed in a block; if any task in the block fails, the tasks in the rescue section are executed.

Have a look at the code below:

---
- name: test my new module
  hosts: localhost
  tasks:
    - name: Include variables of hosts.yaml into 'host_yaml' and print them
      block:
        - name: Include variables of hosts.yaml into the 'host_yaml' variable
          include_vars:
            file: /home/neeraj/Documents/blog_dir/hosts.yaml
            name: host_yaml
          tags: always

        - debug:
            msg: "nodes {{ host_yaml.all.children.NODE.hosts.keys() | list }}"

        - debug:
            msg: "-------------------------Block executed successfully--------------------"

      rescue:
        - debug:
            msg: "-------------------------Executing tasks after block failed--------------------"

In the block we read the variables of the hosts.yaml file into the host_yaml variable, and then print the contents of host_yaml as a list.
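The playbook assumes an inventory-style hosts.yaml; judging by the access path host_yaml.all.children.NODE.hosts used above, it could look something like this (the node names and addresses here are hypothetical):

```yaml
# Hypothetical hosts.yaml matching the access path in the playbook:
# host_yaml.all.children.NODE.hosts.keys() | list -> ['node1', 'node2']
all:
  children:
    NODE:
      hosts:
        node1:
          ansible_host: 192.168.1.11
        node2:
          ansible_host: 192.168.1.12
```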

In the rescue we simply print a string to demonstrate how block and rescue work.

Scenarios/Output:

Block succeeds –

If the block succeeds, the following string is printed:

"-------------------------Block executed successfully--------------------"

Block fails –

If the block fails, the following string is printed:

"-------------------------Executing tasks after block failed--------------------"

Conditional Execution:

This is a condition-based approach in which tasks are executed only if the condition is satisfied.

Have a look at the code below:

- name: test my new module
  hosts: localhost
  tasks:
    - name: Include variables of hosts.yaml into the 'host_yaml'.
      include_vars:
        file: /home/neeraj/Documents/blog_dir/hosts.yaml
        name: host_yaml
      tags: always

    - debug:
        msg: "nodes {{ host_yaml.all.children.NODE.hosts.keys() | list }}"

    - name: Find if nodes are running
      check_nodes_active:
        nodes: "{{ host_yaml.all.children.NODE.hosts.keys() | list }}"
      register: cluster_status
      tags: always

    - debug:
        msg: "cluster_status {{ cluster_status }}"

    - name: Set state
      set_fact: 
        cluster_state: "{{ cluster_status.state }}"

    - debug:
        msg: "state - {{ cluster_state }}"
  

- hosts: localhost
  vars:
    phase: "update"
  tasks:
    - meta: end_play
      when: hostvars['localhost']['cluster_state'] == 'DOWN'

    - debug:
        msg: "-------------------------I am executed as all nodes are up--------------------------"


- hosts: localhost
  vars:
    phase: "upgrade"
  tasks:
    - meta: end_play
      when: hostvars['localhost']['cluster_state'] == 'DOWN'

    - debug:
        msg: "-------------------------I am executed as all nodes are up--------------------------"

Here, similar to the block and rescue code, we read the variables of the hosts.yaml file into the host_yaml variable.

We pass the list of nodes from host_yaml to a custom module which pings those nodes, checks whether they are active, and returns the cluster status as UP or DOWN. If any node is down, the status will be DOWN; if all the nodes are up and running, the status will be UP.
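check_nodes_active is the author's custom module and its source is not shown in the article. A minimal Python sketch of the core logic it describes, assuming a simple per-node ICMP ping as the reachability check, might look like:

```python
import subprocess

def node_is_reachable(node, timeout=2):
    """Hypothetical reachability check: send a single ping to the node."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), node],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def cluster_state(nodes, is_reachable=node_is_reachable):
    """Return 'UP' only if every node responds; 'DOWN' if any node is down."""
    return "UP" if all(is_reachable(n) for n in nodes) else "DOWN"
```

A real Ansible module would wrap this logic with AnsibleModule, read the nodes list from the module parameters, and return the state via module.exit_json so it appears in the registered cluster_status variable.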

Then we set a globally accessible variable, cluster_state, to the status our module returned, so it can be accessed across multiple sets of tasks.

In those sets of tasks we use meta: end_play with a condition: it stops any further execution of tasks in that play when the condition is met. So we stop further execution with meta: end_play guarded by a when clause that checks whether cluster_state is DOWN.

Output – if all nodes are up:

Since our Ansible module pinged google.com and received the status UP, all the tasks were executed; had the status been DOWN, the remaining tasks would have been skipped.