Installation of IBM API Connect on Amazon EKS

April 8, 2020

A cost-efficient setup of IBM API Connect (APIC) has been deployed to Amazon EKS. To achieve this, per environment, a single instance of the API Manager, Developer Portal and Analytics subsystem is deployed to a ‘mgmt’ worker node. In the production environment three replicas of the API Gateway are deployed, one gateway per availability zone. 

Having single instances of the mgmt nodes and a single instance of the gateway node in the test environment lowers the cost, with the downside that the subsystems on these nodes are not highly available.

A single (managed) Kubernetes cluster hosts both the production and the test environment. Separate namespaces are used: ‘tst-apic’ for the test environment and ‘apic’ for the production environment.

Architecture

Components and storage types used:

Component    Role
Bastion      Hosts the IBM installation images, entry point to the cluster during installation
S3 storage   Storage for backup of the API Manager, Developer Portal and Analytics subsystems
ECR          Registry for hosting the docker images of APIC
EBS          Block storage as required by APIC
CloudWatch   Used to store the logs
EmptyDir     Kubernetes type of storage which is used by the gateways (the gateways also use EBS)

In addition, in front of APIC there is a NetScaler with FortiGate firewalls around it followed by an Amazon load balancer. 

IBM API Connect license

The customer purchased the IBM API Connect Enterprise Tier Hybrid Million license. This hybrid entitlement allows for deployment to any cloud and/or on-premise location using an unrestricted selection of hardware. The components can be managed using a single web interface.

For this license customers need to count API calls on a periodic basis; a script for this can be found at: https://www.ibm.com/support/knowledgecenter/SSMNED_2018/com.ibm.apic.cmc.doc/tapic_analy_script_count_api_calls.html.

Versions

Component         Version
IBM API Connect   2018.4.1.9-ifix1.0
EKS/Docker        18.06.1-ce
EKS/Kubernetes    1.14
Helm              2.16.1

Cluster sizing

Environment   K8s namespace   #   Node name   Node type    Local storage on / [GB]   vCPU   Memory [GB]
TST           tst-apic        1   mgmt        m5.4xlarge   200                       16     64
TST           tst-apic        1   gw          t3.xlarge    100                       4      16
PRD           apic            1   mgmt        m5.4xlarge   200                       16     64
PRD           apic            3   gw          t3.xlarge    100                       4      16

Bastion

The bastion is the entry point to the cluster when installing the components.

Environment   Local storage [GB]     vCPU   Memory [GB]
Bastion       50 on /, 32 on /data   2      2

In practice 3 to 4 vCPU are available to the gateways, because some CPU is used by the NGINX ingress controller and the FluentD agent.

Local storage is needed because the OS requires space and /var/lib/docker/overlay2 consumes a lot of it: around 23 GB after a fresh install. EBS storage is allocated by the subsystems during installation.

Storage

EBS

The EBS/gp2 storage class is used; EBS is a supported type of block storage for APIC. Note that in AWS, EBS storage is tied to an availability zone.

If the mgmt components had been deployed in an HA configuration, the APIC components would synchronize the data between the different instances themselves (Cassandra, MySQL). In that scenario, having EBS storage tied to a specific availability zone would therefore still work.

ECR

The IBM installation images are uploaded to ECR. Unlike a standard docker registry, ECR does not create repositories automatically, so they need to be created beforehand:

aws ecr create-repository --repository-name apic-v2018.4.1.9/analytics-client --region eu-west-1
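
Because every image needs its own repository, creating them in a loop saves typing. The following is a minimal sketch, assuming the repository names are passed as arguments. Only analytics-client is taken from this installation; the function name is hypothetical and any other repository names depend on the images shipped with your APIC release.

```shell
# Create one ECR repository per image name under a common prefix.
# Sketch only: the list of image names is an assumption and must match
# the images that are actually uploaded.
create_ecr_repos() {
  local region="$1" prefix="$2"
  shift 2
  local name
  for name in "$@"; do
    aws ecr create-repository \
      --repository-name "${prefix}/${name}" \
      --region "${region}"
  done
}
```

It could be called as: create_ecr_repos eu-west-1 apic-v2018.4.1.9 analytics-client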

S3

S3 storage is used by the following subsystems:

  • API Manager
  • Developer Portal
  • Analytics

The backup can be configured for the API Manager and Developer Portal during installation; for these subsystems the S3 storage buckets are created beforehand.
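
Creating such a bucket beforehand can be done with the AWS CLI. The sketch below is an illustration rather than the exact commands used in this installation; the function name is hypothetical, and so is any bucket name you pass to it.

```shell
# Create a backup bucket in the given region.
# Note: outside us-east-1 the LocationConstraint must be passed explicitly;
# for us-east-1 the --create-bucket-configuration option has to be left out.
create_backup_bucket() {
  local bucket="$1" region="$2"
  aws s3api create-bucket \
    --bucket "${bucket}" \
    --region "${region}" \
    --create-bucket-configuration "LocationConstraint=${region}"
}
```

For example: create_backup_bucket <bucket name for the API Manager backup> eu-west-1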

For the Analytics subsystem the S3 storage bucket needs to be created by apicup. In our case temporary permissions were given to create this bucket and were restricted again after creation. The bucket can be created using:

apicup subsys exec tst-analytics create-s3-repo tst-analytics eu-west-1 tst-api-analytics-backup-3dfd798e https://s3.amazonaws.com <S3 Secret Key ID> <S3 Secret Access Key> backup "" "" ""

apicup subsys exec tst-analytics list-repos
Name            Repo Type   Bucket                              BasePath   Region      Endpoint                   Chunk Size   Compress   Server Side Encryption
tst-analytics   s3          tst-api-analytics-backup-3dfd798e   backup     eu-west-1   https://s3.amazonaws.com   1gb          true       

CloudWatch

Log files of the containers and the OS are sent to CloudWatch; FluentD is set up for this. This approach was preferred over sending the logs to remote syslog servers because it is easy to set up and Amazon offers CloudWatch Logs Insights for analyzing the log data.

See: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs.html

EmptyDir

For a couple of items (e.g. the drouter-config) the API Gateway uses the emptyDir Kubernetes storage type. This implies that when a pod dies, this storage is deleted. The recommended approach is to run the API Gateway in an unmanaged way. In our case the password and time zone are configured beforehand.

Installation

When preparing the installation, one of the steps is to generate the output first. Values that are not configurable through the apiconnect-up.yml file can then be changed, in our case:

  • The scheduling of the components using affinity and node selectors;
  • The CPU request amount of the API gateway;
  • The additionalConfig of the API Gateway.

If scheduling is not configured, Kubernetes places the components on any worker node with enough capacity, so components belonging to the test environment can end up on worker nodes belonging to the production environment. To solve this, the worker nodes are labeled and referred to using affinity and node selectors. The result will look like:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role/eks_cluster_role_worker_node
            operator: In
            values:
            - tst-apic_gw


The node selector looks like:

          nodeSelector:
            node-role/eks_cluster_role_worker_node: tst-apic_mgmt
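
The affinity rules and node selectors above only work when the worker nodes carry the matching label. A minimal sketch of labeling a node and checking the result is given below; the function names are hypothetical and the node name has to be taken from kubectl get nodes.

```shell
# Label a worker node so that affinity rules and node selectors can target it.
label_worker_node() {
  local node="$1" role="$2"
  kubectl label node "${node}" \
    "node-role/eks_cluster_role_worker_node=${role}" --overwrite
}

# Show the value of the label as an extra column for every node.
show_worker_roles() {
  kubectl get nodes -L node-role/eks_cluster_role_worker_node
}
```

For example: label_worker_node <node name> tst-apic_gw, followed by show_worker_roles to verify.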

The CPU request cannot be 4 vCPU because the NGINX ingress controller and the FluentD agent also need some CPU. After setting the CPU request to 3 vCPU, Kubernetes was able to schedule the pod to the correct worker node.
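
The arithmetic behind this can be sketched as follows. The numbers are hypothetical round figures for a 4 vCPU node, not measurements from this cluster; the actual allocatable CPU and the CPU already requested by other pods are shown by kubectl describe node <node name>.

```shell
# Succeed when a CPU request (in millicores) still fits on a node.
# allocatable_m: CPU the node offers to pods
# used_m:        CPU already requested by other pods, such as the ingress
#                controller and the FluentD agent
cpu_request_fits() {
  local allocatable_m="$1" used_m="$2" request_m="$3"
  [ "${request_m}" -le $((allocatable_m - used_m)) ]
}

# Hypothetical example: 4000m allocatable, 500m already requested.
if cpu_request_fits 4000 500 3000; then echo "3000m fits"; fi
if ! cpu_request_fits 4000 500 4000; then echo "4000m does not fit"; fi
```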

API Gateway – additional configuration

Time-zone

First generate the installation output:

apicup subsys install tst-gateway --out=gateway-out

For the API Gateway the extra values file contains an additional config element:

  additionalConfig:
  -  domain: default
     config: config/additional.cfg

The additional.cfg file contains:

top; configure terminal;

%if% available "timezone"
timezone "CET-1CEST"
%endif%

The config/additional.cfg should be added to the ./gateway-out/helm/dynamic-gateway-service folder. 
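
Putting the file in place can be scripted. The sketch below writes the additional.cfg shown above into a chart folder passed as argument; the function name is hypothetical.

```shell
# Write config/additional.cfg into the generated gateway chart folder.
write_additional_cfg() {
  local chart_dir="$1"
  mkdir -p "${chart_dir}/config"
  # Quoted heredoc delimiter: the content is written literally.
  cat > "${chart_dir}/config/additional.cfg" <<'EOF'
top; configure terminal;

%if% available "timezone"
timezone "CET-1CEST"
%endif%
EOF
}
```

For example: write_additional_cfg ./gateway-out/helm/dynamic-gateway-service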

Password

The secret containing the password can be created using:

kubectl create secret generic admin-credentials --from-literal=password=<the password for the TST env> --from-literal=salt=lcan --from-literal=method=sha256 -n tst-apic

The reference to admin-credentials should be put in the values file:

vi ./gateway-out/helm/dynamic-gateway-service/values.yaml
  adminUserSecret: "admin-credentials"

Note that this way the password is also used by the DataPower monitor. When, for example, a hashed password is put in the additional.cfg instead, that password is only used by the API Gateway and is not known by the DataPower monitor. Errors due to passwords that differ between the DataPower monitor and the gateway will show in the logs (see Testing the API gateway peering later on).

Recreate the dynamic-gateway-service-1.0.56.tgz file under ./gateway-out/helm:

rm dynamic-gateway-service-1.0.56.tgz
tar czvf dynamic-gateway-service-1.0.56.tgz dynamic-gateway-service

The subsystem can now be installed using:

apicup subsys install tst-gw --debug --plan-dir=./gateway-out --no-verify

Verify scheduling

To see if the pods are scheduled to the correct worker node use:

kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name -n tst-apic

When all is configured correctly and deployed, node selectors have to be added to the kubernetes CronJobs using:

kubectl edit cronjob -n tst-apic <name>

          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - {}
          nodeSelector:
            node-role/eks_cluster_role_worker_node: tst-apic_mgmt
          restartPolicy: OnFailure
          schedulerName: default-scheduler

Note that in AWS no imagePullSecret is needed. Because this value was left empty in the apiconnect-up.yml, the generated entry (- {}) has invalid syntax when editing the cron job YAML. So, when adding the node selector, delete the imagePullSecrets entry in order to be able to save the changes.
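
As an alternative to editing each cron job by hand, both changes can in principle be applied with a single JSON merge patch, in which a null value removes a key. This is a sketch that was not verified against this APIC release; the function name is hypothetical.

```shell
# Add the node selector and drop the empty imagePullSecrets entry in one
# JSON merge patch (a null value removes the key from the pod template).
patch_cronjob_node_selector() {
  local namespace="$1" cronjob="$2" role="$3"
  kubectl patch cronjob "${cronjob}" -n "${namespace}" --type merge -p "{
    \"spec\": {\"jobTemplate\": {\"spec\": {\"template\": {\"spec\": {
      \"imagePullSecrets\": null,
      \"nodeSelector\": {\"node-role/eks_cluster_role_worker_node\": \"${role}\"}
    }}}}}
  }"
}
```

For example: patch_cronjob_node_selector tst-apic <name> tst-apic_mgmt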

For the API Manager and API Gateway extra values files are used. For the API Manager the settings can be found at: https://www.ibm.com/support/knowledgecenter/SSMNED_2018/com.ibm.apic.install.doc/tapic_install_extraValues_Kubernetes.html

Persistent volumes

By default the gp2 storage class has reclaimPolicy: Delete. The storage class can be changed before installing, or the persistent volumes can be edited afterwards to have persistentVolumeReclaimPolicy: Retain. When the persistent volume claim is then deleted, the persistent volume is kept, which is a safer option.
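
Changing the policy on existing volumes can be done with kubectl patch. The sketch below shows one way to do it; the function names are hypothetical and the volume names have to be taken from kubectl get pv.

```shell
# Switch an existing persistent volume to the Retain reclaim policy.
retain_persistent_volume() {
  local pv="$1"
  kubectl patch pv "${pv}" \
    -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# List the volumes with their current reclaim policy and bound claim.
list_reclaim_policies() {
  kubectl get pv \
    -o custom-columns=NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name
}
```

Running list_reclaim_policies before and after the patch shows whether the change was applied.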

Testing the API gateway peering (production environment)

The logs of the DataPower monitor pod show whether the gateways have peered correctly:

[2020-02-20T13:14:05.820Z] Checking pods matching app==dynamic-gateway-service,release==r5673b1bbde
[2020-02-20T13:14:05.836Z] Found 3 pods:
[2020-02-20T13:14:05.836Z]   Pod apic/r5673b1bbde-dynamic-gateway-service-0 (Running, Ready)
[2020-02-20T13:14:05.836Z]   Pod apic/r5673b1bbde-dynamic-gateway-service-1 (Running, Ready)
[2020-02-20T13:14:05.836Z]   Pod apic/r5673b1bbde-dynamic-gateway-service-2 (Running, Ready)
[2020-02-20T13:14:05.836Z] Sending GatewayPeeringStatus request to r5673b1bbde-dynamic-gateway-service-0 (172.26.66.26:5550)
[2020-02-20T13:14:05.837Z] Sending GatewayPeeringStatus request to r5673b1bbde-dynamic-gateway-service-1 (172.26.65.135:5550)
[2020-02-20T13:14:05.837Z] Sending GatewayPeeringStatus request to r5673b1bbde-dynamic-gateway-service-2 (172.26.67.247:5550)
[2020-02-20T13:14:29.805Z] Gateway Peering gwd on pod r5673b1bbde-dynamic-gateway-service-1 has no stale peers
[2020-02-20T13:14:29.805Z] Gateway Peering gwd on pod r5673b1bbde-dynamic-gateway-service-2 has no stale peers
[2020-02-20T13:14:29.805Z] Gateway Peering gwd on pod r5673b1bbde-dynamic-gateway-service-0 has no stale peers
[2020-02-20T13:14:29.805Z] Gateway Peering rate-limit on pod r5673b1bbde-dynamic-gateway-service-1 has no stale peers
[2020-02-20T13:14:29.805Z] Gateway Peering rate-limit on pod r5673b1bbde-dynamic-gateway-service-2 has no stale peers
[2020-02-20T13:14:29.805Z] Gateway Peering rate-limit on pod r5673b1bbde-dynamic-gateway-service-0 has no stale peers
[2020-02-20T13:14:29.806Z] Gateway Peering subs on pod r5673b1bbde-dynamic-gateway-service-1 has no stale peers
[2020-02-20T13:14:29.806Z] Gateway Peering subs on pod r5673b1bbde-dynamic-gateway-service-2 has no stale peers
[2020-02-20T13:14:29.806Z] Gateway Peering subs on pod r5673b1bbde-dynamic-gateway-service-0 has no stale peers
[2020-02-20T13:14:29.806Z] Gateway Peering tms on pod r5673b1bbde-dynamic-gateway-service-1 has no stale peers
[2020-02-20T13:14:29.806Z] Gateway Peering tms on pod r5673b1bbde-dynamic-gateway-service-2 has no stale peers
[2020-02-20T13:14:29.806Z] Gateway Peering tms on pod r5673b1bbde-dynamic-gateway-service-0 has no stale peers
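
To avoid reading the log line by line, the reports can be counted per peering group. The following is a small sketch that reads the monitor log from standard input; the function name is hypothetical and it only matches the log format shown above.

```shell
# Count the "no stale peers" reports per peering group (gwd, rate-limit,
# subs, tms) from DataPower monitor log lines on standard input.
peering_summary() {
  awk '/Gateway Peering .* has no stale peers/ { count[$4]++ }
       END { for (group in count) print group, count[group] }' | sort
}
```

It can be fed with kubectl logs -n apic <datapower monitor pod> | peering_summary. For the log above, each of the four groups should report a count of 3, one per gateway pod.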

Configuration

When configuring the SMTP mail server, the following unexpected behavior was seen:

  • Unsupported protocol
  • Time-out

Unsupported protocol

In our case the protocol had to be changed from SSLv3 to TLSv1.2. To see which protocol is used by the SMTP server, openssl can be used:

openssl s_client -connect smtp.<domain>:587 -starttls smtp

SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-SHA384
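
Since openssl prints a lot of output, the relevant line can be filtered out. The sketch below assumes the output format shown above; the function names are hypothetical.

```shell
# Print the negotiated protocol from `openssl s_client` output on stdin.
negotiated_protocol() {
  awk -F': *' '/^ *Protocol *:/ { print $2; exit }'
}

# Connect with STARTTLS and report the protocol. Redirecting stdin from
# /dev/null makes openssl close the session instead of waiting for input.
smtp_protocol() {
  local host="$1" port="${2:-587}"
  openssl s_client -connect "${host}:${port}" -starttls smtp </dev/null 2>/dev/null \
    | negotiated_protocol
}
```

For example: smtp_protocol smtp.<domain>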

Time-out

The time-out can be increased using the apic command (note that the DNS prefix of the cloud manager is tst-api-cloud):

./apic login -u admin -p <password> -s tst-api-cloud.<domain> -r admin/default-idp-1
./apic mail-servers:list -s tst-api-cloud.<domain> -o admin --fields name
./apic mail-servers:get <smtp-server> -s tst-api-cloud.<domain> -o admin

Edit the <mail-server-name>.yaml file, change the timeout value and set the correct password.

./apic mail-servers:update <smtp-server> smtp-server.yaml -s tst-api-cloud.<domain> -o admin
./apic mail-servers:list -s tst-api-cloud.<domain> -o admin --fields name,timeout
./apic logout --server tst-api-cloud.<domain>

Conclusion

The most important points to address were:

  • Deploying the API Gateway in an unmanaged way plus having custom configuration and a password different from the default one
  • Taking advantage of the different availability zones for each API gateway in production
  • Getting Kubernetes to schedule the different components to the correct worker nodes. To accomplish this, values of the helm charts need to be adjusted; a direct install using the apiconnect-up.yml configuration does not offer this.

After these points were figured out, it was straightforward to complete the installation.

Jaco Wisse

IT Architect @ Integration Designers

Integration Designers focusses on IBM products in the integration domain.

info@integrationdesigners.com
