Symptom
Cassandra pods fail to start in one of the regions of a multi-region Apigee hybrid setup. When the `overrides.yaml` file is applied, the Cassandra pods do not start successfully.
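For example, the symptom typically surfaces after re-applying the configuration in the affected region with a command along the following lines (this assumes the `apigeectl` workflow used later in this article; the overrides file name is a placeholder):

```bash
# Re-apply the hybrid configuration in the affected region.
# "overrides.yaml" is a placeholder for your own overrides file.
apigeectl apply -f overrides.yaml
```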
Error messages
- You will observe the following error message in the Cassandra pod logs:

Exception (java.lang.RuntimeException) encountered during startup: A node with address 10.52.18.40 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.

- You may also observe warnings in the Cassandra pod status.
Possible causes
This issue is typically observed in the following scenario:
- The Apigee runtime cluster is deleted in one of the regions.
- An attempt is made to reinstall the Apigee runtime cluster in that region with the Cassandra seed host configuration in the `overrides.yaml` file, as described in Multi-region deployment on GKE and GKE on-prem (see the illustrative fragment after this list).
- Deleting the Apigee runtime cluster does not remove the references to its Cassandra pods from the Cassandra cluster, so stale references to the Cassandra pods of the deleted cluster are retained. Because of this, when you try to reinstall the Apigee runtime cluster in the secondary region, the Cassandra pods complain that certain IP addresses already exist, since the new pods may be assigned IP addresses from the same subnet that was used earlier.
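The following is a minimal sketch of the kind of Cassandra seed host configuration involved. The property names follow the Apigee hybrid `overrides.yaml` conventions, but the values are assumptions shown for illustration only:

```yaml
# Illustrative overrides.yaml fragment for the secondary region (values are hypothetical).
cassandra:
  datacenter: "Secondary-DC2"           # data center name used by this region
  rack: "ra-1"                          # rack name
  multiRegionSeedHost: "10.50.112.194"  # IP of a Cassandra seed node in the primary region
```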
| Cause | Description |
|---|---|
| Stale references to deleted secondary region pods in Cassandra cluster | Deleting the Apigee runtime cluster in the secondary region does not remove the references to the IP addresses of the Cassandra pods in the secondary region. |
Cause: Stale references to deleted secondary region pods in Cassandra cluster
Diagnosis
- The error message in the Cassandra pod logs, `A node with address 10.52.18.40 already exists`, indicates that there is a stale reference to one of the secondary region Cassandra pods with the IP address `10.52.18.40`. Validate this by running the `nodetool status` command in the primary region; an illustrative example follows this list. If the stale reference exists, the IP address `10.52.18.40` associated with a Cassandra pod of the secondary region is still listed in the output.
- If the output contains stale references to Cassandra pods in the secondary region, it indicates that the secondary region has been deleted, but the IP addresses of its Cassandra pods have not been removed from the Cassandra cluster.
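The following sketch shows what the `nodetool status` output might look like in this situation. Apart from the stale IP address `10.52.18.40` taken from the error message above, all addresses, host IDs, and sizes are hypothetical placeholders, and the credentials are the same placeholders used elsewhere in this article:

```
kubectl -n apigee exec apigee-cassandra-default-0 -- nodetool -u admin_user -pw ADMIN.PASSWORD status

Datacenter: Primary-DC1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.50.112.194  690.79 KiB  256     100.0%            11111111-1111-1111-1111-111111111111  ra-1
UN  10.50.112.195  700.12 KiB  256     100.0%            22222222-2222-2222-2222-222222222222  ra-1
UN  10.50.112.196  712.40 KiB  256     100.0%            33333333-3333-3333-3333-333333333333  ra-1

Datacenter: Secondary-DC2
=========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
DN  10.52.18.40    686.22 KiB  256     100.0%            44444444-4444-4444-4444-444444444444  ra-1
DN  10.52.18.41    679.05 KiB  256     100.0%            55555555-5555-5555-5555-555555555555  ra-1
DN  10.52.18.42    681.31 KiB  256     100.0%            66666666-6666-6666-6666-666666666666  ra-1
```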
Resolution
Perform the following steps to remove the stale references to the Cassandra pods of the deleted cluster:
- Log into the container and connect to the Cassandra command-line interface by following the steps at Create the client container.
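As a minimal sketch, assuming a client pod named `cassandra-client` was created in the `apigee` namespace as described in Create the client container, logging in and connecting might look like this (the node IP and the credentials are placeholders):

```bash
# Assumed client pod name and namespace; adjust to your installation.
kubectl exec -it -n apigee cassandra-client -- bash

# From inside the client container, connect to a Cassandra node over TLS.
cqlsh 10.50.112.194 -u admin_user -p ADMIN.PASSWORD --ssl
```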
- After you have logged into the container and connected to the Cassandra `cqlsh` interface, run the following CQL query to list the current `keyspace` definitions:

select * from system_schema.keyspaces;

Sample output showing the current keyspaces. In the following output, `Primary-DC1` refers to the primary region and `Secondary-DC2` refers to the secondary region.

bash-4.4# cqlsh 10.50.112.194 -u admin_user -p ADMIN.PASSWORD --ssl
Connected to apigeecluster at 10.50.112.194:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
admin_user@cqlsh> Select * from system_schema.keyspaces;

 keyspace_name                        | durable_writes | replication
--------------------------------------+----------------+--------------------------------------------------------------------------------------------------------------
 system_auth                          |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 kvm_tsg1_apigee_hybrid_prod_hybrid   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 kms_tsg1_apigee_hybrid_prod_hybrid   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 system_schema                        |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 system_distributed                   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 system                               |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 perses                               |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 cache_tsg1_apigee_hybrid_prod_hybrid |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 rtc_tsg1_apigee_hybrid_prod_hybrid   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 quota_tsg1_apigee_hybrid_prod_hybrid |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 system_traces                        |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}

(11 rows)
As you can see, the keyspaces are referring to both `Primary-DC1` and `Secondary-DC2`, even though the Apigee runtime cluster was deleted in the secondary region. The stale references to `Secondary-DC2` have to be deleted from each of the `keyspace` definitions.

- Before deleting the stale references in the `keyspace` definitions, use the following command to delete the entire Apigee hybrid installation, except ASM (Istio) and cert-manager, from `Secondary-DC2`. For more information, refer to Uninstall hybrid runtime.

apigeectl delete -f YOUR_OVERRIDES_FILE.yaml --all
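Optionally, before moving on, you can check that the hybrid components in the secondary region are gone. This is a hedged example and assumes the components run in the `apigee` namespace:

```bash
# After the delete completes, no Apigee runtime pods should remain
# in the apigee namespace of the secondary region.
kubectl get pods -n apigee
```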
- Remove the stale references to `Secondary-DC2` from each of the `keyspaces` by altering the `keyspace` definition.

ALTER KEYSPACE system_auth WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE kvm_tsg1_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE kms_tsg1_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE system_distributed WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE perses WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE cache_tsg1_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE rtc_tsg1_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE quota_tsg1_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE system_traces WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
- Verify that the stale references to the `Secondary-DC2` region have been removed from all the `keyspaces` by running the following command:

select * from system_schema.keyspaces;
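As a hedged illustration (keyspace names are taken from the earlier sample and will differ per installation), the `replication` column of every keyspace should now reference only `Primary-DC1`, for example:

```
admin_user@cqlsh> select * from system_schema.keyspaces;

 keyspace_name | durable_writes | replication
---------------+----------------+-----------------------------------------------------------------------------------------
 system_auth   |           True | {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
 ...
```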
- Log in to a Cassandra pod of `Primary-DC1` and remove the references to the UUIDs of all the Cassandra pods of `Secondary-DC2`. The UUIDs can be obtained from the `nodetool status` command as described earlier in Diagnosis.

kubectl exec -it -n apigee apigee-cassandra-default-0 -- bash
nodetool -u admin_user -pw ADMIN.PASSWORD removenode UUID_OF_CASSANDRA_POD_IN_SECONDARY_DC2
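To find the UUIDs quickly, the sketch below assumes (as in the illustrative `nodetool status` output above) that the stale secondary-region nodes are reported as `DN` (Down/Normal) and that the Host ID column holds the UUID to pass to `removenode`; the credentials are placeholders:

```bash
# From inside the Cassandra pod: show only the Down nodes so their Host IDs
# (UUIDs) can be copied into removenode, one node per invocation.
nodetool -u admin_user -pw ADMIN.PASSWORD status | grep '^DN'
```

Repeat the `removenode` command once for each Host ID listed.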
- Verify that no Cassandra pods of `Secondary-DC2` are present by running the `nodetool status` command again.
- Install the Apigee runtime cluster in the secondary region (`Secondary-DC2`) by following the steps in Multi-region deployment on GKE and GKE on-prem.
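As a final hedged sanity check, you can run `nodetool status` from the primary region once more after the reinstall; all nodes in both datacenters should report `UN` (Up/Normal):

```bash
# Run from the primary region; credentials are placeholders.
kubectl -n apigee exec apigee-cassandra-default-0 -- nodetool -u admin_user -pw ADMIN.PASSWORD status
```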
Must gather diagnostic information
If the problem persists even after following the above instructions, gather the following diagnostic information and then contact Google Cloud Customer Care:
- The Google Cloud Project ID
- The name of the Apigee hybrid organization
- The `overrides.yaml` files from both the primary and secondary regions, masking any sensitive information.
- Kubernetes pod status in all namespaces of both the primary and secondary regions:

kubectl get pods -A > kubectl-pod-status`date +%Y.%m.%d_%H.%M.%S`.txt
- A Kubernetes `cluster-info` dump from both the primary and secondary regions:

# generate kubernetes cluster-info dump
kubectl cluster-info dump -A --output-directory=/tmp/kubectl-cluster-info-dump

# zip kubernetes cluster-info dump
zip -r kubectl-cluster-info-dump`date +%Y.%m.%d_%H.%M.%S`.zip /tmp/kubectl-cluster-info-dump/*
- The output of the below `nodetool` commands from the primary region:

export u=`kubectl -n apigee get secrets apigee-datastore-default-creds -o jsonpath='{.data.jmx\.user}' | base64 -d`
export pw=`kubectl -n apigee get secrets apigee-datastore-default-creds -o jsonpath='{.data.jmx\.password}' | base64 -d`

kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw info 2>&1 | tee /tmp/k_nodetool_info_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw describecluster 2>&1 | tee /tmp/k_nodetool_describecluster_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw failuredetector 2>&1 | tee /tmp/k_nodetool_failuredetector_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw status 2>&1 | tee /tmp/k_nodetool_status_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw gossipinfo 2>&1 | tee /tmp/k_nodetool_gossipinfo_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw netstats 2>&1 | tee /tmp/k_nodetool_netstats_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw proxyhistograms 2>&1 | tee /tmp/k_nodetool_proxyhistograms_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw tpstats 2>&1 | tee /tmp/k_nodetool_tpstats_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw gcstats 2>&1 | tee /tmp/k_nodetool_gcstats_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw version 2>&1 | tee /tmp/k_nodetool_version_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw ring 2>&1 | tee /tmp/k_nodetool_ring_$(date +%Y.%m.%d_%H.%M.%S).txt
References
- Apigee hybrid Multi-region deployment on GKE and GKE on-prem
- Kubernetes Documentation
- Cassandra `nodetool status` command