Symptom
Cassandra pods can enter a CrashLoopBackOff state during installation or upgrade of Apigee hybrid.
Error messages
The output of kubectl get pods, when run in the apigee namespace, shows one or more Cassandra pods in CrashLoopBackOff status:
kubectl get pods -n apigee
NAME                         READY   STATUS             RESTARTS   AGE
apigee-cassandra-default-0   0/1     CrashLoopBackOff   0          9m
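To narrow down which of the causes below applies, start by reviewing the failing pod's events and logs. The following is a minimal sketch using standard kubectl commands, assuming the pod name shown in the output above:
# Show recent events and state transitions for the failing pod.
kubectl describe pod apigee-cassandra-default-0 -n apigee
# Show logs from the current container and from the previously crashed container.
kubectl logs apigee-cassandra-default-0 -n apigee
kubectl logs apigee-cassandra-default-0 -n apigee --previous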
Possible causes
Cassandra pods can enter CrashLoopBackOff status for various reasons.
| Cause | Description |
| --- | --- |
| Failed hostname resolution | UnknownHostException thrown during DNS resolution |
| Incorrect image | Incorrect image URL used in overrides.yaml |
| Expansion problem | Connectivity issue to the Cassandra seed host |
| Port conflict | Port 7000 or 7001 already in use |
| Same IP address | A node with address /10.10.x.x already exists |
Cause 1: Failed hostname resolution
Hostname resolution for Cassandra nodes fails due to DNS configuration issues in the cluster. The Cassandra pod logs might show a log entry similar to the following:
ERROR [main] 2025-01-12 13:23:34,569 CassandraDaemon.java:803 - Local host name unknown: java.net.UnknownHostException: ip-xx-xx-xx-xx.example.com: ip-xx-xx-xx-xx.example.com: Name or service not known
Resolution
The worker node where the Cassandra pod is scheduled to come up should be able to resolve the hostname to a valid IP address through the cluster DNS service. As a workaround, you can run the following command on all the worker nodes:
echo -e "\n127.0.1.1 ${HOSTNAME}" >> "/etc/hosts"
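To verify that the workaround took effect, you can check hostname resolution on the worker node. This is a minimal sketch, assuming shell access to the node and that getent is available:
# Check whether the node can resolve its own hostname to an IP address.
getent hosts "${HOSTNAME}" || echo "${HOSTNAME} does not resolve"
# After appending the /etc/hosts entry, the lookup should now succeed (typically returning 127.0.1.1 from /etc/hosts).
getent hosts "${HOSTNAME}"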
Cause 2: Incorrect image
An incorrect Cassandra image is configured for the installation.
Diagnosis
Check the overrides.yaml file to make sure the correct image is configured for Cassandra. Here is a sample stanza from overrides.yaml with the correct Cassandra image:
cassandra:
  image:
    url: "gcr.io/apigee-release/hybrid/apigee-hybrid-cassandra"
    tag: "1.15.0"
    pullPolicy: IfNotPresent
Resolution
Ensure that the name and version of the Cassandra image are accurate in the overrides.yaml file, then run the install or upgrade command again. You can use the list option of the apigee-pull-push script to list all the images in your repository, and then review those images to confirm that they all belong to the intended version of Apigee hybrid.
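For example, the image list can be compared against the Cassandra entry in overrides.yaml. This is a minimal sketch, assuming the apigee-pull-push script is run from the directory where it was downloaded; the exact flag spelling of its list option is an assumption, so check the script's usage output:
# List the images currently present in your repository (flag name assumed).
./apigee-pull-push.sh --list
# Show the Cassandra image configured in overrides.yaml for comparison.
grep -n -A 3 "apigee-hybrid-cassandra" overrides.yaml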
Cause 3: Expansion problem
The Cassandra pod might not be able to connect to the Seed node while expanding to a new region.
Diagnosis
- Cassandra pod logs may show log entries similar to the following example:
INFO [main] 2024-07-28 05:25:15,662 GossipingPropertyFileSnitch.java:68 - Unable to load cassandra-topology.properties; compatibility mode disabled
The seed provider lists no seeds.
WARN [main] 2024-07-28 05:25:15,703 SimpleSeedProvider.java:60 - Seed provider couldn't lookup host apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local
Exception (org.apache.cassandra.exceptions.ConfigurationException) encountered during startup: The seed provider lists no seeds.
ERROR [main] 2024-07-28 05:25:15,703 CassandraDaemon.java:803 - Exception encountered during startup: The seed provider lists no seeds.
INFO [ScheduledTasks:1] 2024-07-28 05:25:15,833 StorageService.java:136 - Overriding RING_DELAY to 30000ms
- Check connectivity between the current failing node and the seed node.
- Identify the seed node configured in overrides.yaml for the new region. The seed node is configured in overrides.yaml at the time of region expansion under the field name multiRegionSeedHost. Sample Cassandra stanza from your overrides.yaml showing the multiRegionSeedHost:
cassandra:
  multiRegionSeedHost: "1.2.X.X"
  datacenter: "dc-1"
  rack: "rc-1"
  hostNetwork: false
  clusterName: QA
- Create a client container to check connectivity between the failing Cassandra pod and the seed node by following the instructions in Create a client container for debugging.
- After you ssh into the client container and have a bash shell, use telnet to check connectivity from the current node to the seed node on ports 7001 and 7199. Sample telnet commands and output showing connection failure (an alternative check with netcat is sketched after this list):
telnet 10.0.0.0 7001
Trying 10.0.0.0...
telnet: Unable to connect to remote host: Connection timed out
telnet 10.0.0.0 7199
Trying 10.0.0.0...
telnet: Unable to connect to remote host: Connection timed out
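If telnet is not available in the client container, netcat can perform an equivalent check. This is a minimal sketch, assuming nc is installed in the container and using 10.0.0.0 as a stand-in for your multiRegionSeedHost value:
# Probe the Cassandra internode TLS port (7001) and JMX port (7199) on the seed host.
for port in 7001 7199; do
  nc -vz -w 5 10.0.0.0 "$port" || echo "port $port unreachable"
done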
Resolution
- Work with your cluster admin team to ensure there is network connectivity between Cassandra nodes across all the clusters that are part of the same org.
- Ensure there are no firewall rules blocking traffic from the failing node to seed node.
Cause 4: Port conflict
Cassandra was trying to come up and listen on ports 7000 and 7001, but another service, such as ssh, was already listening on one of those ports.
Diagnosis
The Cassandra pod logs may show entries similar to the following example:
Unable to create ssl socket
Fatal configuration error; unable to start server. See log for stacktrace.
ERROR [main] 2023-02-27 13:01:54,239 CassandraDaemon.java:803 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Unable to create ssl socket
 at org.apache.cassandra.net.MessagingService.getServerSockets(MessagingService.java:701) ~[apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:681) ~[apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:665) ~[apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:831) ~[apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.service.StorageService.initServer(StorageService.java:717) ~[apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.service.StorageService.initServer(StorageService.java:666) ~[apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:395) [apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:633) [apache-cassandra-3.11.9.jar:3.11.9]
 at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:786) [apache-cassandra-3.11.9.jar:3.11.9]
Caused by: java.net.BindException: Address already in use (Bind failed)
Caused by: java.net.BindException: Address already in use (Bind failed)
This indicates that the port is already in use.
Resolution
Stop and remove any service other than Cassandra that is listening on ports 7000 and 7001; this should allow the Cassandra application to come up. The sketch that follows shows one way to identify the conflicting service.
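This is a minimal sketch for inspecting listening sockets on the affected worker node, assuming you have shell (root) access to the node and that ss or netstat is installed:
# Show the processes listening on ports 7000 and 7001 (run on the worker node as root to see process names).
ss -ltnp | grep -E ':7000|:7001'
# Alternative if ss is not available.
netstat -ltnp | grep -E ':7000|:7001'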
Cause 5: Same IP address
An older Cassandra pod with the same IP address exists in the cluster. This situation can arise after uninstalling or reinstalling Apigee hybrid.
Diagnosis
- Review Cassandra's system.log file for the following error:
INFO [HANDSHAKE-/10.106.32.131] 2020-11-30 04:28:51,432 OutboundTcpConnection.java:561 - Handshaking version with /10.10.1.1
Exception (java.lang.RuntimeException) encountered during startup: A node with address /10.10.1.1 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
java.lang.RuntimeException: A node with address /10.10.1.1 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
 at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:558)
 at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:804)
 at org.apache.cassandra.service.StorageService.initServer(StorageService.java:664)
 at org.apache.cassandra.service.StorageService.initServer(StorageService.java:613)
 at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:379)
 at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:602)
 at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:691)
ERROR [main] 2020-11-30 04:28:52,287 CassandraDaemon.java:708 - Exception encountered during startup
java.lang.RuntimeException: A node with address /10.10.1.1 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
- Review the output of the nodetool status command from any of the DCs that are still up to see whether a Cassandra node is shown with the same IP address as in the error message (10.10.1.1). Sample nodetool status command output:
kubectl exec apigee-cassandra-default-0 -n apigee -- nodetool status
Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.1.1  649.6 KiB   256     100.0%            4d96eaf5-7e75-4aeb-bc47-9e45fdb4b533  ra-1
UN  10.10.1.2  885.2 KiB   256     100.0%            261a15dd-1b51-4918-a63b-551a64e74a5e  ra-1
UN  10.10.1.3  693.74 KiB  256     100.0%            91e22ba4-fa53-4340-adaf-db32430bdda9  ra-1
Resolution
- Remove the old Cassandra node by using the nodetool removenode command (a worked example follows this list):
nodetool removenode HOST_ID
The host ID can be found in the output of nodetool status, for example 4d96eaf5-7e75-4aeb-bc47-9e45fdb4b533 in the previous sample nodetool status output.
- Once the older Cassandra node is removed, retry the Apigee hybrid installation.
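Because nodetool runs inside the Cassandra pods, the removal is typically executed through kubectl exec from a Cassandra pod in a DC that is still up. This is a minimal sketch, assuming the apigee namespace and reusing the host ID from the sample output above:
# Remove the stale node from the ring by its host ID (run from a healthy Cassandra pod).
kubectl exec apigee-cassandra-default-0 -n apigee -- nodetool removenode 4d96eaf5-7e75-4aeb-bc47-9e45fdb4b533
# Confirm the node no longer appears in the ring before retrying the installation.
kubectl exec apigee-cassandra-default-0 -n apigee -- nodetool status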
Must gather diagnostic information
If the problem persists even after following the above instructions, gather the following diagnostic information and then contact Google Cloud Customer Care:
- The Google Cloud Project ID.
- The Apigee hybrid organization.
- The overrides.yaml files from both the source and new regions, with any sensitive information masked.
- The output of the commands in Apigee hybrid must-gather.