Galera Cluster Limitations and Recovery

This guide outlines common limitations and issues when hosting a MariaDB Galera Cluster on the platform and provides recovery solutions for problems that may occur.

Galera Cluster Limitations

For a comprehensive list of Galera Cluster limitations, visit the official website. Here are some key limitations relevant to the platform:

Primary Key Requirement

All tables must have a Primary Key. To identify tables without a Primary Key,
run the following query:


select tab.table_schema as database_name,
       tab.table_name
  from information_schema.tables tab
  left join information_schema.table_constraints tco
    on tab.table_schema = tco.table_schema
   and tab.table_name = tco.table_name
   and tco.constraint_type = 'PRIMARY KEY'
 where tco.constraint_type is null
   and tab.table_schema not in ('mysql', 'information_schema', 'sys', 'performance_schema')
   and tab.table_type = 'BASE TABLE'
 order by tab.table_schema,
          tab.table_name;

TIP: All tables must have a primary key, with multi-column primary keys being supported. Tables without a primary key cannot support DELETE operations, and rows in such tables may appear in different orders on different nodes.
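
Once such tables are identified, a primary key can be added with a regular ALTER TABLE statement. The sketch below uses a hypothetical mydb.mytable and placeholder credentials; it adds a surrogate auto-increment key for a table with no natural key.

# mydb.mytable is a hypothetical table without a primary key; adjust names and credentials to your setup.
mysql -uuser -ppass -e "ALTER TABLE mydb.mytable ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;"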

MyISAM Tables: An experimental parameter, wsrep_replicate_myisam, has been added to the configuration file to support MyISAM tables.

TIP: Replication currently works only with the InnoDB storage engine. Writes to tables of other types, including system tables (mysql.*), are not replicated. However, DDL statements that implicitly modify the mysql.* tables, such as CREATE USER, are replicated. There is experimental support for MyISAM via the wsrep_replicate_myisam system variable.
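
To see whether this experimental option is enabled on a node, you can query the variable directly (placeholder credentials; on recent MariaDB releases the option may be superseded by wsrep_mode):

mysql -uuser -ppass -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_replicate_myisam';"
# The variable is dynamic, but keep the setting in the configuration file as well so it survives restarts.
mysql -uuser -ppass -e "SET GLOBAL wsrep_replicate_myisam = ON;"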

Stop/Start/Restart Specifics

Stopping a cluster requires a sequential shutdown of all its nodes. The last container to shut down sets itself in bootstrap mode, and the subsequent cluster startup begins from this node.
The platform streamlines this process, so no manual intervention is needed: you can manage a Galera Cluster like any other environment and initiate start/stop/restart actions via the dashboard. Dedicated events handle all necessary actions, such as sequentially withdrawing nodes from the cluster, in the background.
When restarting a single node in the cluster, the standard actions apply.

Node with Maximum Transactions

It’s crucial to create a backup of the /var/lib/mysql directory on each cluster node before proceeding with any of the actions below.
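
For example, a simple file-level copy of the data directory can be taken as sketched below; the archive path is an arbitrary choice, and mysqld should not be running while the copy is made so that it stays consistent.

# During crash recovery mysqld is usually already stopped; stop it first if it is still running.
service mysql stop
# Archive the data directory (the target path is just an example).
tar -czf /root/mysql-datadir-$(date +%F).tar.gz -C /var/lib mysql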

During cluster recovery, identifying the node with the highest sequence number of the last transaction is essential, as the cluster should start from this node. You can retrieve the sequence number of the last transaction from the seqno value in the /var/lib/mysql/grastate.dat file of each node.


mysql @ node270011-env-0668554 ~ $ cat /var/lib/mysql/grastate.dat | grep seqno
seqno: 1133
mysql @ node270016-env-0668554 ~ $ cat /var/lib/mysql/grastate.dat | grep seqno
seqno: 1134
mysql @ node270017-env-0668554 ~ $ cat /var/lib/mysql/grastate.dat | grep seqno
seqno: 1134

Typically, selecting a node with the highest parameter value suffices. If multiple nodes share the same highest value, opt for any of them, preferably the master container of the layer.

However, if at least one node registers a value of -1, it indicates potential inconsistency among nodes (the parameter resets to -1 during service restart on a non-functional cluster). In such instances, data recovery requires starting mysqld with the --wsrep-recover parameter.


mysqld --wsrep-recover

Retrieve the recovered position by examining the value at the end of the line, after the last colon (e.g., 85340 in the example below).


...
2020-12-24 10:51:15 0 [Note] WSREP: Recovered position: e94ca741-44f5-11eb-9bc4-b2e17ef1657d:85340
...

Compare the recovered positions on all nodes. The node with the highest value should be selected for bootstrap. In case multiple nodes have the highest value, any of them can be chosen. Then, set the safe_to_bootstrap variable to 1 in the grastate.dat file and bootstrap from this node.
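
If the recovery output scrolls by or ends up in the error log, the sequence number can be pulled out with standard text tools. The sketch below assumes the recovery messages were written to /var/log/mysql/mysqld.log, which may differ in your setup.

# Print only the sequence number (the value after the last colon) of the most recent recovered position.
grep -o 'Recovered position: .*' /var/log/mysql/mysqld.log | tail -1 | awk -F: '{print $NF}'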

Starting Cluster after Crash

Verify the state of the MySQL processes on the nodes. Sometimes, these processes may appear as “running,” but they might not respond to normal operations like establishing a connection or stopping them via the init script. If this is the case, you’ll need to manually terminate these hung processes.
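
A minimal sketch of this check, assuming mysqld is the only MySQL-related process expected on the node:

# List any remaining mysqld processes (the [m] trick keeps grep itself out of the output).
ps aux | grep '[m]ysqld'
# If a process ignores the init script, terminate it; escalate to SIGKILL only for truly hung processes.
pkill mysqld
sleep 10
pkill -9 mysqld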

Once you’ve ensured that there are no lingering MySQL processes, proceed to restart all MySQL containers in your cluster.

Before initiating the cluster startup, check the value of the safe_to_bootstrap parameter in the /var/lib/mysql/grastate.dat file. This parameter should be set to 0 to ensure the safe startup of the cluster.

If it’s set to any other value, particularly 1, it could indicate an inconsistency in the cluster state and require further investigation or corrective actions.


grep safe_to_bootstrap /var/lib/mysql/grastate.dat
safe_to_bootstrap: 0

Ensure that the safe_to_bootstrap parameter is set to 1 on the node with the highest sequence number (identified in the previous section) before starting the MySQL process.


sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql/grastate.dat
grep safe_to_bootstrap /var/lib/mysql/grastate.dat
safe_to_bootstrap: 1
service mysql start

Proceed by sequentially starting the MySQL process on the remaining nodes.

service mysql start

Keep in mind that identifying the node with the highest sequence number after a cluster crash might be challenging. In such instances, it’s advisable to set safe_to_bootstrap to 1 on the master node initially.

You can continue with the remaining nodes if MySQL starts successfully on the second node.

If an error occurs, inspect the mysqld.log on the second node. Look for a message resembling the following:


2020-11-19 16:55:20 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():422: Reversing history: 3151891 -> 3150782, this member has applied 1109 more events than the primary component. Data loss is possible. Aborting.

If such a record is present, it indicates that your second node has more transactions than the initially selected one (i.e., the first node where you set safe_to_bootstrap to 1).

In that case, return to the beginning of this section and repeat the procedure, this time setting safe_to_bootstrap to 1 on the second node and bootstrapping the cluster from it.
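
In practice this means clearing the bootstrap flag on the node you started first and promoting the second node instead; a sketch built from the commands shown earlier:

# On the first node (the one bootstrapped initially): stop MySQL and clear the bootstrap flag.
service mysql stop
sed -i 's/safe_to_bootstrap: 1/safe_to_bootstrap: 0/g' /var/lib/mysql/grastate.dat

# On the second node (the one with more applied events): mark it as safe to bootstrap and start it first.
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql/grastate.dat
service mysql start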

Single Node Failure

In the event of a single node failure, often caused by requests that hit the Galera limitations described above, you should:

1. Verify that no MySQL processes are running on the affected node.
2. Change the safe_to_bootstrap parameter to 0 in the /var/lib/mysql/grastate.dat file.
3. Restart the node using the init script.


/etc/init.d/mysql restart

Note: If the workload keeps hitting the Galera Cluster limitations described above, the error may reoccur after some time.
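
Put together, the steps above can look like the following sketch (run on the affected node; paths match those used earlier in this guide):

# 1. Make sure no mysqld processes are left running on the node.
ps aux | grep '[m]ysqld'
# 2. Ensure the node rejoins the existing cluster instead of bootstrapping a new one.
sed -i 's/safe_to_bootstrap: 1/safe_to_bootstrap: 0/g' /var/lib/mysql/grastate.dat
# 3. Restart the service via the init script.
/etc/init.d/mysql restart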

Monitoring Galera Cluster

You can monitor the state and various parameters of the Galera Cluster by executing the SHOW GLOBAL STATUS LIKE command on any node within the cluster. This command allows you to view different aspects of the cluster’s status. For example, to check the cluster size, you can use the following command:


mysql -uuser -ppass -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+

mysql -uuser -ppass -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

mysql -uuser -ppass -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+

Tip: Refer to the official Galera Cluster documentation for more examples and detailed information.

These commands display the current cluster size, the cluster status, and the local node state, respectively. Additionally, if your cluster includes ProxySQL nodes, you can check their status by executing the following request on any ProxySQL node:


mysql -uadmin -padmin -P6032 -h127.0.0.1 -e "select * from runtime_mysql_servers;"
Warning: Using a password on the command line interface can be insecure.
+--------------+----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 2            | node3303 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 3            | node3304 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 3            | node3303 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | node3304 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

This command provides information about the status of each ProxySQL node in the cluster. All nodes should be ONLINE for the cluster to operate effectively.
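
For a quick health check instead of reading the full table, you can count backends that are not ONLINE with a query like the one below (same admin credentials as above); a non-zero result means at least one node needs attention.

mysql -uadmin -padmin -P6032 -h127.0.0.1 -N -e "SELECT COUNT(*) FROM runtime_mysql_servers WHERE status <> 'ONLINE';"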
