My MySQL tips

 

A small thing that brings huge help.

The other day I was writing some code to process a very large number of items coming from a social media API. My items were landing in a queue table in MySQL and then needed to be processed and eventually moved.

The task was not that strange, but what I had to develop was a queue processor. Now, when you need to process a queue, you deal with two kinds of queue: static and dynamic.

The static queue comes in batches of N items in a given time interval and is normally easier to process, given you have a defined number of items that you can split into chunks and process in parallel.

The dynamic is… well... more challenging. One option is to wait to have a predefined number of items, and then process them as if they were a static queue.

But this approach is not very good, because it can significantly delay the processing of an item for all the time it has to wait for the queue to reach the desired size.

The other possibility is to have the processing jobs work on a single item instead of a chunk/batch. But this is not optimal when you have the chance to process the queue in batches and speed up the processing time.

My incoming queue is a bit unpredictable: a mix of fixed-size batches and a few thousand items arriving sparsely, without a clear interval. So there I was, thinking about how to process this and already starting to design a quite complex mechanism to dynamically calculate the size of the possible chunks and the number of jobs, when…

An aside: some colleagues know my habit of reading the whole MySQL manual, from A to Z, at least once a year. It's a way for me to review what is going on and sometimes to dig deeper into some aspects. On top of reading the release notes, this normally also gives me a good level of confidence about new features and other changes.

...When … looking at the documentation for something else, my attention was captured by:

“To avoid waiting for other transactions to release row locks, NOWAIT and SKIP LOCKED options may be used with SELECT ... FOR UPDATE or SELECT ... FOR SHARE locking read statements.”

Wait -  what???

Let me dig in a bit:

“SKIP LOCKED. A locking read that uses SKIP LOCKED never waits to acquire a row lock. The query executes immediately, removing locked rows from the result set.”

Wow, how could I have missed that?

It is also not new; it has been there since MySQL 8.0.1, the milestone release. Having experience with Oracle, I knew what SKIP LOCKED does and how to use it. But I was really not aware that it was also available in MySQL.

In short, SKIP LOCKED allows you to lock a row (or set of them), bypassing the rows already locked by other transactions.

The classic example is:

# Session 1:
mysql> CREATE TABLE t (i INT, PRIMARY KEY (i)) ENGINE = InnoDB;
mysql> INSERT INTO t (i) VALUES(1),(2),(3);
mysql> START TRANSACTION;
mysql> SELECT * FROM t WHERE i = 2 FOR UPDATE;
+---+
| i |
+---+
| 2 |
+---+

# Session 2:
mysql> START TRANSACTION;
mysql> SELECT * FROM t FOR UPDATE SKIP LOCKED;
+---+
| i |
+---+
| 1 |
| 3 |
+---+

But what is important for me is that, given N jobs running, I can bypass all the effort of calculating the dynamic chunks: with SKIP LOCKED that happens anyway, if in a different way.

All good, but what performance will I get using SKIP LOCKED in comparison to the other two solutions?

I ran the following tests on my laptop, so not on a real server, using a small queue processor I wrote on the fly to test these approaches; you can find it on GitHub here.

What I do is to create a table like this:

CREATE TABLE `jobs` (
  `jobid` int unsigned NOT NULL AUTO_INCREMENT,
  `time_in` bigint NOT NULL,
  `time_out` bigint DEFAULT '0',
  `worked_time` bigint DEFAULT '0',
  `processer` int unsigned DEFAULT '0',
  `info` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`jobid`),
  KEY `idx_time_in` (`time_in`,`time_out`),
  KEY `idx_time_out` (`time_out`,`time_in`),
  KEY `processer` (`processer`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Then I will do three different methods of processing:

  1. Simple process, reading and writing a row a time
  2. Use chunks to split my queue
  3. Use SKIP LOCKED

To clarify the difference between the three ways of processing the queue, let us use three simple images:

proc1

In simple processing, each row represents a possible lock that the other processes must wait for.

proc2

In chunk processing, given that each process knows which records to lock, the processes can go in parallel.

proc3

With SKIP LOCKED, even if each process has no idea of which rows it needs to lock, it is enough to state the size of the chunk, and MySQL will return whichever records are available and not already locked.
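
In practice, against the jobs table shown above, each worker could run something like the sketch below. This is only my reconstruction of the idea, not the code of the actual processor on GitHub: the chunk size of 50, the worker id of 5, and the convention that time_out = 0 means "not yet processed" are assumptions I made for the example.

-- Hypothetical worker body: claim up to 50 not-yet-processed jobs,
-- skipping whatever rows other workers have already locked.
START TRANSACTION;

SELECT jobid, time_in, info
  FROM jobs
 WHERE time_out = 0
 ORDER BY jobid
 LIMIT 50
 FOR UPDATE SKIP LOCKED;

-- ...the application processes the returned rows, then marks them as done:
UPDATE jobs
   SET time_out  = UNIX_TIMESTAMP(NOW(3)) * 1000,  -- completion time in ms (assumed unit)
       processer = 5                               -- id of this worker
 WHERE jobid IN (1,2,3);                           -- i.e. the jobids returned above

COMMIT;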

 

I will run the tests first for a static queue, and then for a dynamic queue fed for 20 seconds. Let's see what happens.

Test one - simple processing and static queue:

queue Processor report:
=========================
Start Time                = Tue Jul 28 13:28:25 CEST 2020
End Time                  = Tue Jul 28 13:28:31 CEST 2020
Total Time taken(ms)      =     6325.0
Number of jobs            =          5
Number of Items processed =      15000
Avg records proc time     =     220308

Test 2 use the chunks and static queue

Chunk no loop
--------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:30:18 CEST 2020
End Time                  = Tue Jul 28 13:30:22 CEST 2020
Total Time taken(ms)      =     4311.0
Number of jobs            =          5
Number of Items processed =      15000
Avg records proc time     =     391927

Test three - use SKIP LOCKED and static queue:

SKIP LOCK - no loop
------------------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:32:07 CEST 2020
End Time                  = Tue Jul 28 13:32:11 CEST 2020
Total Time taken(ms)      =     4311.0
Number of jobs            =          5
Number of Items processed =      15000
Avg records proc time     =     366812

So far, so good.

Between the chunk and SKIP LOCKED runs, the time is almost the same (actually, in this test, it's exactly the same), normally fluctuating a bit up and down by a few milliseconds.

Picture 1 

The average execution time per commit fluctuates a bit:

Picture 2

Here, the base execution is faster for the simple reason that the application is processing one record at a time, against a batch of records for the other two.

Now it is time to see what happens if, instead of a static batch, I have a process that fills the queue constantly. If you want to picture what will happen in each test, just imagine this:

  • Some pipes will put water in a huge tank
  • The base solution will try to empty the tank using a small glass of water but acting very fast at each run
  • With Chunk it will wait for the water to reach a specific level, then will use a fixed-size bucket
  • Using SKIP LOCKED, it will constantly look at the tank and will choose the size of the bucket based on the quantity of water present.

To simulate that, I will use five threads to write new items, five to process the queue, and will run the test for 20 seconds.

Here we will have some leftovers; that is, how much water remains in the tank because the application could not empty it with the given bucket. We can say it is a way to measure the efficiency of the processing, where the optimum sees the tank empty.
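
The leftover shown after each dynamic test is simply a count of what is still sitting in the table at the end of the run; assuming, as in the sketch above, that time_out = 0 marks a job that was never picked up, the check would be something like:

SELECT count(*), max(jobid) FROM jobs WHERE time_out = 0;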

Test one - simple processing and dynamic queue:

Basic no loop
--------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:42:37 CEST 2020
End Time                  = Tue Jul 28 13:43:25 CEST 2020
Total Time taken(ms)      =    48586.0
Number of jobs            =          5
Number of Items processed =      15000
Total Loops executed (sum)=         85
Avg records proc time     =     243400

Leftover
+----------+------------+
| count(*) | max(jobId) |
+----------+------------+
|   143863 |     225000 |
+----------+------------+

Test 2 - use the chunks and dynamic queue:

Chunk no loop
--------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:44:56 CEST 2020
End Time                  = Tue Jul 28 13:45:44 CEST 2020
Total Time taken(ms)      =    47946.0
Number of jobs            =          5
Number of Items processed =      15000
Total Loops executed (sum)=         70
Avg records proc time     =     363559

Leftover
+----------+------------+
| count(*) | max(jobId) |
+----------+------------+
|       53 |     195000 |
+----------+------------+

Test 3 - use SKIP LOCKED and dynamic queue:

queue Processor report:
=========================
Start Time                = Tue Jul 28 14:46:45 CEST 2020
End Time                  = Tue Jul 28 14:47:32 CEST 2020
Total Time taken(ms)      =    46324.0
Number of jobs            =          5
Number of Items processed =      15000
Total Loops executed (sum)=       1528
Avg records proc time     =     282658

Leftover
+----------+------------+
| count(*) | max(jobId) |
+----------+------------+
|        0 |       NULL |
+----------+------------+

 Here, the scenario is a bit different than the one we had with the static queue.

Picture 3


Picture 4

Record processing, when comparing chunk and SKIP LOCKED, is again more efficient with the latter. This is because it optimizes the size of the “bucket”, given that it can sometimes process fewer records per loop.

Picture 5

As we can see, when using SKIP LOCKED the application was able to execute 1528 loops to process the queue, against the 70 of the chunk approach and the 85 of the basic one.

In the end, the only solution that was able to empty the tank was the one with SKIP LOCKED.

Conclusion

Processing queues can be simple when we need to process a fixed number of items, but if you need an adaptive approach, then the situation changes. You can find yourself writing quite complex algorithms to optimize the processing.

Using SKIP LOCKED helps you keep the code/solution simple and moves the burden of identifying the records to process onto the RDBMS.

SKIP LOCKED is something that other technologies like Oracle Database and PostgreSQL have already implemented, and their developer communities use it.

The MySQL implementation came a bit later, and the option is not widely known or used in the MySQL developer community, but it should be.

Give it a try and let us know!

NOTE !!

SKIP LOCKED is declared unsafe for statement-based replication; you MUST use ROW-based replication if you use it.
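
A quick way to double-check your setup (ROW is already the default in MySQL 8.0, so the SET statement is only needed if someone changed it):

SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SET GLOBAL binlog_format = 'ROW';   -- and persist it in my.cnf or with SET PERSIST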

References

MySQL 8.0 Hot Rows with NOWAIT and SKIP LOCKED

MySQL 8.0 Reference Manual: Locking Reads

 

 

 

 

Why another article on this, Marco?

Deadlocks are a topic covered many times, with plenty of articles on the web, including from Percona.
I suggest you review the reference section for articles on how to identify deadlocks and where they are generated.
So why another article?
The answer is that exchanges like the following are still very common:

User (John): “Marco, our MySQL is having problems”
Marco: “Ok John, what problems? Can you be a bit more specific?”
John: “Our log scraper is reporting that MySQL has a lot of errors”
Marco: “Ok, can you share the MySQL log so I can review it?”
John: “The errors are in the application log, I will share one application log”

Marco reviews the log, and in it he finds:

“ERROR 1213 (40001): Deadlock found when trying to get lock;
try restarting transaction”

Marco's reaction is: "Oh my ..." (headdesk)

Why? Because deadlocks are not what the message expresses, and of course we have a problem of mindset and, last but not least, of terminology.

In this very short article, I will try to change your point of view about deadlocks.

What is a deadlock?

A deadlock is a situation wherein two or more competing actions are waiting for the other to finish; as a consequence, neither ever does.
In computer science, deadlock refers to a specific condition where two or more processes are each waiting for the other to release a resource.
In order for a deadlock to happen, four conditions (the Coffman conditions) must exist:

  • Mutual exclusion: at least one resource must be held in a non-shareable mode; otherwise, the processes would not be prevented from using the resource when necessary. Only one process can use the resource at any given instant of time.
  • Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
  • No preemption: a resource can be released only voluntarily by the process holding it.
  • Circular wait: each process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource.

All the above are conditions that are not bound to RDBMSs only, but apply to any system dealing with transactional data processing. In any case, it is a fact that today deadlocks are mostly unavoidable, unless you prevent one of the above conditions from happening without compromising the integrity of the system's execution. Breaking or ignoring one of the above rules, especially for an RDBMS, could affect data integrity, which would go against the very reason an RDBMS exists.

Just to help us contextualize, let us review a simple case of deadlock.
Say I have MySQL with the world schema loaded, and I have TWO transactions running, both looking at the same two cities in Tuscany (Firenze and Prato) but in a different order.

mysql> select * from City where CountryCode = 'ITA' and District='Toscana';
+------+---------+-------------+----------+------------+
| ID   | Name    | CountryCode | District | Population |
+------+---------+-------------+----------+------------+
| 1471 | Firenze | ITA         | Toscana  | 376662     | <---
| 1483 | Prato   | ITA         | Toscana  | 172473     | <--- ...
+------+---------+-------------+----------+------------+

And both transactions are updating the population: 

Connection 1 will have:

connection1 > start transaction;
Query OK, 0 rows affected (0.01 sec)

connection1 > select * from City where ID=1471;
+------+---------+-------------+----------+------------+
| ID   | Name    | CountryCode | District | Population |
+------+---------+-------------+----------+------------+
| 1471 | Firenze | ITA         | Toscana  | 376662     |
+------+---------+-------------+----------+------------+
1 row in set (0.00 sec)

connection1 > update City set Population=Population + 1 where ID = 1471;
Query OK, 1 row affected (0.05 sec)
Rows matched: 1 Changed: 1 Warnings: 0

connection1 > update City set Population=Population + 1 where ID = 1483;
Query OK, 1 row affected (2.09 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Connection 2 will have:
connection 2 >start transaction;
Query OK, 0 rows affected (0.01 sec)

connection 2 >select * from City where ID=1483;
+------+-------+-------------+----------+------------+
| ID   | Name  | CountryCode | District | Population |
+------+-------+-------------+----------+------------+
| 1483 | Prato | ITA         | Toscana  | 172473     |
+------+-------+-------------+----------+------------+
1 row in set (0.01 sec)

connection 2 >update City set Population=Population + 1 where ID = 1483;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

connection 2 >update City set Population=Population + 1 where ID = 1471;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

This is a very simple example of deadlock detection.
An image may help:

dl ff 1 

 

If we stop for a second and ignore the word “ERROR” in the message, what is really happening is that MySQL is preventing us from modifying the data in the wrong way. If the locks were not in place, one of the two transactions would modify the population, incrementing a number that is not valid anymore.
The right thing to do is to abort one of the two transactions and NOTIFY the application that, if it really needs to perform the action (in this case, increase the population), it had better redo the execution and make sure it is still required.
Just think: it could happen that the application re-runs transaction 2 and identifies that there is no need to increase the value, because it is already what it is supposed to be.
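
As a side note, not part of the exchange above: if you want to see what InnoDB recorded about the latest deadlock, or log every deadlock instead of only the last one, the standard tools are:

SHOW ENGINE INNODB STATUS\G                    -- see the "LATEST DETECTED DEADLOCK" section
SET GLOBAL innodb_print_all_deadlocks = ON;    -- log every deadlock to the error log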

Think of calculating the financial situation of your company while you and a colleague are processing the same data but for different tasks.
Without locks and deadlocks you may end up corrupting each other's interpretation of the data and performing wrong operations. As a result, you may end up paying the wrong salaries, or worse.

Given that, and more, deadlocks (and locks) need to be seen as friends helping us keep our data consistent.
The problem arises when we have applications that are poorly designed and developed, and it is made worse by the wrong terminology (in my opinion) used in MySQL.

Let us start with MySQL: deadlock detection detects an intrinsic, inevitable condition in the RDBMS/ACID world. As such, defining it as an ERROR is totally misleading. A deadlock is a CONDITION, and its natural conclusion is the abortion of one of the transactions that caused it.
The message should be a NOTIFICATION, not an ERROR.

The problem on the application side, instead, is that the isolation and validation of the data is normally delegated to the RDBMS, which is fine. But then only seldom do we see applications able to deal with messages like lock wait timeouts or deadlocks. This is of course a huge pitfall, because while it is natural to have the RDBMS deal with data consistency, it is not, and should not be, responsible for the retry, which is bound to the application logic.
Nowadays we have a lot of applications that require very fast execution, and locks and deadlocks are seen as enemies because they have a cost in time.

But this is a mistake, a design mistake. If you prefer speed over data consistency, then you should not use an RDBMS, which must respect specific rules at any (time) cost.
Other, eventually consistent, data stores will be more appropriate in your case.

If, instead, you care about your data, then you need to listen to your RDBMS and write your code in a way that gets all the benefit out of it, also when it comes to deadlocks.

Conclusion

Deadlocks (and locks), should be seen as friends. They are mechanisms that exist to keep our data consistent. We should not bypass them unless willing to compromise our data.

As previously indicated, if you want to understand in detail how to diagnose a deadlock, review the links in the references.

References
https://www.percona.com/blog/2012/09/19/logging-deadlocks-errors/
https://www.percona.com/blog/2014/10/28/how-to-deal-with-mysql-deadlocks/
https://www.percona.com/community-blog/2018/09/24/minimize-mysql-deadlocks-3-steps/

 

Back to part 1

Now that we have seen how to set up our environment with PXC8, it is time to see how our setup behaves and what can be done in case of issues.

We will now analyse the following common situations:
- Put a node in maintenance or remove a node and insert it back
- node crash
- network failure
- proxysql node crash

The following tests were done using a Java application connecting with a straight JDBC connection and no connection pooling. I chose to use that rather than a benchmarking tool such as sysbench because I was looking for something as close to a real application as possible, rather than a benchmark. In particular, I was interested in having dependencies across requests, real data retrieval, and error management.

As soon as we start the scheduler script, it checks the nodes and identifies that we have writer_is_also_reader=0, so it removes the current writer's entry from the reader host group:

2020/07/03 06:43:50.130:[INFO] Writer is also reader disabled removing node from reader Hostgroup 192.168.1.5;3306;201;5000 Retry #1
2020/07/03 06:43:50.138:[WARN] DELETE node:192.168.1.5;3306;201;5000 SQL: DELETE from mysql_servers WHERE hostgroup_id=201 AND hostname='192.168.1.5' AND port='3306'
2020/07/03 06:43:50.143:[INFO] END EXECUTION Total Time(ms):277.705907821655

Test 1 Put a node in maintenance

PXC has a very useful feature pxc_maint_mode to deal with maintenance and to notify applications and midlevel architectural blocks (such as ProxySQL) that a node is going to be under maintenance.

With pxc_maint_mode you can specify the maintenance mode for taking a node down without adjusting settings in ProxySQL.

The following values are available:

  • DISABLED: This is the default state that tells ProxySQL to route traffic to the node as usual.
  • SHUTDOWN: This state is set automatically when you initiate node shutdown.
  • MAINTENANCE: You can manually change to this state if you need to perform maintenance on a node without shutting it down.
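
As a quick reference (the exact commands are shown again during the tests), you can check and change the mode with the statements below. The 10-second wait you will notice in the outputs that follow comes from pxc_maint_transition_period, which defaults to 10 seconds and gives clients time to finish their work:

show global variables like 'pxc_maint%';
set global pxc_maint_mode=maintenance;   -- put the node in maintenance
set global pxc_maint_mode=disabled;      -- bring it back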

The first test is to put a reader in maintenance and then bring it back.
This is the current scenario in ProxySQL:

+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| 10000  | 200          | 192.168.1.5 | ONLINE | 2        | 3        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE | 2        | 15       | 17     | 0       | 17          |
| 10000  | 201          | 192.168.1.6 | ONLINE | 0        | 12       | 12     | 0       | 12          |
| 998    | 8200         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
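
For reference, the status tables shown throughout this article come from the ProxySQL admin interface; a query along these lines (my reconstruction, not necessarily the exact one I used while testing) combines the configured weight with the connection pool statistics:

SELECT a.weight, b.hostgroup AS hostgroup_id, b.srv_host, b.status,
       b.ConnUsed, b.ConnFree, b.ConnOK, b.ConnERR, b.MaxConnUsed
  FROM runtime_mysql_servers a
  JOIN stats_mysql_connection_pool b
    ON a.hostgroup_id = b.hostgroup
   AND a.hostname     = b.srv_host
   AND a.port         = b.srv_port
 ORDER BY b.hostgroup, b.srv_host;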

Putting down the 192.168.1.6 node:

(root localhost) [(none)]>set global pxc_maint_mode=maintenance;
Query OK, 0 rows affected (10.00 sec)
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status       | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| 10000  | 200          | 192.168.1.5 | ONLINE       | 0        | 5        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE       | 1        | 15       | 17     | 0       | 17          |
| 10000  | 201          | 192.168.1.6 | OFFLINE_SOFT | 0        | 0        | 12     | 0       | 12          | 
| 998    | 8200         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+

The node is in OFFLINE_SOFT; as such, it will allow any existing connection to finish while not accepting any new ones.
We can wait for the running connections to end, then do whatever kind of maintenance we need without affecting our production.

Once done we can move the node back:

[(none)]>set global pxc_maint_mode=disabled;
Query OK, 0 rows affected (0.00 sec)

In the script log we will see that the node is identified as ready to be put back:

2020/07/03 06:58:23.871:[INFO] Evaluate nodes state 192.168.1.6;3306;201;1000 Retry #1
2020/07/03 06:58:23.882:[WARN] Move node:192.168.1.6;3306;201;1000 SQL: UPDATE mysql_servers SET status='ONLINE' WHERE hostgroup_id=201 AND hostname='192.168.1.6' AND port='3306'
2020/07/03 06:58:23.887:[INFO] END EXECUTION Total Time(ms):265.045881271362

In all of this, we never had a moment of service interruption.

The above test was the easy one.

Let us now see what happens if we put the writer in maintenance.


This is a much more impactful action, given that the node is accepting write transactions and the cluster is working in single-writer mode.
Let us put the writer 192.168.1.5 in maintenance:

set global pxc_maint_mode=maintenance;
Query OK, 0 rows affected (10.00 sec)

And in a few seconds:

+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status       | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE       | 3        | 2        | 5      | 0       | 5           |
| 10000  | 200          | 192.168.1.5 | OFFLINE_SOFT | 0        | 0        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE       | 4        | 12       | 17     | 0       | 17          |
| 998    | 8200         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+

 What happened?

In the script log:

...
2020/07/03 08:11:22.110:[INFO] END EXECUTION Total Time(ms):231.402158737183
2020/07/03 08:11:24.299:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 1
2020/07/03 08:11:24.307:[WARN] Move node:192.168.1.5;3306;200;3020 SQL: UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=200 AND hostname='192.168.1.5' AND port='3306'
2020/07/03 08:11:24.313:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
This Node Try to become the new WRITER for HG 200 Server details: 192.168.1.6:3306:HG8200
2020/07/03 08:11:24.313:[INFO] This Node Try to become a WRITER promoting to HG 200 192.168.1.6:3306:HG 8200
2020/07/03 08:11:24.313:[WARN] DELETE from writer group as: SQL:DELETE from mysql_servers where hostgroup_id in (200,9200) AND STATUS = 'ONLINE'
2020/07/03 08:11:24.720:[WARN] Move node:192.168.1.6:33069992000 SQL:INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.1.6',200,3306,999,2000);
2020/07/03 08:11:24.720:[WARN] !!!! FAILOVER !!!!!
Cluster was without WRITER I have try to restore service promoting a node
2020/07/03 08:11:24.720:[INFO] END EXECUTION Total Time(ms):685.911178588867
...

The script identifies the need to shift production to another node.
It sets the current writer as offline in the staging environment and identifies which node from the special backup group 8200 is the appropriate replacement.

Then it pushes all the changes to runtime.
All is done in ~700 ms, so the whole process takes less than a second, and production was not impacted.

Test 2 writer node crash

Note: In the text below, MySQL is set to work on CEST while the system is on EDT.

This is of course a much more impactful scenario: we need to take into account not only the situation of the node in ProxySQL, but also the need for PXC to rebuild the primary component view, get quorum, and so on.

Initial picture:

+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status       | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE       | 3        | 2        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE       | 4        | 12       | 17     | 0       | 17          |
| 998    | 8200         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+

 

Abruptly killing the writer with a kill -9 signal:

[root@galera1h1n6 gal8_2]# ps aux|grep gal8_2
root 5265 0.0 0.0 113452 876 ? S Jun26 0:00 /bin/sh /opt/mysql_templates/PXC8/bin/mysqld_safe --defaults-file=/opt/mysql_instances/gal8_2/my.cnf
mysql 8745 7.7 72.9 9355104 5771476 ? Sl Jun26 882:54 /opt/mysql_templates/PXC8/bin/mysqld

The PXC cluster starts to identify the issue:

2020-07-04T10:05:41.011531Z 0 [Note] [MY-000000] [Galera] (27260297, 'tcp://192.168.1.5:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.6:4567

And the script does as well:

2020/07/04 06:05:41.000:[ERROR] Cannot connect to DBI:mysql:host=192.168.1.6;port=3306 as monitor
2020/07/04 06:05:41.001:[ERROR]  Node is not responding setting it as SHUNNED (ProxySQL bug - #2658)192.168.1.6:3306:HG200

2020/07/04 06:05:41.052:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 1
2020/07/04 06:05:41.061:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
 This Node Try to become the new WRITER for HG 200 Server details: 192.168.1.5:3306:HG8200
2020/07/04 06:05:41.062:[INFO] This Node Try to become a WRITER promoting to HG 200 192.168.1.5:3306:HG 8200
2020/07/04 06:05:41.062:[WARN]  DELETE from writer group as:  SQL:DELETE from mysql_servers where hostgroup_id in (200,9200) AND STATUS = 'ONLINE'

As said, the cluster also needs to rebuild the cluster view and get a quorum (see below):

2020-07-04T10:05:45.685154Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,27260297,16)
memb {
	27260297,1
	7e4d3144,1
	}
joined {
	}
left {
	}
partitioned {
	3eb94984,1
	}
)

As soon as the view is available the script can perform the failover:

2020/07/04 06:05:46.318:[WARN] Move node:192.168.1.5:330610002000 SQL:INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.1.5',200,3306,1000,2000);
2020/07/04 06:05:46.318:[WARN] !!!! FAILOVER !!!!!
Cluster was without WRITER I have try to restore service promoting a node
2020/07/04 06:05:46.318:[INFO] END EXECUTION Total Time(ms):5551.42211914062

The whole exercise takes 5 seconds.

Which for a server crash is not bad at all.

In my case, given that the Java application was designed to deal with minimal service interruptions using a retry loop, I did not get any errors; but this depends on how you wrote the application layer.

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 1000   | 200          | 192.168.1.5 | ONLINE  | 0        | 4        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE  | 0        | 26       | 30     | 0       | 28          |
| 998    | 8200         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

The node, even if SHUNNED in the special groups, will be managed normally and, when needed, put back ONLINE.

Crash of a reader

Let us now do the same with a reader:

[root@galera1h1n7 gal8_3]# date; kill -9 5259 8739
Sat Jul 4 06:43:46 EDT 2020

The script will fail to connect:

2020/07/04 06:43:46.923:[ERROR] Cannot connect to DBI:mysql:host=192.168.1.7;port=3306 as monitor
2020/07/04 06:43:46.923:[ERROR]  Node is not responding setting it as SHUNNED (ProxySQL bug - #2658)192.168.1.7:3306:HG8201

2020-07-04T10:43:47.998377Z 0 [Note] [MY-000000] [Galera] (27260297, 'tcp://192.168.1.5:4567') reconnecting to 7e4d3144 (tcp://192.168.1.7:4567), attempt 0

The node is internally set to SHUNNED by the script, given it is not accessible, while waiting for ProxySQL to take action and shun it as well. All read requests are still managed by ProxySQL.
The final picture will be:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 1000   | 200          | 192.168.1.5 | ONLINE  | 0        | 5        | 6      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | SHUNNED | 0        | 0        | 30     | 98      | 28          |
| 1000   | 201          | 192.168.1.6 | ONLINE  | 1        | 24       | 25     | 0       | 25          |
| 998    | 8200         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

 No service interruption from the read side thanks to ProxySQL.

Test 3 network failure

Network failures are always tricky; they can happen as a single short event, or last for a large but unpredictable window of time.
The network can slow down first, such that one node affects the whole cluster, and then fully recover, leaving you with very limited data to understand what happened and why.

Given that, it is important to try to take action to limit the negative effects as much as possible.
That level of action is beyond the scope of a scheduler script, which is in charge only of the layout of the nodes.
What the script should do is not solve the possible impact at the PXC level, but reduce the possible confusion in distributing traffic with ProxySQL.

The initial picture is:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 1000   | 200          | 192.168.1.5 | ONLINE  | 0        | 5        | 6      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | SHUNNED | 0        | 0        | 30     | 98      | 28          |
| 1000   | 201          | 192.168.1.6 | ONLINE  | 1        | 24       | 25     | 0       | 25          |
| 998    | 8200         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

On node 192.168.1.5 I will stop the network interface.

enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.5 netmask 255.255.255.0 broadcast 192.168.1.255
[vagrant@galera1h1n5 ~]$ date;sudo ifdown enp0s8
Sat Jul 4 13:20:32 EDT 2020
Device 'enp0s8' successfully disconnected.

As soon as I stop it, the PXC cluster identifies that the node is not reachable:

2020-07-04T17:20:35.980786Z 0 [Note] [MY-000000] [Galera] (7e4d3144, 'tcp://192.168.1.7:4567') connection to peer 27260297 with addr tcp://192.168.1.5:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)

The script is running and must first time out its connection attempt (6 seconds), while the pid-file fencing prevents other instances of the check from starting:

2020/07/04 13:20:36.729:[ERROR] Another process is running using the same HostGroup and settings,
Or orphan pid file. check in /tmp/proxysql_galera_check_200_W_201_R.pid
2020/07/04 13:20:38.806:[ERROR] Another process is running using the same HostGroup and settings,
Or orphan pid file. check in /tmp/proxysql_galera_check_200_W_201_R.pid

Finally, the script's connection to the node times out and the failover starts:

2020/07/04 13:20:40.699:[ERROR] Cannot connect to DBI:mysql:host=192.168.1.5;port=3306;mysql_connect_timeout=6 as monitor
2020/07/04 13:20:40.699:[ERROR] Node is not responding setting it as SHUNNED (internally) (ProxySQL bug - #2658)192.168.1.5:3306:HG200
2020/07/04 13:20:40.804:[WARN] Fail-over in action Using Method = 1
2020/07/04 13:20:40.805:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
This Node Try to become the new WRITER for HG 200 Server details: 192.168.1.6:3306:HG8200
2020/07/04 13:20:40.805:[INFO] This Node Try to become a WRITER promoting to HG 200 192.168.1.6:3306:HG 8200
2020/07/04 13:20:40.805:[WARN] DELETE from writer group as: SQL:DELETE from mysql_servers where hostgroup_id in (200,9200) AND STATUS = 'ONLINE'
2020/07/04 13:20:42.622:[WARN] Move node:192.168.1.6:33069992000 SQL:INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.1.6',200,3306,999,2000);
2020/07/04 13:20:42.623:[WARN] !!!! FAILOVER !!!!!
Cluster was without WRITER I have try to restore service promoting a node
2020/07/04 13:20:42.623:[INFO] END EXECUTION Total Time(ms):7983.0858707428

The script completes the failover due to the network failure in 7 seconds, about 10 seconds from the start of the network issue.

The final picture:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE  | 0        | 5        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE  | 3        | 16       | 60     | 721     | 28          |
| 998    | 8200         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

Given the level of complexity a network failure brings, I think this is a pretty good result, even if it is inevitable to get some application errors for the connections that were in flight; there is no way to prevent that, except to re-run the whole transaction.

Test 4 ProxySQL node crash

The last test is: what happens if a ProxySQL node crashes? We have set a VIP that we use to connect to ProxySQL (192.168.1.194) and configured Keepalived to deal with moving the VIP across nodes; will that be efficient enough?

Let us try killing one of the ProxySQL nodes.

Picture before:

+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE | 3        | 1        | 4      | 0       | 4           |
| 10000  | 201          | 192.168.1.7 | ONLINE | 3        | 7        | 10     | 0       | 10          |
| 1000   | 201          | 192.168.1.5 | ONLINE | 0        | 2        | 2      | 0       | 2           |
| 998    | 8200         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+

I connect to the ProxySQL node currently in charge of the VIP and:

[root@proxy1 proxysql]# date;kill -9 24890 24891
Sat Jul 4 14:02:29 EDT 2020

In the system log of the next proxy server in the chain (a different machine), Keepalived identifies that the node is down and swaps the VIP:

Jul 4 14:02:31 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) forcing a new MASTER election
Jul 4 14:02:32 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) Transition to MASTER STATE
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) Entering MASTER STATE
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) setting protocol VIPs.
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: Sending gratuitous ARP on enp0s8 for 192.168.1.194
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) Sending/queueing gratuitous ARPs on enp0s8 for 192.168.1.194
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: Sending gratuitous ARP on enp0s8 for 192.168.1.194
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: Sending gratuitous ARP on enp0s8 for 192.168.1.194

The application identifies the few connections that were closed unexpectedly and tries to restore them:

20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it
20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it
20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it
20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it

So, in a few seconds, ProxySQL on the proxy2 node is up and running and able to manage the traffic:

+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE | 0        | 6        | 6      | 0       | 6           |
| 10000  | 201          | 192.168.1.7 | ONLINE | 3        | 29       | 32     | 0       | 32          |
| 1000   | 201          | 192.168.1.5 | ONLINE | 0        | 2        | 2      | 0       | 2           |
| 998    | 8200         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+ 

Full failover done in 4 seconds!!!!

This too sounds like quite a good achievement to me.

Conclusions

We can achieve a very highly customisable setup with the use of the scheduler + script in ProxySQL.

The flexibility we get is such that we can create a highly available solution without having to accept the compromises imposed by the native Galera support.

All this without reducing the efficiency of the solution.
In terms of time we had (all times may have a maximum additional +2 seconds due to the scheduler interval we defined):

  • Maintenance on writer - All is done in ~700 ms no service interruption.
  • Writer crash - the whole failover takes 5 seconds; transactions may need to be retried.
  • Reader crash - no service interruption
  • Network issue (with PXC cluster quorum) between 7 and 10 seconds
  • ProxySQL node crash - 4 seconds to fail over another ProxySQL node

Of course, applications must be designed to be fault tolerant and to retry the transactions if any problem arises; I would say this goes without saying and is part of any programming best practices.
If you do not do that and fully rely on the data layer, you had better review your development team's practices.

 

In real life, there are frequent cases where getting a running application to work correctly is strongly dependent on consistent write/read operations. This is no issue when using a single data node as a provider, but it becomes more concerning and challenging when adding additional nodes for high availability and/or read scaling.

In the MySQL dimension, I have already described it here in my blog Dirty Reads in High Availability Solution.

We go from the most loosely-coupled database clusters with primary-replica async replication, to the fully tightly-coupled database clusters with NDB Cluster (MySQL/Oracle).

Adding components like ProxySQL to the architecture can, from one side, help in improving high availability, and from the other, it can amplify and randomize the negative effect of a stale read. As such it is crucial to know how to correctly set up the environment to reduce the risk of stale reads, without reducing the high availability.

This article covers a simple HOW-TO for Percona XtraDB Cluster 8.0 (PXC) and ProxySQL, providing an easy-to-follow guide to obtain no stale reads, without having to renounce read scaling or a high grade of HA, thanks to PXC8.

The Architecture

The covered architecture is based on:

  • A PXC8 cluster composed of 3 nodes
  • ProxySQL v2 nodes in a cluster, to avoid a single point of failure
  • A virtual IP with Keepalived (see here). If you prefer to use your already-existing load balancer, feel free to do so.
  • N application nodes, referring to the VIP

Installation


Install PXC8

Install ProxySQL

And finally, set the virtual IP as illustrated in the article mentioned above. It is now time to take the first step towards the non-stale-read solution.

 

Covering Stale Reads


With PXC, we can easily prevent stale reads by setting the wsrep_sync_wait parameter to one of the following values: 1, 3, 5, or 7 (default = 0).
We will see what each value changes in more detail in part 3 of the blog, to be published soon.
For now, just set it to wsrep_sync_wait = 1.
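
For instance, on each PXC node (a minimal sketch; also persist the setting in my.cnf so it survives a restart):

set global wsrep_sync_wait=1;

-- and in my.cnf, under [mysqld]:
--   wsrep_sync_wait = 1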

The cluster will ensure consistent reads no matter from which node you will write and read.

This is it. So simple!

 

ProxySQL Requirements


The second step is to be sure we set up our ProxySQL nodes to use:

  • One writer at a time, to reduce the certification conflicts and brute force aborts
  • Avoid including the writer in the reader group
  • Respect the order I am setting for failover, in case of need

Now here we have a problem; ProxySQL v2 comes with very interesting features like SSL frontend/backend, support for AWS Aurora …and more. But it also comes with very poor native PXC support. I have already raised this in my old article on February 19, 2019, and raised other issues with discussions and bug reports.

In short, we cannot trust ProxySQL for a few factors:

  • The way it deals with the nodes failover/failback is not customizable
  • The order of the nodes is not customizable
  • As of this writing, the support to have the writer NOT working as a reader is broken

In the end, the reality is that in order to support PXC/Galera, the use of an external script using the scheduler is more flexible, solid, and trustworthy. As such, the decision is to ignore the native Galera support, and instead focus on the implementation of a more robust script.

For the scope of this article, I have reviewed, updated, and extended my old script.

Percona had also developed a Galera checker script that was part of the ProxySQL-Admin-Tools suite, but that has now been externalized and is available in the PerconaLab GitHub.

 

Setting All Blocks

The setup for this specific case will be based on:

  • Rules to perform read-write split.
  • One host group to define the writer HG 200
  • One host group to define the reader HG 201
  • One host group to define candidate writers HG 8200
  • One host group to define candidate readers HG 8201

The final architecture will look like this:

ProxySQL Nodes:

Node1: 192.168.4.191 public IP / 10.0.0.191 internal IP
Node2: 192.168.4.192 public IP / 10.0.0.192 internal IP
Node3: 192.168.4.193 public IP / 10.0.0.193 internal IP

VIP 192.168.4.194 public ip

PXC8 Nodes:

pxc1 10.0.0.22
pxc2 10.0.0.23
pxc3 10.0.0.33

Let us configure PXC8 first. Operation one is to create the user that ProxySQL and the script will use to access the PXC cluster for monitoring.

CREATE USER monitor@'10.0.%' IDENTIFIED BY '';
GRANT USAGE ON *.* TO monitor@'10.0.%';
GRANT SELECT ON performance_schema.* TO monitor@'10.0.%';
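
On the ProxySQL side, the matching monitor credentials must be set as well, if you have not done it already; the article does not show this step, so the following is only a sketch with a placeholder password:

update global_variables set variable_value='monitor' where variable_name='mysql-monitor_username';
update global_variables set variable_value='<monitor_password>' where variable_name='mysql-monitor_password';
LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;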


The second step is to configure ProxySQL as a cluster:

Add a user able to connect remotely. This will require the ProxySQL nodes to be restarted.

update global_variables set Variable_Value='admin:admin;cluster1:clusterpass'  where Variable_name='admin-admin_credentials';
SAVE ADMIN VARIABLES TO DISK;

systemctl restart proxysql

Do this on all ProxySQL nodes, in rotation.

The third part is to set the variables below.

Please note that the value for admin-cluster_mysql_servers_diffs_before_sync is not standard and is set to 1.

 

update global_variables set variable_value='cluster1' where variable_name='admin-cluster_username';
update global_variables set variable_value='clusterpass' where variable_name='admin-cluster_password';

update global_variables set variable_value=1 where variable_name='admin-cluster_mysql_servers_diffs_before_sync';
update global_variables set Variable_Value=0  where Variable_name='mysql-hostgroup_manager_verbose';
update global_variables set Variable_Value='true'  where Variable_name='mysql-query_digests_normalize_digest_text';
update global_variables set Variable_Value='8.0.19'  where Variable_name='mysql-server_version';
LOAD ADMIN VARIABLES TO RUN;SAVE ADMIN VARIABLES TO DISK;
LOAD MYSQL VARIABLES TO RUN;SAVE MYSQL VARIABLES TO DISK;

It is now time to define the ProxySQL cluster nodes:

INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.191',6032,100,'PRIMARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.192',6032,100,'SECONDARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.193',6032,100,'SECONDARY');
LOAD PROXYSQL SERVERS TO RUN;SAVE PROXYSQL SERVERS TO DISK;

Check the ProxySQL logs and you should see that the nodes are now linked:

2020-05-25 09:24:30 [INFO] Cluster: clustering with peer 192.168.4.192:6032 . Remote version: 2.1.0-159-g0bdaa0b . Self version: 2.1.0-159-g0bdaa0b
2020-05-25 09:24:30 [INFO] Cluster: clustering with peer 192.168.4.193:6032 . Remote version: 2.1.0-159-g0bdaa0b . Self version: 2.1.0-159-g0bdaa0b
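
Another quick way to verify that the cluster members see each other (my own habit, not a step from the original flow) is to query the cluster tables from any node's admin interface:

select * from runtime_proxysql_servers;
select * from stats_proxysql_servers_checksums;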

 

Once this is done let us continue the setup, adding the PXC nodes and all the different host groups to manage the architecture:

delete from mysql_servers where hostgroup_id in (200,201);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',200,3306,10000,2000,'default writer');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.23',201,3306,10000,2000,'reader');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.33',201,3306,10000,2000,'reader');        
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;    
    
delete from mysql_servers where hostgroup_id in (8200,8201);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',8200,3306,1000,2000,'Writer preferred');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.23',8200,3306,999,2000,'Second preferred');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.33',8200,3306,998,2000,'Third and last in the list');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',8201,3306,1000,2000,'reader setting');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.23',8201,3306,1000,2000,'reader setting');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.33',8201,3306,1000,2000,'reader setting');       
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

You can see that, as mentioned, we have two host groups to manage the cluster: 8200 and 8201.

Those two host groups work as templates and will be changed only by us, manually.

The 8200 host group weight defines the order of the writers, from higher to lower.

Given that, node 10.0.0.22 with weight 1000 is the preferred writer.

At the moment of writing, I chose to NOT implement automatic fail-back.

I will illustrate later how to trigger that manually.

Once we have all the servers up, let's move on and create the users:

 

insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('app_test2','test',1,200,'mysql',1,'application test user');
insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('dba','dbapw',1,200,'mysql',1,'generic dba for application');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;

And the query rules to have Read/Write split:

insert into mysql_query_rules (rule_id,proxy_port,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,200,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,destination_hostgroup,active,retries,match_digest,apply) values(1042,6033,201,1,3,'^SELECT.*$',1);
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
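Later, once the application starts sending traffic, an easy way to verify that the read/write split behaves as expected is to check on which host group the query digests land. A minimal sketch against the standard ProxySQL stats table:

SELECT hostgroup, schemaname, SUBSTR(digest_text,1,50) AS query, count_star FROM stats_mysql_query_digest ORDER BY count_star DESC LIMIT 10;

Writes and SELECT ... FOR UPDATE should land on host group 200, while plain SELECTs should go to 201.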

The final step is to set the scheduler:

INSERT  INTO scheduler (id,active,interval_ms,filename,arg1) values (10,0,2000,"/var/lib/proxysql/galera_check.pl","-u=cluster1 -p=clusterpass -h=127.0.0.1 -H=200:W,201:R -P=6032  --main_segment=1 --debug=0  --log=/var/lib/proxysql/galeraLog --active_failover=1 --single_writer=1 --writer_is_also_reader=0");
LOAD SCHEDULER TO RUNTIME;SAVE SCHEDULER TO DISK;

 
Let us analyze the script parameters:

The scheduler ID. id: 10
As a best practice, always keep the scheduler script inactive by default and enable it only when needed. active: 0
The interval is how often the scheduler should execute the script; it needs to be often enough to reduce the time the service is in a degraded state, but not so often as to add noise. An interval of two seconds is normally a good start. interval_ms: 2000
The location of the script, which must be set as executable. filename: /var/lib/proxysql/galera_check.pl

Given the scheduler limitation of five arguments, we collapse all the parameters into one and let the script parse them. arg1: -u=cluster1 -p=clusterpass -h=127.0.0.1 -H=200:W,201:R -P=6032 --retry_down=2 --retry_up=1 --main_segment=1 --debug=0 --log=/var/lib/proxysql/galeraLog --active_failover=1 --single_writer=1 --writer_is_also_reader=0

The parameters we pass here are:

The credentials to connect to ProxySQL: -u=cluster1 -p=clusterpass -h=127.0.0.1 -P=6032
The host group definition: -H=200:W,201:R. This setting is necessary because you can have multiple scripts running, each serving a different cluster.
The retry settings, which reduce the risk of false positives (say a network hiccup or another momentary event against which you do not want to take action): --retry_down=2 --retry_up=1

Given the script is segment-aware, you need to declare the main segment that is serving the applications: --main_segment=1
The log location/name; the final name will be the combination of this plus the host groups (i.e. galeraLog_200_W_201_R.log): --log=/var/lib/proxysql/galeraLog
Whether the script should deal with failover, and of what type (read the documentation/help for details): --active_failover=1
Whether the script should support a SINGLE writer (the recommended default) or multiple writer nodes: --single_writer=1
Whether the writer(s) should also work as readers or be fully dedicated to writes: --writer_is_also_reader=0
Once we are confident our settings are right, let us put the script in production: 

update scheduler set active=1 where id=10;
LOAD SCHEDULER TO RUNTIME;
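To double check that the scheduler entry is really active at runtime, you can query the runtime table (a minimal sketch) and then follow the script log, which in this setup will be /var/lib/proxysql/galeraLog_200_W_201_R.log:

SELECT id, active, interval_ms, filename FROM runtime_scheduler;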
 

Warning

One important thing to keep in mind is that the ProxySQL scheduler IS NOT part of the cluster synchronization, so we must configure it manually on each node. Once the script runs, any change made inside ProxySQL to the mysql_servers table will be kept in sync by the ProxySQL cluster. It is strongly recommended not to mix ProxySQL nodes that belong to the cluster with standalone ones, as this may cause unexpected behavior.

Conclusions

At this point, your PXC8 cluster architecture is fully running and will provide you with a very high level of HA and write isolation while preserving the read scaling capabilities.

In part two of this post, we will see the cluster in action and how it behaves in case of standard operations like backup or emergency cases like node crashes.

 

Continue in part 2 

 

 

 

 

What you may not know about random number generation in sysbench

Sysbench is a well-known and widely used tool to perform benchmarking. Originally written by Peter Zaitsev in the early 2000s, it has become a de facto standard for testing and benchmarking. Nowadays it is maintained by Alexey Kopytov and can be found on GitHub at https://github.com/akopytov/sysbench.

What I have noticed, though, is that while sysbench is widely used, some of its aspects are not really familiar to many: for instance, how easy it is to expand/modify the MySQL tests using the Lua extension, or the way it handles random number generation internally.

Why this article? 

I wrote this article to show how easy it can be to customize sysbench and make it do what you need. There are many different ways to extend sysbench, and one of them is the proper tuning of the random ID generation.

By default, sysbench comes with five different methods to generate random numbers. But very often (in fact, almost all the time) none is explicitly defined, and it is even rarer to see any parametrization where the method allows it.

If you wonder “Why should I care? Most of the time defaults are good”, well, this blog post is intended to help you understand why this may not be true.

 

Let us start.

What methods do we have in sysbench to generate numbers? Currently, the following are implemented, and you can easily check them by invoking sysbench with the --help option:

  • Special 
  • Gaussian
  • Pareto
  • Zipfian 
  • Uniform

 

Of these, Special is the default, with the following parameters:

  • rand-spec-iter=12   number of iterations for the special distribution [12]
  • rand-spec-pct=1    percentage of the entire range where 'special' values will fall in the special distribution [1]
  • rand-spec-res=75    percentage of 'special' values to use for the special distribution [75]

 

Given I like to have simple and easily reproducible tests and scenarios, all the following data has been collected using these sysbench commands:

  •  sysbench ./src/lua/oltp_read.lua --mysql_storage_engine=innodb --db-driver=mysql --tables=10 --table_size=100 prepare
  • sysbench ./src/lua/oltp_read_write.lua --db-driver=mysql --tables=10 --table_size=100   --skip_trx=off --report-interval=1 --mysql-ignore-errors=all --mysql_storage_engine=innodb --auto_inc=on --histogram --stats_format=csv --db-ps-mode=disable --threads=10 --time=60  --rand-type=XXX run

 

Feel free to play with the scripts, instructions, and data yourself (https://github.com/Tusamarco/blogs/tree/master/sysbench_random).

 

What is sysbench doing with the random number generator? Well, one of the ways it is used is to generate the IDs to be used in the query generation. So for instance in our case, it will look for numbers between 1 and 100, given we have 10 tables with 100 rows each.

What will happen if I run the sysbench RUN command as above and change only the --rand-type?

I have run the script and used the general log to collect/parse the generated IDs and count their frequencies, and here we go:

 

Special

Picture 1

Uniform

Picture 2

Zipfian

Picture 3

Pareto

Picture 4

Gaussian

Picture 5

 

Makes a lot of sense right? Sysbench is, in the end, doing exactly what we were expecting.

Let us check one by one and do some reasoning around them.

Special

The default is Special, so whenever you DO NOT specify a rand-type, sysbench will use Special. What Special does is use a very, very limited number of IDs for the query operations. Here we can say it mainly uses IDs 50-51, very sporadically a set between 44 and 56, and the others are practically irrelevant. Please note that the values chosen are in the middle of the available range, 1-100.

In this case, the spike is focused on two IDs, representing 2 percent of the sample. If I increase the number of records to one million, the spike still exists and is focused on 7,493 IDs, which is 0.74% of the sample. That is an even more restrictive percentage, although the IDs involved will probably span more than one page.

Uniform

As declared by the name, if we use Uniform, all the values are going to be used for the IDs and the distribution will be … Uniform.

Zipfian

The Zipf distribution, sometimes referred to as the zeta distribution, is a discrete distribution commonly used in linguistics, insurance, and the modeling of rare events. In this case, sysbench will use a set of numbers starting from the lower (1) and reducing the frequency in a very fast way while moving towards bigger numbers.

Pareto

Pareto applies the 80-20 rule (see https://en.wikipedia.org/wiki/Pareto_distribution), so the IDs we use are even less distributed and more concentrated in a small segment: 52 percent of all the IDs used were the number 1, while 73 percent of them fell within the first 10 numbers.

Gaussian

The Gaussian distribution (or normal distribution) is well known and familiar (see https://en.wikipedia.org/wiki/Normal_distribution) and is mostly used in statistics and prediction around a central factor. In this case, the IDs used follow a bell curve, starting from the mid-value and slowly decreasing towards the edges.

The point now is, what for?

Each one of the above cases represents something, and if we want to group them we can say that Pareto and Special can be focused on hot-spots. In that case, an application is using the same page/data over and over. This can be fine, but we need to know what we are doing and be sure we do not end up there by mistake.

For instance, IF we are testing the read efficiency of InnoDB page compression, we should avoid the Special or Pareto defaults, which means we must change the sysbench defaults. Say we have a 1TB dataset and a 30GB buffer pool: with those distributions we would query the same page over and over, a page that has already been read from disk, uncompressed, and kept available in memory.

In short, our test is a waste of time/effort.

The same applies if we need to check write efficiency: writing the same page over and over is not a good way to go.

What about testing the performance?

Well, again: what performance are we looking to identify, and against which case? It is important to understand that using a different rand-type WILL impact your test dramatically. So your “defaults should be good enough” may be totally wrong.

The following graphs represent the differences that exist when changing ONLY the rand-type value; test type, time, additional options, and the number of threads are exactly the same.

Latency differs significantly from type to type:

Picture 9    

Here I was doing reads and writes, and the data comes from Performance Schema, queried via the sys schema (sys.schema_table_statistics). As expected, Pareto and Special take much longer than the others, given the system (MySQL-InnoDB) is artificially suffering from contention on one hot spot.

Changing the rand-type affects not only latency but also the number of processed rows, as reported by the performance schema.

Picture 10

Picture 11

 

Given all the above, it is important to classify what we are trying to determine, and what we are testing.

If my scope is to test the performance of a system, at all levels, I may prefer to use Uniform, which will equally stress the dataset/DB Server/System and will have more chances to read/load/write all over the place.

If my scope is to identify how to deal with hot-spots, then probably Pareto and Special are the right choices.

But when doing that, do not go in blind with the defaults. Defaults may be good, but in my experience they are probably recreating edge cases; in that case, you can use the parameters to tune the distribution properly.

For instance, you may still want sysbench hammering the values in the middle, but you may want to relax the interval so that it does not look like a spike (the Special default) nor like a bell curve (Gaussian).

You can customize Special and have something like:

Picture 6

In this case, the IDs are still grouped and we still have possible contention, but with less impact from a single hot spot; the range of possible contention now covers a set of IDs that can span multiple pages, depending on the number of records per page.
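As an illustration only (the exact values are something to calibrate against your own dataset, and are not necessarily the ones used for the graph above), a relaxed Special run simply widens the 'special' window and lowers the percentage of hits that fall into it:

sysbench ./src/lua/oltp_read_write.lua --db-driver=mysql --tables=10 --table_size=100 --threads=10 --time=60 --rand-type=special --rand-spec-pct=15 --rand-spec-res=25 run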

Another possible test case is based on Partitioning. If, for instance, you want to test how your system will work with partitions and focus on the latest live data while archiving the old one, what can you do?

Easy! Remember the graph of the Pareto distribution? You can modify that as well to fit your needs.

Picture 8

Just by tuning the --rand-pareto-h value, you can easily achieve exactly what you were looking for and have sysbench focus the queries on the higher values of the IDs.
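For example (the value here is purely illustrative, not necessarily the one used for the graph; check sysbench --help for the exact name and default in your build), changing the h parameter from its 0.2 default reshapes the distribution, and values close to 1 should concentrate the IDs on the higher end of the range:

sysbench ./src/lua/oltp_read_write.lua --db-driver=mysql --tables=10 --table_size=100 --threads=10 --time=60 --rand-type=pareto --rand-pareto-h=0.95 run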

Zipfian can also be tuned, and while you cannot obtain an inversion as with Pareto, you can easily move from spiking on one value to an almost equally distributed scenario. A good example is the following:

Picture 7
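Something like the following could be a starting point (the option is named --rand-zipfian-exp in recent sysbench builds, but do verify with sysbench --help, and the value is only an example); lowering the exponent from the 0.8 default flattens the curve, while raising it makes the spike on the low IDs steeper:

sysbench ./src/lua/oltp_read_write.lua --db-driver=mysql --tables=10 --table_size=100 --threads=10 --time=60 --rand-type=zipfian --rand-zipfian-exp=0.3 run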

 

The last thing to keep in mind (and it may look like I am stating the obvious, but it is better to say it than to omit it) is that when you change the random-generation parameters, the performance will also change.

See latency details:

Picture 12

Here you can see in green the modified values compared with the original in blue.

 

Picture 13

 

Conclusion

At this point, you should have realized how easy it is to adjust the way sysbench handles random generation, and how effective that can be in matching your needs. Keep in mind that everything mentioned above is valid for any call that relies on the generator, such as when we use the sysbench.rand.default call:

local function get_id()
   return sysbench.rand.default(1, sysbench.opt.table_size)
end
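And if, in a custom Lua test, you want one specific lookup to ignore the global --rand-type and always be uniform, you can call the specific generator directly. A minimal sketch, assuming the sysbench.rand.uniform helper exposed by the Lua API:

local function get_uniform_id()
   -- always uniform, regardless of the --rand-type passed on the command line
   return sysbench.rand.uniform(1, sysbench.opt.table_size)
end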

Given that, do not just copy and paste strings from other people's articles; think about and understand what you need and how to achieve it.

Before running your tests, check the random method/settings to see what the distribution looks like and whether it fits your needs. To make this simpler for myself, I use this small test (https://github.com/Tusamarco/sysbench/blob/master/src/lua/test_random.lua), which runs and prints a quite clear representation of the ID distribution.

My recommendation is, identify what matches your needs and do your testing/benchmarking in the right way.

References

First and foremost, the reference is the great work Alexey Kopytov is doing on sysbench: https://github.com/akopytov/sysbench

Zipfian articles:

Pareto:

Percona article on how to extend tests in sysbench https://www.percona.com/blog/2019/04/25/creating-custom-sysbench-scripts/

The whole set of material I used for this article is on GitHub (https://github.com/Tusamarco/blogs/tree/master/sysbench_random).
